sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

Too slow "Column Pair Trends" #546

Closed: echatzikyriakidis closed this issue 5 months ago

echatzikyriakidis commented 5 months ago


Error Description

Hi @npatki,

Running Column Pair Trends from the Quality Report seems to be too slow.

My current example:

Generating report ...
(1/4) Evaluating Column Shapes: : 100%|██████████| 59/59 [03:39<00:00, 3.72s/it]
(2/4) Evaluating Column Pair Trends: : 0%| | 0/158 [00:00<?, ?it/s]

Suggestion:

Is it possible to change the library so that both single-table and multi-table reports (Quality, Diagnostic, and any others that exist) allow parallelization (either multithreading or multiprocessing)?

Every calculation of column shapes or trends in column pairs can run in parallel. No need for sequential computation, since each computation is independent. Right?
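To illustrate the suggestion, here is a minimal sketch of scoring independent column pairs concurrently. It is not how SDMetrics works internally: `pair_score` and `score_all_pairs` are hypothetical names, the toy overlap metric stands in for a real column-pair metric, and the rows are made-up sample data.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def pair_score(real, synthetic, col_a, col_b):
    """Hypothetical stand-in for a column-pair metric (not an SDMetrics function).

    Returns the overlap (0.0 to 1.0) between the sets of (col_a, col_b)
    value pairs seen in the real rows and in the synthetic rows.
    """
    real_pairs = {(row[col_a], row[col_b]) for row in real}
    synth_pairs = {(row[col_a], row[col_b]) for row in synthetic}
    union = real_pairs | synth_pairs
    return len(real_pairs & synth_pairs) / len(union) if union else 1.0


def score_all_pairs(real, synthetic, columns, max_workers=4):
    """Score every column pair concurrently; tasks only read the shared data."""
    pairs = list(combinations(columns, 2))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = pool.map(lambda p: pair_score(real, synthetic, *p), pairs)
        return dict(zip(pairs, scores))


# Made-up sample rows for illustration only.
real = [{"a": 1, "b": "x", "c": True}, {"a": 2, "b": "y", "c": False}]
synthetic = [{"a": 1, "b": "x", "c": False}, {"a": 2, "b": "y", "c": False}]
results = score_all_pairs(real, synthetic, ["a", "b", "c"])
```

Because no task writes to shared state, the results do not depend on execution order. For CPU-bound pure-Python metrics, a process pool would likely be needed to see a real speedup, since threads share the interpreter lock.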

Thanks!

echatzikyriakidis commented 5 months ago

OK, I have to say that after a change everything runs fast, but I don't know exactly why, because I have no clue how the library works.

Many of my fields are high-cardinality ones with almost unique text values. Until now, I had them set as pii=true and sdtype=categorical in the metadata. I tried removing sdtype completely, but the library failed, asking for the sdtype field. Then I changed it to sdtype=text and now it runs very fast. Why is that? I don't care about these fields; I just want the library to skip them and not take them into account for column shapes or column pair trends.

Is this the correct approach to skip them? I just need some validation.

Thanks.

npatki commented 5 months ago

Hi @echatzikyriakidis, appreciate the feedback.

Before going to parallelization (which we certainly can look into), it is helpful to look back at the metadata and make sure everything is set up correctly. SDMetrics uses metadata to make sure it is applying the correct metrics. For example, if you are storing HTTP codes such as 404 (not found), 500 (server error), 200 (ok), etc., then it should treat those as discrete categories instead of a numerical distribution.
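A small sketch of why the HTTP code example matters, using made-up sample values and only the Python standard library. It does not call SDMetrics; it just contrasts a numerical summary with a categorical one for the same column.

```python
from collections import Counter
from statistics import mean

# Made-up sample of HTTP status codes.
status_codes = [200, 200, 404, 500, 200, 404]

# Treated as numerical: the average is not a real status code and has no meaning.
numerical_summary = mean(status_codes)

# Treated as categorical: the relative frequency of each code describes the data.
total = len(status_codes)
categorical_summary = {
    code: count / total for code, count in Counter(status_codes).items()
}
```

Here the numerical summary is 318, a code that never occurs, while the categorical summary reports that half of the responses were 200.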

Here are the docs for what the metadata should look like.

Based on your description, here's what I think is going on:

  1. Sdtype categorical is not compatible with pii. If you mark something as categorical, SDMetrics will dutifully evaluate every single category, which could take a long time if you have a non-statistical, high-cardinality value such as a text description. See metadata spec.
  2. Any "other" sdtype (that is not categorical, numerical, datetime) is treated as a non-statistical value, meaning that it gets skipped. Therefore, when you apply text, it is skipping the column.
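The two points above can be sketched as a filter over the metadata. This is an illustration of the behavior described in this thread, not SDMetrics source code: the column names are made up, the helpers `mark_as_text` and `evaluated_columns` are hypothetical, and the dictionary layout (a `columns` mapping with an `sdtype` per column) is assumed to follow the metadata docs referenced above.

```python
from itertools import combinations

# Sdtypes named in the comment above as statistical; anything else is skipped.
STATISTICAL_SDTYPES = {"categorical", "numerical", "datetime"}

# Hypothetical single-table metadata; column names are made up for illustration.
metadata = {
    "columns": {
        "age": {"sdtype": "numerical"},
        "signup_date": {"sdtype": "datetime"},
        "plan": {"sdtype": "categorical"},
        "free_text_notes": {"sdtype": "categorical"},  # near-unique text values
    }
}


def mark_as_text(metadata, column_names):
    """Return a copy of the metadata with the given columns set to sdtype 'text'."""
    columns = {name: dict(spec) for name, spec in metadata["columns"].items()}
    for name in column_names:
        columns[name] = {"sdtype": "text"}
    return {**metadata, "columns": columns}


def evaluated_columns(metadata):
    """List the columns whose sdtype is one of the statistical types."""
    return [
        name
        for name, spec in metadata["columns"].items()
        if spec.get("sdtype") in STATISTICAL_SDTYPES
    ]


updated = mark_as_text(metadata, ["free_text_notes"])
before = evaluated_columns(metadata)
after = evaluated_columns(updated)
pairs_before = len(list(combinations(before, 2)))
pairs_after = len(list(combinations(after, 2)))
```

In this toy case, switching one column to text reduces the evaluated columns from 4 to 3 and the candidate column pairs from 6 to 3. The updated dictionary would then be passed as the metadata when generating the report.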

I would recommend continuing to use text to skip over columns that you do not want to include in the report. In the meantime, our team will look into improving the experience for indicating which columns to skip.

echatzikyriakidis commented 5 months ago

Hi @npatki,

That's exactly what I ended up doing. I set those high-cardinality text fields to sdtype=text and now everything is fast.

Thanks!

npatki commented 5 months ago

Thanks for confirming, @echatzikyriakidis. I left a feature request in #548 to make it easier (and more intuitive) to specify which columns you want to ignore when generating a report.