Closed echatzikyriakidis closed 5 months ago
OK, I have to say that after a change everything goes fast, but in reality I don't know exactly why because have no clue how the library works.
Many of my fields are high-cardinality ones with almost unique text values. So far, I had them as pii=true
in the metadata and sdtype=categorical
. I decided to remove sdtype completely but the library failed asking for sdtype field. Then I changed it to sdtype=text and now it runs very fast. Why is that? I don't care about these fields, I just want the library to skip them and not taking them account for column shapes or column pair trends.
Is this a correct approach to skip them? I just need some validation.
Thanks.
Hi @echatzikyriakidis, appreciate the feedback.
Before going to parallelization (which we certainly can look into), it is helpful to look back at metadata and ensure everything is running right. SDMetrics uses metadata to make sure it is applying the correct metrics. For example, if you are storing HTTP codes such as 404
(error not found), 500
(server error), 200
(ok), etc. then it should make sure to treat those as discrete categories instead of a numerical distribution.
Here are the docs for what the metadata should look like.
Based on your description, here's what I think is going on:
categorical
is not compatible with pii
. If you mark something as categorical, SDMetrics will dutifully evaluate every single category, which could take a long time if you have non-statistical, high cardinality value such as a text description. See metadata spec. text
, it is skipping the column.I would recommend continuing to use text
to skip over columns that you do not want to include in the report. In the meantime, our team look into improving the experience for indicating which columns to skip.
Hi @npatki,
That's exactly what I ended up doing. I set those high cardinality text fields with sdtype=text and now everything is fast.
Thanks!
Thanks for confirming @echatzikyriakidis. I left a feature request in #548 so make it easier (and more intuitive) to specify which columns you want to ignore when generating a report.
Environment Details
Please indicate the following details about the environment in which you found the bug:
Error Description
Hi @npatki,
It seems that , it is too slow when running Column Pair Trends from Quality Report.
My current example:
Generating report ... (1/4) Evaluating Column Shapes: : 100%|██████████| 59/59 [03:39<00:00, 3.72s/it] (2/4) Evaluating Column Pair Trends: : 0%| | 0/158 [00:00<?, ?it/s]
Suggestion:
Is it possible to change the library so that both single-table and multi-table reports (Quality+Diagnostic and any other that exists) to allow parallelization (either multithreading or multiprocessing) ?
Every calculation of column shapes or trends in column pairs can run in parallel. No need for sequential computation, since each computation is independent. Right?
Thanks!