sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License
201 stars 45 forks

Better way to ignore columns when running a report #548

Open npatki opened 5 months ago

npatki commented 5 months ago

Problem Description

As described in #546, I may want to ignore certain columns in a dataset when running a report (quality or diagnostic). It is not completely intuitive how to do this.

  1. The metadata requires that all columns be described. So you cannot ask a report to ignore a column simply by removing it from the metadata.
  2. It is unclear from the metadata spec which columns will be ignored and which will be used for evaluation.

Actual Solution: If you mark a column with an "other" sdtype (anything that is not categorical, numerical, datetime, etc.), then SDV will assume it is non-statistical PII and will therefore ignore the column. For example, using sdtype 'text' is sufficient to get a report to ignore the column.
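To make the workaround concrete, here is a minimal sketch. The helper name `mark_columns_ignored` is hypothetical (it is not part of SDMetrics), and the metadata is assumed to be a single-table dictionary with a top-level "columns" key. The helper returns a copy in which the chosen columns are re-labelled with sdtype 'text', so the original metadata stays untouched.

```python
import copy


def mark_columns_ignored(metadata, columns, sdtype="text"):
    """Hypothetical helper: return a copy of the metadata dict in which
    each listed column is re-labelled with a non-statistical sdtype."""
    updated = copy.deepcopy(metadata)
    for name in columns:
        if name not in updated["columns"]:
            raise KeyError(f"Column {name!r} is not described in the metadata")
        # Replace the whole entry so sdtype-specific keys (for example a
        # datetime format) do not linger on a column that is now 'text'.
        updated["columns"][name] = {"sdtype": sdtype}
    return updated


# Illustrative metadata; the column names are made up for this example.
metadata = {
    "columns": {
        "age": {"sdtype": "numerical"},
        "notes": {"sdtype": "categorical"},
    }
}

report_metadata = mark_columns_ignored(metadata, ["notes"])

# The modified copy would then be passed to the report instead of the
# original metadata, e.g.:
#   QualityReport().generate(real_data, synthetic_data, report_metadata)
```

The downside is the one described above: nothing in the metadata itself signals that 'text' is being used purely to skip a column.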

Expected behavior

The metadata spec should probably remain as-is, because in the future we may decide to add metrics for specific sdtypes.

However, perhaps the report itself should allow you to specify which columns to ignore?
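Until something like that exists, one possible stopgap is to trim the metadata and both datasets to the same set of columns before generating the report. The sketch below is an assumption-laden illustration, not an SDMetrics feature: `select_columns` is a hypothetical helper, the metadata is assumed to be a single-table dictionary with a "columns" key and an optional "primary_key" key, and I have not verified that reports accept every trimmed metadata.

```python
import copy


def select_columns(metadata, keep):
    """Hypothetical helper: return a copy of the metadata dict that
    describes only the columns listed in `keep`."""
    missing = [name for name in keep if name not in metadata["columns"]]
    if missing:
        raise KeyError(f"Columns not described in the metadata: {missing}")

    subset = copy.deepcopy(metadata)
    subset["columns"] = {name: subset["columns"][name] for name in keep}

    # Assumption: a primary key that points at a dropped column should be
    # removed so the trimmed metadata stays self-consistent.
    if subset.get("primary_key") not in subset["columns"]:
        subset.pop("primary_key", None)
    return subset


# Illustrative metadata; the column names are made up for this example.
metadata = {
    "primary_key": "id",
    "columns": {
        "id": {"sdtype": "id"},
        "age": {"sdtype": "numerical"},
        "notes": {"sdtype": "categorical"},
    },
}

subset_metadata = select_columns(metadata, ["age"])
kept = list(subset_metadata["columns"])

# With pandas DataFrames, the data would be trimmed the same way before
# running the report, e.g.:
#   QualityReport().generate(real_data[kept], synthetic_data[kept], subset_metadata)
```

Because the data and metadata are trimmed together, every remaining column is still described, which keeps the "all columns must be described" requirement satisfied.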

srinify commented 1 month ago

Another use case: the visualization phase after a Quality Report is generated.

If a table has a large number of columns, the generated visualizations become hard to interact with and use for insight gathering. This is an example from the loan_applications dataset:

[Image: Screenshot 2024-08-16 at 10 59 00 AM]

If I want to focus on ~10 columns in the Quality Report, there is no easy way to do this natively. Potential solutions here could manifest as either: