Better way to ignore columns when running a report

sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.

MIT License

201 stars, 45 forks

Problem Description

As described in #546, I may want to ignore certain columns in a dataset when running a report (quality or diagnostic). It is not completely intuitive how to do this.

The metadata requires that all columns be described. So you cannot ask a report to ignore a column simply by removing it from the metadata.
It is unclear from the metadata spec which columns will be ignored and which will be used for evaluation.

Current workaround: If you mark a column with an "other" sdtype (anything that is not categorical, numerical, datetime, etc.), then SDV will assume the column is non-statistical PII and will ignore it. For example, setting the sdtype to 'text' is sufficient to get a report to ignore the column.
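
A minimal sketch of that workaround, for anyone landing here. It assumes the dictionary-style single-table metadata (a 'columns' mapping with an 'sdtype' per column); the helper name `mark_columns_ignored` and the example column names are made up for illustration and are not part of SDMetrics.

```python
import copy


def mark_columns_ignored(metadata, columns_to_ignore, ignored_sdtype='text'):
    """Return a copy of the metadata with the given columns set to a non-statistical sdtype.

    Hypothetical helper, not an SDMetrics function. The original metadata is left untouched.
    """
    updated = copy.deepcopy(metadata)
    for name in columns_to_ignore:
        if name not in updated['columns']:
            raise KeyError(f'Column {name!r} is not described in the metadata')
        # Replace the whole column entry so no statistical settings are left behind.
        updated['columns'][name] = {'sdtype': ignored_sdtype}
    return updated


# Illustrative metadata; the column names are invented for this example.
metadata = {
    'columns': {
        'age': {'sdtype': 'numerical'},
        'plan': {'sdtype': 'categorical'},
        'notes': {'sdtype': 'categorical'},
    }
}

report_metadata = mark_columns_ignored(metadata, ['notes'])
print(report_metadata['columns']['notes'])  # {'sdtype': 'text'}
```

The resulting `report_metadata` would then be passed to the report in place of the original metadata. The data itself does not need to change.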

Expected behavior

The metadata spec should probably remain as-is, because in the future we may decide to add metrics for specific sdtypes.

However, perhaps the report itself should allow you to specify which columns to ignore?
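
To make the request concrete, here is a rough sketch of what the call site could look like. Everything in it is hypothetical: `generate_ignoring` and its `ignore_columns` argument do not exist in SDMetrics, and the sketch only assumes that the report object exposes `generate(real_data, synthetic_data, metadata)`. Under the hood it leans on the 'text' sdtype behavior described above. The stand-in report class is there only so the snippet runs on its own.

```python
import copy


def generate_ignoring(report, real_data, synthetic_data, metadata, ignore_columns=()):
    """Hypothetical wrapper: run a report while ignoring the listed columns."""
    report_metadata = copy.deepcopy(metadata)
    for name in ignore_columns:
        if name not in report_metadata['columns']:
            raise KeyError(f'Column {name!r} is not described in the metadata')
        # Assumption from this issue: a 'text' sdtype makes the report skip the column.
        report_metadata['columns'][name] = {'sdtype': 'text'}
    report.generate(real_data, synthetic_data, report_metadata)
    return report


class RecordingReport:
    """Stand-in for a real report; it only records the metadata it was given."""

    def generate(self, real_data, synthetic_data, metadata):
        self.seen_metadata = metadata


metadata = {
    'columns': {
        'age': {'sdtype': 'numerical'},
        'notes': {'sdtype': 'categorical'},
    }
}

report = generate_ignoring(RecordingReport(), None, None, metadata, ignore_columns=['notes'])
print(report.seen_metadata['columns']['notes'])  # {'sdtype': 'text'}
```

With SDMetrics installed, the stand-in would be replaced by a `QualityReport()` or `DiagnosticReport()` from `sdmetrics.reports.single_table`, and the `None` placeholders by the real and synthetic DataFrames.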

Better way to ignore columns when running a report #548