sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

QualityReport with `CorrelationSimilarity` to a column that contains only `nans` generates a `ValueError` #351

Open pvk-developer opened 1 year ago

pvk-developer commented 1 year ago

Error Description

We expect the quality report to be fault tolerant: if a single metric crashes during computation, the report should catch the error, report NaN for that metric, and continue with the other metrics. However, when a column contains only NaN or null values, CorrelationSimilarity raises the following error:

ValueError: x and y must have length at least 2.

This ValueError is not caught by the quality report, whose try/except block only handles IncomputableMetricError: https://github.com/sdv-dev/SDMetrics/blob/8b79accdf1ceb83780b20e8d53a66e0b7f68a54e/sdmetrics/reports/single_table/quality_report.py#L76-L81
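To illustrate the expected behaviour, here is a minimal sketch of a fault-tolerant metric loop. It is not the SDMetrics implementation: `run_metrics`, `failing_metric` and `constant_metric` are hypothetical names, `IncomputableMetricError` is a local stand-in for the library exception, and the choice to turn any other exception into a warning plus NaN is an assumption about how the report could behave.

```python
import math
import warnings


class IncomputableMetricError(Exception):
    """Local stand-in for the SDMetrics exception of the same name."""


def run_metrics(metrics, real_data, synthetic_data):
    """Run every metric; warn and record NaN when one fails unexpectedly."""
    results = {}
    for name, compute in metrics.items():
        try:
            results[name] = compute(real_data, synthetic_data)
        except IncomputableMetricError:
            # Metric is not compatible with this dataset.
            results[name] = {}
        except Exception as error:
            # Assumption: any other failure is surfaced as a warning
            # and the metric result is reported as NaN.
            warnings.warn(f'Metric {name} failed: {error}')
            results[name] = math.nan
    return results


def failing_metric(real_data, synthetic_data):
    """Hypothetical metric that fails like CorrelationSimilarity does here."""
    raise ValueError('x and y must have length at least 2.')


def constant_metric(real_data, synthetic_data):
    """Hypothetical metric that always succeeds."""
    return 1.0


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    results = run_metrics(
        {'failing': failing_metric, 'constant': constant_metric},
        real_data=None,
        synthetic_data=None,
    )
```

With this structure the failing metric ends up as NaN in `results`, one warning is recorded in `caught`, and the remaining metric is still computed.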

Steps to reproduce

import pandas as pd
import numpy as np

real = pd.DataFrame({'a': [np.nan, np.nan, np.nan], 'b': [1, 2, 3]})
synth = pd.DataFrame({'a': [0, 1, 2], 'b': [1, 2, 3]})

from sdmetrics.reports.single_table import QualityReport
report = QualityReport()
metadata = {'columns': {'a': {'sdtype': 'numerical'}, 'b': {'sdtype': 'numerical'}}}

report.generate(real, synth, metadata)
Creating report:  50%|█████     | 2/4 [00:00<00:00, 329.25it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

----> 1 report.generate(real, synth, metadata)

File ~/Projects/SDV/SDMetrics/sdmetrics/reports/single_table/quality_report.py:77, in QualityReport.generate(self, real_data, synthetic_data, metadata, verbose)
     75 for metric in tqdm.tqdm(metrics, desc='Creating report', disable=(not verbose)):
     76     try:
---> 77         self._metric_results[metric.__name__] = metric.compute_breakdown(
     78             real_data, synthetic_data, metadata)
     79     except IncomputableMetricError:
     80         # Metric is not compatible with this dataset.
     81         self._metric_results[metric.__name__] = {}

File ~/Projects/SDV/SDMetrics/sdmetrics/single_table/multi_column_pairs.py:129, in MultiColumnPairsMetric.compute_breakdown(cls, real_data, synthetic_data, metadata, **kwargs)
    127     real = real_data[list(sorted_columns)]
    128     synthetic = synthetic_data[list(sorted_columns)]
--> 129     breakdown[sorted_columns] = cls.column_pairs_metric.compute_breakdown(
    130         real, synthetic, **kwargs)
    132 return breakdown

File ~/Projects/SDV/SDMetrics/sdmetrics/column_pairs/statistical/correlation_similarity.py:103, in CorrelationSimilarity.compute_breakdown(cls, real_data, synthetic_data, coefficient)
     99 else:
    100     raise ValueError(f'requested coefficient {coefficient} is not valid. '
    101                      'Please choose either Pearson or Spearman.')
--> 103 correlation_real, _ = correlation_fn(real_data[column1], real_data[column2])
    104 correlation_synthetic, _ = correlation_fn(synthetic_data[column1], synthetic_data[column2])
    106 if np.isnan(correlation_real) or np.isnan(correlation_synthetic):

File ~/.virtualenvs/SDMetrics/lib/python3.8/site-packages/scipy/stats/_stats_py.py:4411, in pearsonr(x, y, alternative)
   4408     raise ValueError('x and y must have the same length.')
   4410 if n < 2:
-> 4411     raise ValueError('x and y must have length at least 2.')
   4413 x = np.asarray(x)
   4414 y = np.asarray(y)

ValueError: x and y must have length at least 2.

npatki commented 1 year ago

Requirements:

  1. The base CorrelationSimilarity metric should raise an error when a column contains only NaN values, since the correlation is undefined in that case.
  2. The Quality Report should catch that error, potentially surface it as a warning, and then move on to the other metrics. The report should not crash.

I believe (2) will be addressed by the updated Column Pair Trends property, as described in issues #356 (single table) and #358 (multi table).
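
As a rough illustration of requirement (1), the base metric could validate its inputs before computing the correlation. The sketch below uses plain Python lists rather than pandas so it stays self-contained. `drop_nan_pairs` is a hypothetical helper, not an SDMetrics function, and the exception type and message are assumptions.

```python
import math


def drop_nan_pairs(x, y):
    """Keep only the rows where both values are present.

    Hypothetical helper: raises a descriptive error when fewer than two
    complete rows remain, since no correlation can be computed then.
    """
    pairs = [
        (a, b) for a, b in zip(x, y)
        if not (math.isnan(a) or math.isnan(b))
    ]
    if len(pairs) < 2:
        raise ValueError(
            'Correlation is undefined: fewer than 2 rows remain after '
            'dropping NaN values.'
        )
    xs, ys = zip(*pairs)
    return list(xs), list(ys)


nan = math.nan

# An all-NaN column, like column 'a' in the reproduction, fails early
# with a descriptive message.
try:
    drop_nan_pairs([nan, nan, nan], [1, 2, 3])
    message = None
except ValueError as error:
    message = str(error)

# A column with some valid values keeps only its complete rows.
clean_x, clean_y = drop_nan_pairs([0, nan, 2], [1, 2, 3])
```

Raising a clear error at this point would give the report a single, well-defined failure to catch and surface as a warning.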