sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

Implement MSAS #199

Open LiFaytheGoblin opened 2 years ago

LiFaytheGoblin commented 2 years ago

Problem Description

The metrics currently implemented in SDV do not specifically measure the quality of sequences generated with CPAR.

Expected behavior

MSAS is a metric for sequential data quality, detailed in http://arxiv.org/abs/2207.14406. It should be implemented in SDV.

npatki commented 2 years ago

Thanks for filing @LiFaytheGoblin. We'll keep this open to track as we make progress on it.

Just a note that MSAS refers to our overall algorithm for computing sequential data quality, which works in the following steps:

  1. Compute a metric for every sequence in the real data to get a distribution X
  2. Compute the same metric for every sequence in the synthetic data to get a distribution X'
  3. Use the KSComplement test to compare the distributions X and X'
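
The three steps above can be sketched in plain Python. This is only an illustration, not the SDMetrics implementation: the function names (`per_sequence_statistic`, `ks_complement`, `msas_score`) and the `(sequence_id, value)` row layout are hypothetical, and the KSComplement step is assumed here to be 1 minus the two-sample Kolmogorov-Smirnov statistic, computed by hand rather than through the library.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean


def per_sequence_statistic(rows, statistic):
    """Steps 1 and 2: group (sequence_id, value) rows and summarize each sequence."""
    sequences = defaultdict(list)
    for sequence_id, value in rows:
        sequences[sequence_id].append(value)
    return [statistic(values) for values in sequences.values()]


def ks_complement(real_values, synthetic_values):
    """Step 3 (assumed form): 1 minus the two-sample KS statistic."""
    real_sorted = sorted(real_values)
    synthetic_sorted = sorted(synthetic_values)
    max_gap = 0.0
    # The largest gap between two empirical CDFs occurs at an observed value.
    for point in real_sorted + synthetic_sorted:
        real_cdf = bisect_right(real_sorted, point) / len(real_sorted)
        synthetic_cdf = bisect_right(synthetic_sorted, point) / len(synthetic_sorted)
        max_gap = max(max_gap, abs(real_cdf - synthetic_cdf))
    return 1.0 - max_gap


def msas_score(real_rows, synthetic_rows, statistic):
    """Compare the per-sequence distributions X (real) and X' (synthetic)."""
    x = per_sequence_statistic(real_rows, statistic)
    x_prime = per_sequence_statistic(synthetic_rows, statistic)
    return ks_complement(x, x_prime)


# Toy data, made up for illustration: two sequences per table.
real_rows = [("a", 1.0), ("a", 3.0), ("b", 2.0), ("b", 4.0)]
shifted_rows = [("s1", 10.0), ("s1", 12.0), ("s2", 20.0), ("s2", 22.0)]

same_score = msas_score(real_rows, real_rows, mean)        # identical distributions
shifted_score = msas_score(real_rows, shifted_rows, mean)  # non-overlapping means
length_score = msas_score(real_rows, shifted_rows, len)    # same sequence lengths
```

Because the statistic is passed in as a function, the same comparison can be reused for any per-sequence summary.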

Various metrics can be used in step 1. In the paper we used the length, mean, median and standard deviation of each sequence, as well as the difference between the value at row n and the value t steps later, at row n+t.
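
As a rough sketch, the per-sequence statistics named above could look like this in Python. The names `step_difference` and `summarize_sequence` are hypothetical, and several details are assumptions rather than anything stated in this thread: the standard deviation is taken as the population form, and the row difference uses one fixed pair of `n` and `t` so that each sequence yields a single number.

```python
from statistics import mean, median, pstdev


def step_difference(values, n=0, t=1):
    """Difference between row n+t and row n, or None if the sequence is too short.

    The choice of n and t is an assumption; pick values that suit the data.
    """
    if n < 0 or t < 0 or n + t >= len(values):
        return None
    return values[n + t] - values[n]


def summarize_sequence(values, n=0, t=1):
    """Compute one value per statistic for a single non-empty numeric sequence."""
    return {
        "length": len(values),
        "mean": mean(values),
        "median": median(values),
        "std": pstdev(values),  # population standard deviation (assumed)
        "step_difference": step_difference(values, n=n, t=t),
    }


# Toy sequence, made up for illustration.
summary = summarize_sequence([1.0, 3.0, 5.0, 7.0], n=0, t=2)
```

Each entry of the resulting dictionary is one candidate for the statistic computed per sequence in step 1; sequences too short for the chosen `n` and `t` return `None` and would need to be filtered out before comparing distributions.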

Are there any particular metrics that are more or less important to your use case?

npatki commented 2 years ago

FYI, some metrics that will use MSAS are actively being discussed in #198.