sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

Add metrics to evaluate fidelity of longitudinal datasets #198

Open ashafquat-mdsol opened 2 years ago

ashafquat-mdsol commented 2 years ago

Suggested tests

Definitions:
SubjectID = identifier for a subject
Event = identifier that defines the type of event
Start date = date the event starts
End date = date the event ends
Event Duration = days between End date and Start date
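As a rough illustration of these definitions, one event row could be modeled as below. The `EventRecord` class and its field names are hypothetical (they do not come from SDMetrics or from the issue), and the duration is assumed to be End date minus Start date in whole days.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class EventRecord:
    """One event for one subject (hypothetical layout, for illustration only)."""
    subject_id: str   # SubjectID: identifier for a subject
    event: str        # Event: identifier for the type of event
    start_date: date  # Start date
    end_date: date    # End date

    @property
    def duration_days(self) -> int:
        # Event Duration, assumed here to be End date minus Start date in days.
        return (self.end_date - self.start_date).days


record = EventRecord("subject-1", "hospitalization", date(2022, 1, 1), date(2022, 1, 4))
print(record.duration_days)
```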

npatki commented 2 years ago

Hi @ashafquat-mdsol, thanks for filing this issue. I suggest we align the terminology to what the PAR model uses:

I see many similarities between the metrics you describe and the MSAS algorithm, described in the most recent PAR model paper (http://arxiv.org/abs/2207.14406). At a high level, the algorithm works in the following way:

  1. Compute a metric for every sequence in the real data to get a distribution X
  2. Compute the same metric for every sequence in the synthetic data to get a distribution X’
  3. Return the KSComplement score, which quantifies the similarity between distributions X and X’
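The three steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the PAR paper's or SDMetrics' implementation: the function names, the row-as-dict layout, and the `SubjectID` key are hypothetical, and the KS complement is assumed to be 1 minus the two-sample Kolmogorov-Smirnov statistic.

```python
from bisect import bisect_right
from collections import defaultdict


def ks_complement(real_values, synthetic_values):
    """Return 1 minus the two-sample KS statistic (1.0 means identical empirical CDFs)."""
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)
    if not real_sorted or not synth_sorted:
        raise ValueError("both samples must be non-empty")
    # The largest gap between two empirical CDFs is reached at an observed value.
    max_gap = 0.0
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


def per_sequence_statistic(rows, sequence_key, column, statistic):
    """Steps 1 and 2: group rows by sequence and compute one value per sequence."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[sequence_key]].append(row[column])
    return [statistic(values) for values in groups.values()]


def sequence_similarity(real_rows, synthetic_rows, sequence_key, column, statistic):
    """Step 3: compare the per-sequence distributions X and X' with the KS complement."""
    x = per_sequence_statistic(real_rows, sequence_key, column, statistic)
    x_prime = per_sequence_statistic(synthetic_rows, sequence_key, column, statistic)
    return ks_complement(x, x_prime)


# Hypothetical toy data: two sequences each in the real and synthetic tables.
real_rows = [
    {"SubjectID": "a", "Duration": 1},
    {"SubjectID": "a", "Duration": 2},
    {"SubjectID": "b", "Duration": 5},
]
synthetic_rows = [
    {"SubjectID": "x", "Duration": 3},
    {"SubjectID": "y", "Duration": 4},
    {"SubjectID": "y", "Duration": 6},
]
print(sequence_similarity(real_rows, synthetic_rows, "SubjectID", "Duration", len))
```

Passing `len` as the statistic compares sequence lengths; swapping in `max`, `min`, or a mean function would compare a different per-sequence statistic with the same code path.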

Ideally, we should create one issue per metric, as we always aim to have each pull request close a specific issue. I suggest we start with the sequence length distribution metric, as it seems the simplest of the ones you have listed so far. We can file a new issue for it, and I can provide feedback on the API (metric name, parameter names, etc.) before implementation. Does that sound good?

Based on how that goes, we can repeat the process for the other metrics.

ashafquat-mdsol commented 2 years ago

@npatki that sounds perfect. We wanted to reach alignment on which metrics to implement, so we can use this issue to discuss the full set. I will create a separate ticket for the sequence length distribution metric, and we can go from there.

ashafquat-mdsol commented 2 years ago

Just created an issue for the sequence length distribution metric: https://github.com/sdv-dev/SDMetrics/issues/203