Add metrics to evaluate fidelity of longitudinal datasets

ashafquat-mdsol commented 2 years ago

Suggested tests

Conditional probability distribution in simulated vs. original - the conditional probability of Event B given Event A, P(B|A), is calculated as the probability of seeing Event B within X days of Event A’s start.
- Differences within the probability distributions can be computed.
- New conditional probabilities that are not seen in the original can be flagged as artifacts, and conditional probabilities that exist in the original but not in the simulated data can be flagged as missing.
Bag of words - Event frequencies can be used to define a features matrix per person and centroids created for the original dataset. The number of people assigned to each of the centroids in the original dataset vs simulated can be compared using a distance metric.
Event durations -
- t-test/KS test can be used to compare the distribution of event durations in original vs. simulated. Where the differences are significant these can be flagged.
- Any event durations missing from the simulated data can be flagged.
- Mean, median, percentiles, standard deviation, min, max of event durations per event type are calculated and plotted on a line plot. The MSE/R2 per plot quantifies the alignment between original and simulated.
Time to event analysis - This test requires an event to be marked as a reference event (e.g. the first event that is recorded for a subject). The reference event must occur in each subject’s timeline.
- Time to event for event X is calculated as time between reference event occurring and event X occurring.
- Mean, median, percentiles, standard deviation, min, max of time to event per event type are calculated and plotted on a line plot. The MSE/R2 per plot quantifies the alignment between original and simulated.
- t-test/KS test can be used to compare the distribution of time to event in original vs. simulated per event type. Where the differences are significant these can be flagged.
- Survival probability/Log-rank test per event type can be used to identify differences in original vs simulated. (For the model, Event = 1 if Event X (e.g. Death) occurs in the subject's timeline, 0 otherwise. Time to Event = time between the reference event and Event X occurring if Event = 1; otherwise, time between the reference event and the last event observed.)
Event sequence length distribution - where event sequence length is the number of events recorded for each person. Distance in the distribution of event sequence length between original and simulated can then be calculated.
N-gram frequency - An event sequence can be generated per person by listing the events experienced by that person/unit, ordered by event start date. N-grams can then be computed from this sequence of strings/events per person. Fidelity is quantified using MSE/R2, comparing N-gram frequencies in the simulated vs original datasets for varying values of N.
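
To make the N-gram test concrete, here is a minimal sketch in plain Python. The function names (`ngram_frequencies`, `ngram_mse`), the tuple layout `(subject_id, event, start)` and the toy data are assumptions made for illustration, not an existing SDMetrics API. N-grams that are missing from one dataset are treated as having frequency 0 there.

```python
from collections import Counter


def ngram_frequencies(rows, n):
    """Relative frequency of each event n-gram, pooled over all subjects.

    rows: iterable of (subject_id, event, start) tuples (hypothetical layout).
    """
    # Build one event sequence per subject, ordered by start date.
    sequences = {}
    for subject_id, event, _start in sorted(rows, key=lambda r: (r[0], r[2])):
        sequences.setdefault(subject_id, []).append(event)

    # Count every window of n consecutive events.
    counts = Counter()
    for seq in sequences.values():
        counts.update(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def ngram_mse(original, simulated, n):
    """Mean squared error between n-gram frequencies of the two datasets."""
    f_orig = ngram_frequencies(original, n)
    f_sim = ngram_frequencies(simulated, n)
    grams = set(f_orig) | set(f_sim)
    if not grams:
        return 0.0
    return sum((f_orig.get(g, 0.0) - f_sim.get(g, 0.0)) ** 2 for g in grams) / len(grams)


# Toy data: start dates are given as integer day offsets.
original = [("s1", "A", 1), ("s1", "B", 2), ("s1", "C", 3), ("s2", "A", 1), ("s2", "B", 5)]
simulated = [("s1", "A", 1), ("s1", "B", 2), ("s2", "B", 1), ("s2", "A", 2)]

bigram_mse = ngram_mse(original, simulated, 2)
```

Repeating the call for several values of `n` gives one score per N, which could then be reported individually or averaged.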

Definitions
- SubjectID = Identifier for a subject
- Event = An identifier to define the type of event
- Start date = Date of event starting
- End date = Date of event ending
- Event Duration = Days between End date and Start date
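
Using these definitions, the event duration comparison could look roughly like the sketch below. It is written in plain Python and only computes the two-sample KS statistic (the largest gap between the two empirical CDFs), without a p-value; `scipy.stats.ks_2samp` would return both. The helper names, the row layout `(subject_id, event, start_date, end_date)` and the example rows are assumptions for illustration only.

```python
from bisect import bisect_right
from datetime import date


def event_durations(rows, event):
    """Durations in days (End date - Start date) for one event type."""
    return [(end - start).days for _sid, ev, start, end in rows if ev == event]


def ks_statistic(x, y):
    """Two-sample KS statistic: max distance between the empirical CDFs."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in set(xs) | set(ys):
        cdf_x = bisect_right(xs, v) / len(xs)  # share of x values <= v
        cdf_y = bisect_right(ys, v) / len(ys)  # share of y values <= v
        d = max(d, abs(cdf_x - cdf_y))
    return d


# Hypothetical rows: (SubjectID, Event, Start date, End date)
original = [
    ("s1", "AE", date(2020, 1, 1), date(2020, 1, 3)),
    ("s1", "Visit", date(2020, 1, 10), date(2020, 1, 10)),
    ("s2", "AE", date(2020, 1, 1), date(2020, 1, 5)),
    ("s3", "AE", date(2020, 2, 1), date(2020, 2, 7)),
]
simulated = [
    ("t1", "AE", date(2020, 1, 1), date(2020, 1, 3)),
    ("t2", "AE", date(2020, 1, 1), date(2020, 1, 9)),
]

orig_durations = event_durations(original, "AE")
sim_durations = event_durations(simulated, "AE")
ae_ks = ks_statistic(orig_durations, sim_durations)
```

The same `ks_statistic` helper would apply to time to event values once a reference event has been chosen per subject.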

npatki commented 2 years ago

Hi @ashafquat-mdsol, thanks for filing this issue. I suggest we align the terminology to what the PAR model uses:

Entity columns: Identify which row belongs to which sequence
Sequence index: (Optional) Identifies the order of the rows within a sequence, e.g. a datetime column

I see many similarities between the metrics you describe and the MSAS algorithm, described in the most recent PAR model paper (http://arxiv.org/abs/2207.14406). At a high level, the algorithm works in the following way:

Compute a metric for every sequence in the real data to get a distribution X
Compute the same metric for every sequence in the synthetic data to get a distribution X’
Return the KSComplement score, which quantifies the similarity between distributions X and X’
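
As a rough illustration of these three steps, the sketch below uses sequence length as the per-sequence metric and takes the score to be 1 minus the two-sample KS statistic, which is my understanding of how KSComplement is defined. This is plain Python, not the SDMetrics implementation, and the function names and toy entity columns are assumptions.

```python
from bisect import bisect_right
from collections import Counter


def sequence_lengths(entity_column):
    """Number of rows per entity, i.e. the length of each sequence."""
    return list(Counter(entity_column).values())


def ks_complement(x, y):
    """1 - KS statistic: 1.0 means identical empirical distributions."""
    xs, ys = sorted(x), sorted(y)
    d = max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in set(xs) | set(ys)
    )
    return 1.0 - d


# Toy entity columns: one entry per row, the value identifies the sequence.
real_entities = ["a", "a", "a", "b", "b", "c"]
synthetic_entities = ["x", "x", "y", "y", "z", "z"]

real_lengths = sequence_lengths(real_entities)            # distribution X
synthetic_lengths = sequence_lengths(synthetic_entities)  # distribution X'
score = ks_complement(real_lengths, synthetic_lengths)
```

Swapping `sequence_lengths` for another per-sequence statistic would give the other variants without changing the scoring step.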

Ideally, we should create 1 issue per metric, as we always aim to have each pull request close a specific issue. I suggest that we can start with the sequence length distribution metric, as it seems the simplest of the ones you have listed so far. We can file a new issue for it, and I can provide some feedback about the API (metric, parameter name, etc.) before implementation. Does that sound good?

Based on how that goes, we can repeat the process for the other metrics.

ashafquat-mdsol commented 2 years ago

@npatki that sounds perfect. We wanted to reach an alignment on the metrics to implement so we can use this issue to discuss the set we want to implement. I will create a separate ticket for the sequence length distribution metric and we can definitely go from there.

ashafquat-mdsol commented 2 years ago

Just created this issue: https://github.com/sdv-dev/SDMetrics/issues/203 for the sequence length distribution.

Add metrics to evaluate fidelity of longitudinal datasets #198