Open ashafquat-mdsol opened 2 years ago
Hi @ashafquat-mdsol, thanks for filing this issue. I suggest we align the terminology to what the PAR model uses:
I see many similarities between the metrics you describe and the MSAS algorithm, described in the most recent PAR model paper (http://arxiv.org/abs/2207.14406). At a high level, the algorithm works in the following way:
Ideally, we should create 1 issue per metric, as we always aim to have each pull request close a specific issue. I suggest that we can start with the sequence length distribution metric, as it seems the simplest of the ones you have listed so far. We can file a new issue for it, and I can provide some feedback about the API (metric, parameter name, etc.) before implementation. Does that sound good?
Based on how that goes, we can repeat the process for the other metrics.
@npatki that sounds perfect. We wanted to reach an alignment on the metrics to implement so we can use this issue to discuss the set we want to implement. I will create a separate ticket for the sequence length distribution metric and we can definitely go from there.
Just created this issue: https://github.com/sdv-dev/SDMetrics/issues/203 for the sequence length distribution.
Suggested tests
Conditional probability distribution in simulated vs. original - the conditional probability of Event A|B is calculated as the probability of seeing Event B within X days of Event A’s start.
Bag of words - Event frequencies can be used to define a per-person feature matrix, and centroids can be created for the original dataset. The number of people assigned to each centroid in the original vs. simulated dataset can then be compared using a distance metric.
Event durations -
Time to event analysis - This test requires an event to be marked as a reference event (e.g. the first event that is recorded for a subject). The reference event occurs in each subject’s timeline.
Event sequence length distribution - the event sequence length is the number of events recorded for each person. The distance between the sequence length distributions of the original and simulated datasets can then be calculated.
N-grams frequency - An event sequence can be generated per person by listing the events experienced by that person/unit, ordered by event start date. N-grams can then be extracted from each person's event sequence. Fidelity is quantified using MSE/R2, comparing N-gram frequencies in the simulated vs. original datasets for varying values of N.
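To make the N-gram comparison concrete, here is a rough sketch. It is not SDMetrics code: the function names (`event_sequences`, `ngram_frequencies`, `ngram_mse`) and the `(subject_id, event, start_date)` tuple layout are hypothetical, and normalizing counts to relative frequencies before taking the MSE is an assumption rather than something specified in this thread.

```python
from collections import Counter


def event_sequences(records):
    """Group (subject_id, event, start_date) tuples into per-subject
    event lists ordered by start date (ties broken by event name)."""
    by_subject = {}
    for subject_id, event, start_date in records:
        by_subject.setdefault(subject_id, []).append((start_date, event))
    return {s: [e for _, e in sorted(items)] for s, items in by_subject.items()}


def ngram_frequencies(sequences, n):
    """Relative frequency of each N-gram, pooled over all subjects."""
    counts = Counter()
    for seq in sequences.values():
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def ngram_mse(original, simulated, n):
    """Mean squared error between N-gram frequencies of two datasets,
    taken over the union of N-grams seen in either one (assumed choice)."""
    f_orig = ngram_frequencies(event_sequences(original), n)
    f_sim = ngram_frequencies(event_sequences(simulated), n)
    grams = set(f_orig) | set(f_sim)
    if not grams:
        return 0.0
    return sum((f_orig.get(g, 0.0) - f_sim.get(g, 0.0)) ** 2 for g in grams) / len(grams)
```

Calling `ngram_mse(original, simulated, n)` for several values of `n` gives one score per N. A score of 0 means the two datasets have identical N-gram frequencies for that N.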
Definition
SubjectID = Identifier for a subject
Event = An identifier to define the type of event
Start date = date of event starting
End date = date of event ending
Event Duration = Days between End date and start date
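As a rough illustration of how these fields could map onto code, the sketch below models one row per event and derives the event duration and the sequence length distribution from it. The `EventRecord` class and both function names are hypothetical. The thread does not specify a distance, so total variation distance is used here purely as one possible choice.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EventRecord:
    subject_id: str   # SubjectID
    event: str        # Event type identifier
    start_date: date  # Start date
    end_date: date    # End date

    @property
    def duration_days(self) -> int:
        # Event Duration = days between End date and Start date
        return (self.end_date - self.start_date).days


def sequence_length_distribution(records):
    """Map each sequence length (events per subject) to the share of subjects with it."""
    events_per_subject = Counter(r.subject_id for r in records)
    length_counts = Counter(events_per_subject.values())
    n_subjects = len(events_per_subject)
    return {length: c / n_subjects for length, c in length_counts.items()}


def total_variation_distance(p, q):
    """Total variation distance between two discrete distributions (assumed metric)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

The distance ranges from 0 (identical sequence length distributions) to 1 (no overlap), so `1 - distance` could serve as a similarity score if a higher-is-better convention is preferred.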