sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Model/Procedure to synthesise relational (multi-table) data with temporal dependencies #867

Open viandres opened 2 years ago

viandres commented 2 years ago

Problem Description

Data such as the MIMIC IV dataset (especially the patients, admissions, diagnoses, prescriptions and procedures tables) can be both relational (multi-table) and sequential (temporal/inter-row dependencies). There is currently no way to handle such data: the PAR model for sequential data has no configuration for multi-table structure, and the other models, which do accept metadata, are not suitable for temporal data.

Expected behavior

A (well-performing) model (e.g. MTGAN, TimeGAN, PAR) to synthesise sequential data split over multiple tables (e.g. the MIMIC IV dataset) that captures the statistical properties of the data, preserves privacy and captures the longitudinal structure (temporal associations) within the data.

Example code for usage (with MIMIC data):

# Store tables in a dictionary.
# Assumes `dfs` is a list of pandas DataFrames, each given a custom `name` attribute beforehand.
tables = {}
for df in dfs:
    tables[df.name] = df

# Add metadata.
from sdv import Metadata  # assumed import, matching the SDV 0.x Metadata API used below

metadata = Metadata()

metadata.add_table(
    name='patients',
    data=tables['patients'],
    primary_key='subject_id'
)

metadata.add_table(
    name='admissions',
    data=tables['admissions'],
    primary_key='hadm_id',
    parent='patients',
    foreign_key='subject_id'
)

(...)

metadata.add_table(
    name='procedures',
    data=tables['procedures'],
    parent='admissions',
    foreign_key='hadm_id'
)
metadata.to_dict()

# Set time parameters for each table as dictionaries.
entity_columns = {"patients": [], "admissions": [], ..., "procedures": []}
context_columns = {"patients": [], "admissions": [], ..., "procedures": []}
sequence_indices = {"patients": [], "admissions": ["admittime", "dischtime"], ..., "procedures": []}

# Fit/train some sequential model.
model = SomeTimeModel(metadata, sequence_indices, context_columns, entity_columns)
model.fit(tables)

# syn_data should be a dictionary with the same structure as the tables variable.
syn_data = model.sample()
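
The last expectation above (that `syn_data` mirrors `tables`) can be checked mechanically. The following is only a minimal sketch and not part of SDV: `check_same_structure` is a hypothetical helper, the `relationships` tuples are an assumed way of writing down the parent/child links from the metadata, and, to keep the example dependency-free, each table is a plain list of row dictionaries rather than a DataFrame. It compares table names and column names, and verifies that every foreign key in a synthetic child table points at an existing synthetic parent row.

```python
def check_same_structure(real, synthetic, relationships):
    """Return a list of human-readable problems (empty if none are found).

    real, synthetic: dict mapping table name -> list of row dicts.
    relationships: list of (parent, child, key) tuples; the key column is
    assumed to have the same name in both tables, as in the metadata above.
    """
    problems = []

    # 1. Both dictionaries must contain the same tables.
    if set(real) != set(synthetic):
        problems.append(f"table names differ: {sorted(set(real) ^ set(synthetic))}")

    # 2. Shared tables must have the same columns (checked on the first row).
    for name in sorted(set(real) & set(synthetic)):
        if real[name] and synthetic[name]:
            if set(real[name][0]) != set(synthetic[name][0]):
                problems.append(f"columns differ in table '{name}'")

    # 3. Every foreign key in a child table must exist in its parent table.
    for parent, child, key in relationships:
        parent_keys = {row[key] for row in synthetic.get(parent, [])}
        orphans = [
            row[key]
            for row in synthetic.get(child, [])
            if row[key] not in parent_keys
        ]
        if orphans:
            problems.append(
                f"'{child}.{key}' has values missing from '{parent}': {orphans}"
            )
    return problems


# Toy data; all values are invented for illustration.
tables = {
    "patients": [{"subject_id": 1, "gender": "F"}],
    "admissions": [{"hadm_id": 10, "subject_id": 1, "admittime": "2019-07-04 14:30"}],
}
syn_data = {
    "patients": [{"subject_id": 7, "gender": "M"}],
    # subject_id 8 does not exist in the synthetic patients table.
    "admissions": [{"hadm_id": 70, "subject_id": 8, "admittime": "2020-02-01 09:00"}],
}
relationships = [("patients", "admissions", "subject_id")]

problems = check_same_structure(tables, syn_data, relationships)
```

Here `problems` contains a single entry flagging the orphaned `admissions.subject_id`. With DataFrames, the same checks would compare `df.columns` and key columns instead of row dictionaries.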

npatki commented 2 years ago

Thanks for filing, @vandreslime. I'll cross-reference this with #863, where the original question came up.

We'll keep this open as a feature request and update it as we make progress. To help us prioritize, it would be great if you could share more details about your use case. How are you hoping to use the synthetic data?

Also, I am assuming that in a multi-table dataset, you may have a mix of some tables that are sequential (have temporal dependencies) and others that are not. Is that correct?

viandres commented 2 years ago

Thank you so much, @npatki!

I will try to clarify the use case/problem a bit:

The goal is to be able to synthesise any electronic healthcare records (EHR) data to ensure patient privacy, and then to run any kind of machine/deep learning prediction on it (diseases, death, medication, procedures, etc.).

The data structure is usually very similar to that of the MIMIC IV dataset, where the most important tables are patients, hospital admissions, prescriptions, procedures and diagnoses. In this case, the only table with time columns is hospital admissions, which holds an admission and a release date (and time). The procedures, diagnoses and prescriptions tables are linked to the admissions table by an admission id, and the admissions are linked to the patient demographics table by a patient id.

So it naturally matters when a patient receives a diagnosis, medication or procedure, and in what order. Unfortunately, we only know that through the admission id and admission time; no exact times of diagnosis or medication are available. Also, patients differ in their number of diagnoses, procedures and so on. The only static table is probably the patients table, since diagnoses, procedures and prescriptions depend temporally on the corresponding admission/release time in the admissions table. That could be tricky or require some preprocessing of the tables.
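
The preprocessing mentioned above could look roughly like the following. This is a minimal sketch under stated assumptions, not an SDV feature: `order_events_by_admission` is a hypothetical helper, the `code` column and all values are invented, and tables are plain lists of row dictionaries so the example needs no third-party packages. It copies `subject_id` and `admittime` from the admissions table onto each child row via `hadm_id`, then groups the rows per patient in admission order, which yields one variable-length sequence per patient.

```python
from collections import defaultdict
from datetime import datetime

# Toy stand-ins for MIMIC-style tables; all values are invented.
admissions = [
    {"hadm_id": 11, "subject_id": 1, "admittime": datetime(2020, 3, 1, 8, 0)},
    {"hadm_id": 10, "subject_id": 1, "admittime": datetime(2019, 7, 4, 14, 30)},
    {"hadm_id": 20, "subject_id": 2, "admittime": datetime(2021, 1, 15, 9, 45)},
]
diagnoses = [
    {"hadm_id": 11, "code": "I10"},
    {"hadm_id": 10, "code": "E11"},
    {"hadm_id": 10, "code": "J45"},
    {"hadm_id": 20, "code": "N18"},
]


def order_events_by_admission(child_rows, admissions):
    """Attach subject_id and admittime to each child row via hadm_id,
    then return {subject_id: rows sorted by admission time}."""
    by_hadm = {adm["hadm_id"]: adm for adm in admissions}
    per_patient = defaultdict(list)
    for row in child_rows:
        adm = by_hadm[row["hadm_id"]]
        enriched = {
            **row,
            "subject_id": adm["subject_id"],
            "admittime": adm["admittime"],
        }
        per_patient[adm["subject_id"]].append(enriched)
    for rows in per_patient.values():
        # list.sort is stable: rows from the same admission keep their
        # input order, because no finer-grained timestamp is available.
        rows.sort(key=lambda r: r["admittime"])
    return dict(per_patient)


sequences = order_events_by_admission(diagnoses, admissions)
# Patient 1's codes, in admission order: E11, J45, I10
```

The order of events within a single admission remains arbitrary here, which reflects the limitation described above: only the admission time is known. The resulting per-patient sequences have the shape a single-table sequential model expects, with `subject_id` as the entity column and `admittime` as the sequence index.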

npatki commented 2 years ago

Thanks for the detailed explanation @vandreslime. Definitely helps me see why the temporal dependencies need to be included in the multi-table use case.

Accommodating this feature into the SDV will take some time. I'll leave this issue up and we can use it to track progress and provide updates as we go along.