use cases

wangqianwen0418 commented 3 years ago

related issue https://github.com/hms-dbmi/OncoThreads/issues/251: synthea covid19 dataset

wangqianwen0418 commented 3 years ago

CART cell data:

There are too many missing values. Even if we use some method to handle the missing values (issue https://github.com/hms-dbmi/OncoThreads/issues/261), the conclusions of the analysis would not be solid.

wangqianwen0418 commented 3 years ago

Cancer progression datasets:
- From the paper: Every which way? On predicting tumor evolution using cancer progression models, https://doi.org/10.1371/journal.pcbi.1007246.s009
- The authors tested ML models on 22 cancer datasets. Datasets can be downloaded. Need to further check the datasets.
Diabetes dataset: https://data.world/uci/diabetes
- observation data about 74 patients in one month (3 times a day)
- observations include: date, time, observation code, observation value
- observation can be categorised into 5 groups: insulin, glucose, symptoms, ingestion, activity
- don't have patient-level features (e.g., gender, BMI)
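A minimal sketch of how these observation records could be read and tallied by group. It assumes each record is one tab-separated line of date, time, code, and value, with dates written as MM-DD-YYYY; that layout should be checked against the downloaded files. The function names are hypothetical, and the code-to-group mapping is passed in by the caller because the thread does not list which codes belong to which of the five groups.

```python
from collections import Counter
from datetime import datetime


def parse_observation(line):
    """Split one record into (timestamp, code, value).

    Assumes tab-separated fields: date (MM-DD-YYYY), time (HH:MM), code, value.
    The value is kept as a string because not every value is guaranteed to be numeric.
    """
    date_str, time_str, code, value = line.rstrip("\n").split("\t")
    timestamp = datetime.strptime(f"{date_str} {time_str}", "%m-%d-%Y %H:%M")
    return timestamp, int(code), value


def count_by_group(lines, code_to_group):
    """Count observations per group (e.g. insulin, glucose, symptoms, ingestion, activity).

    code_to_group is a caller-supplied dict {code: group name}; codes that are
    not in the mapping are counted under "unknown". Blank lines are skipped.
    """
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        _, code, _ = parse_observation(line)
        counts[code_to_group.get(code, "unknown")] += 1
    return counts
```

Counting per group for each patient file gives a quick view of which of the five groups are dense enough to serve as timepoint features.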
Diabetes biomarker disease progression, rat dataset: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE13268
- 50 liver samples from GK (Goto-Kakizaki) rats and 51 liver samples from WKY (Wistar-Kyoto) rats
- rats were fed a normal diet (ND) or a high-fat diet (HFD) and sacrificed at 5 different ages: 4, 8, 12, 16, and 20 weeks
- so each time point contains 5 GK samples with ND, 5 GK samples with HFD, 5 WKY samples with ND, and 5 WKY samples with HFD

wangqianwen0418 commented 3 years ago

Dataset requirements:

- multiple timepoints (>= 3)
- genomic data preferred, but not required
- preferably cancer-related
- number of patients (>= 25)
- multiple timepoint features (preferably of multiple feature types)
- limited missing data
- event features and patient-level features preferred => longitudinal biological correlation
- expected to show state transition patterns
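The quantitative requirements above (at least 3 timepoints, at least 25 patients, limited missing data) can be sketched as a quick screening function. This is only an illustration: the record layout (dicts with `patient` and `timepoint` keys), the function name, and the 20% missing-value threshold are assumptions, since the thread only says "limited missing data".

```python
from collections import defaultdict

MIN_TIMEPOINTS = 3      # from the requirement "multiple timepoints (>= 3)"
MIN_PATIENTS = 25       # from the requirement "number of patients (>= 25)"
MAX_MISSING_RATE = 0.2  # assumed threshold; the thread only says "limited missing data"


def screen_dataset(records, feature_names):
    """Check a long-format dataset against the numeric requirements.

    records: iterable of dicts with hypothetical keys "patient" and "timepoint",
    plus one key per feature; a value of None (or an absent key) counts as missing.
    """
    timepoints = defaultdict(set)
    total = 0
    missing = 0
    for row in records:
        timepoints[row["patient"]].add(row["timepoint"])
        for name in feature_names:
            total += 1
            if row.get(name) is None:
                missing += 1
    # Only patients with enough distinct timepoints count toward the cohort size.
    eligible = [p for p, tps in timepoints.items() if len(tps) >= MIN_TIMEPOINTS]
    missing_rate = missing / total if total else 1.0
    return {
        "eligible_patients": len(eligible),
        "missing_rate": missing_rate,
        "passes": len(eligible) >= MIN_PATIENTS and missing_rate <= MAX_MISSING_RATE,
    }
```

The qualitative requirements (cancer-related, event features, state transition patterns) still need manual review of each candidate dataset.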

questions to be answered:

tmazor commented 3 years ago

@wangqianwen0418 I reviewed the datasets mentioned in: Every which way? On predicting tumor evolution using cancer progression models, https://doi.org/10.1371/journal.pcbi.1007246.s009 The cancer datasets they used were not actually longitudinal data - it's actually mostly TCGA data - so I don't think it will work for us.

wangqianwen0418 commented 3 years ago

> @wangqianwen0418 I reviewed the datasets mentioned in: Every which way? On predicting tumor evolution using cancer progression models, https://doi.org/10.1371/journal.pcbi.1007246.s009 The cancer datasets they used were not actually longitudinal data - it's actually mostly TCGA data - so I don't think it will work for us.

Thanks for the feedback

wangqianwen0418 commented 3 years ago

datasets that I have checked but that may not be suitable:

- MIMIC III: https://mimic.physionet.org/
- childhood obesity: https://storage.googleapis.com/synthea-public/synthetic_denver.zip
- SyntheticMass: https://synthea.mitre.org/downloads

why: These datasets contain a wide variety of EHR data. It is hard to find a target patient cohort with the same disease and regularly repeated timepoint values.

datasets that I am still working on:

- diabetes: https://data.world/uci/diabetes. I have downloaded the dataset and will process it later.
- Parkinson's: https://www.michaeljfox.org/news/parkinsons-progression-markers-initiative-ppmi. I have applied for access to this dataset. The data committee asked me to "reapply with additional information on your proposed analyses". I have reapplied and am waiting for feedback. I am still not sure whether this is a suitable dataset.

tmazor commented 3 years ago

Potential datasets I found:

- AML: "Evolution of Cytogenetically Normal Acute Myeloid Leukemia During Therapy and Relapse: An Exome Sequencing Study of 50 Patients", Greif et al, Clin Cancer Res 2018, https://clincancerres.aacrjournals.org/content/24/7/1716.long
- CLL: https://www.nature.com/articles/s41467-017-02329-y (data in dbGaP; can we get this?)
- TRACERx Lung: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5812436/ (genomic data in cBioPortal; need to figure out the clinical data situation)

wangqianwen0418 commented 3 years ago

Here is the initial analysis of the CN-AML dataset. Since the original dataset has many redundant features, I consulted the paper and selected four genes (DNMT3A, FLT3, IDH2, IDH1) as timepoint features, and gender, age, relapse days, AML type, and ELN risk as patient-level features. @tmazor, does this pre-processing make sense to you? Are there any other features that you think should be added?
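The gene selection described above could look roughly like the sketch below, which turns a list of mutation calls into one binary feature per selected gene for each sample. The input layout (dicts with `sample` and `gene` keys) and the function name are hypothetical; the actual files may be organized differently.

```python
# Genes chosen as timepoint features in the comment above.
SELECTED_GENES = ("DNMT3A", "FLT3", "IDH2", "IDH1")


def mutation_features(mutation_rows, genes=SELECTED_GENES):
    """Return {sample: {gene: 0 or 1}} for the selected genes.

    mutation_rows: iterable of dicts with hypothetical keys "sample" and "gene",
    one row per mutation call. Mutations in other genes are ignored, but their
    samples still get an all-zero row. Samples with no rows at all do not appear
    in the result and would need to be added from the sample list.
    """
    features = {}
    for row in mutation_rows:
        sample_features = features.setdefault(row["sample"], {g: 0 for g in genes})
        if row["gene"] in sample_features:
            sample_features[row["gene"]] = 1
    return features
```

A binary encoding is only one option; if variant allele frequencies are available, they could be used as the feature values instead.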

wangqianwen0418 commented 3 years ago

@tmazor, attached please find the sample and mutation files for the CN-AML dataset: AML_mutation.txt, AML_sample_freq.txt, AML_timeline.txt, AML_patients.txt. Please let me know if you have any problems with the data. Many thanks!

hms-dbmi / OncoThreads

use cases #262
