Closed wangqianwen0418 closed 3 years ago
CART cell data:
There are too many missing values. Even if we use some method to handle the missing values (issue https://github.com/hms-dbmi/OncoThreads/issues/261), the conclusions of the analysis would not be solid.
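To put a number on "too many missing values", one option is to compute the fraction of missing entries per feature before deciding whether a dataset is usable. The sketch below is only an illustration: the function name `missing_fraction`, the row-as-dict layout, and the set of markers treated as missing are all assumptions, not part of the CAR-T data or the OncoThreads code.

```python
def missing_fraction(rows, features, missing=(None, "", "NA")):
    """Return {feature: fraction of rows where the value is missing}.

    rows     -- list of dicts, one per sample (hypothetical layout)
    features -- feature names to check
    missing  -- values counted as missing (an assumption; adjust to the data).
                A key that is absent from a row also counts as missing.
    """
    total = len(rows)
    if total == 0:
        return {}
    return {
        f: sum(1 for r in rows if r.get(f) in missing) / total
        for f in features
    }
```

Features whose missing fraction exceeds some agreed threshold could then be dropped or flagged before any analysis is attempted.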
Cancer progression datasets:
Diabetes dataset https://data.world/uci/diabetes
Diabetes biomarker disease progression, rat dataset: 50 liver samples from GK (Goto-Kakizaki) rats and 51 liver samples from WKY (Wistar-Kyoto) rats fed a normal diet (ND) or a high-fat diet (HFD) and sacrificed at 5 different ages: 4, 8, 12, 16, and 20 weeks. So each time point contains 5 GK samples with ND, 5 GK samples with HFD, 5 WKY samples with ND, and 5 WKY samples with HFD. https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE13268
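The design of the rat dataset can be enumerated as a quick sanity check. This is a minimal sketch that assumes a perfectly balanced design of 5 animals per strain/diet/age group, as in the description above; the field names are made up for illustration. Note that the balanced grid yields 100 samples, one fewer than the 101 (50 GK + 51 WKY) quoted, so the actual GEO series presumably has one group with an extra sample.

```python
from itertools import product

STRAINS = ["GK", "WKY"]
DIETS = ["ND", "HFD"]
AGES_WEEKS = [4, 8, 12, 16, 20]
REPLICATES_PER_GROUP = 5  # assumed balanced, per the description


def design_grid():
    """List every (strain, diet, age, replicate) combination of the balanced design."""
    return [
        {"strain": s, "diet": d, "age_weeks": a, "replicate": r}
        for s, d, a in product(STRAINS, DIETS, AGES_WEEKS)
        for r in range(1, REPLICATES_PER_GROUP + 1)
    ]


grid = design_grid()
per_timepoint = sum(1 for g in grid if g["age_weeks"] == 4)
print(len(grid), per_timepoint)
```

Each age then corresponds to one time point with 20 samples, which is the shape a longitudinal view would need.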
The requirements of datasets:
questions to be answered:
@wangqianwen0418 I reviewed the datasets mentioned in "Every which way? On predicting tumor evolution using cancer progression models", https://doi.org/10.1371/journal.pcbi.1007246.s009. The cancer datasets they used are not actually longitudinal (they are mostly TCGA data), so I don't think they will work for us.
Thanks for the feedback
Potential datasets I found:
CN-AML: Evolution of Cytogenetically Normal Acute Myeloid Leukemia During Therapy and Relapse: An Exome Sequencing Study of 50 Patients, Greif et al., Clin Cancer Res 2018 https://clincancerres.aacrjournals.org/content/24/7/1716.long
CLL https://www.nature.com/articles/s41467-017-02329-y data in dbGaP - can we get this?
TRACERx Lung https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5812436/ genomic data in cBioPortal; need to figure out the clinical data situation
Here is the initial analysis of the CN-AML dataset. Since the original dataset has many redundant features, I consulted the paper and selected the genes DNMT3A, FLT3, IDH2, and IDH1 as timepoint features, and gender, age, relapse days, AML type, and ELN risk as patient-level features. @tmazor, does this pre-processing make sense to you? Are there any other features that you think should be added?
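The split described above can be sketched roughly as follows. This is not the actual pre-processing script: the column names (`patient_id`, `timepoint`, `relapse_days`, `aml_type`, `eln_risk`, and so on) and the one-row-per-sample layout are assumptions made for illustration; only the gene list and the choice of patient-level attributes come from the comment.

```python
# Genes kept as timepoint-level features (from the comment above).
GENE_FEATURES = ["DNMT3A", "FLT3", "IDH2", "IDH1"]
# Patient-level attributes; these column names are hypothetical.
PATIENT_FEATURES = ["gender", "age", "relapse_days", "aml_type", "eln_risk"]


def split_features(rows):
    """Split per-sample rows into timepoint-level and patient-level records.

    rows -- list of dicts, one per (patient, timepoint) sample; the keys
            'patient_id' and 'timepoint' are assumed to exist.
    Returns (timepoint_rows, patients), where patients maps
    patient_id -> dict of patient-level features (taken from the first
    row seen for that patient).
    """
    timepoint_rows = []
    patients = {}
    for row in rows:
        record = {"patient_id": row["patient_id"], "timepoint": row["timepoint"]}
        record.update({g: row.get(g) for g in GENE_FEATURES})
        timepoint_rows.append(record)
        patients.setdefault(
            row["patient_id"], {f: row.get(f) for f in PATIENT_FEATURES}
        )
    return timepoint_rows, patients
```

Keeping the two levels separate mirrors the distinction between values that can change between diagnosis and relapse and values that are fixed per patient.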
@tmazor, attached please find the sample and mutation files for the CN-AML dataset: AML_mutation.txt, AML_sample_freq.txt, AML_timeline.txt, AML_patients.txt. Please let me know if you have any problems with the data. Many thanks!
related issue https://github.com/hms-dbmi/OncoThreads/issues/251: synthea covid19 dataset