mschubert / TCGAbiolinks-downloader

GNU Make-driven workflow to download TCGA data via the TCGAbiolinks package

Extended Clinical data from GDC TCGA #5

Open jperales opened 6 years ago

jperales commented 6 years ago

Hi, this repository looks very promising! I really like the idea of using Make to get all the TCGA data in a tidy format for R.

However, I noticed that the XENAbrowser data pages for the GDC TCGA cohorts provide extended metadata in the clinical data that your approach does not include, for example drug_name and primary_therapy_outcome_success.

Source: https://xenabrowser.net/datapages/?dataset=TCGA-PAAD/Xena_Matrices/TCGA-PAAD.GDC_phenotype.tsv&host=https://gdc.xenahubs.net .

The most similar variables in your approach are therapeutic_agents and treatment_or_therapy. However, both contain only NAs for all patients:

> load("./clinical/TCGA-PAAD.RData")
> grep("treat|drug|outcome",colnames(query),value=TRUE)
[1] "treatment_id"          "days_to_treatment"     "treatment_intent_type" "treatment_or_therapy" 
> all(is.na(query$therapeutic_agents))
[1] TRUE
> all(is.na(query$treatment_or_therapy))
[1] TRUE
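
The same check can be generalized to every matching column at once. The following is only a minimal sketch in base R: the helper name `na_summary` and the toy data frame are made up for illustration, and the toy data merely stands in for the `query` object loaded above.

```r
# Hypothetical helper (not part of TCGAbiolinks or this repository):
# for every column whose name matches `pattern`, return the fraction
# of missing values in that column.
na_summary <- function(df, pattern) {
  cols <- grep(pattern, colnames(df), value = TRUE)
  vapply(df[cols], function(x) mean(is.na(x)), numeric(1))
}

# Toy stand-in for the `query` data frame; the values are invented.
toy <- data.frame(
  treatment_or_therapy = c(NA, NA, NA),
  therapeutic_agents   = c(NA, NA, NA),
  days_to_treatment    = c(10, NA, 35),
  gender               = c("male", "female", "male")
)

# "therap" is added to the pattern so that therapeutic_agents is matched too.
res <- na_summary(toy, "treat|therap|drug|outcome")
print(res)
```

A value of 1 means the column is entirely NA. On the real object the call would be `na_summary(query, "treat|therap|drug|outcome")`.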

Do you think we could somehow include this extended data? It would be very useful for functional genomics. Thanks!

Best, Javier

mschubert commented 6 years ago

This looks like a good addition, thank you for pointing that out!

If I remember correctly, TCGAbiolinks already includes some of this information as colData in the SummarizedExperiment objects (e.g. for RNA-seq data), but it should be available from the clinical data object.