greenelab / tybalt

Training and evaluating a variational autoencoder for pan-cancer gene expression data
BSD 3-Clause "New" or "Revised" License

Matching pancancer expression to metadata #152

Closed: GlastonburyC closed this issue 4 years ago

GlastonburyC commented 4 years ago

Hi @gwaygenomics @cgreene, I would like to map the samples (index values) in pancan_scaled_rnaseq.tsv.gz to the metadata in tcga-clinical_data.tsv.

Currently the index values for the rnaseq are not unique and I am unable to match them to the metadata.

Could you please advise on how I could, for example, subset the pancancer data to a single cancer subtype by tying it to the metadata?

gwaybio commented 4 years ago

pancan_scaled_rnaseq.tsv.gz includes sample-level information, while tcga-clinical_data.tsv includes patient-level information. The sample-level identifiers are much more descriptive than the patient-level identifiers.

Mapping between the two files can be done by subsetting TCGA barcodes. Info here: https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/

This file might also be helpful: https://github.com/greenelab/pancancer/blob/master/data/sample_freeze.tsv
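Barcode subsetting can be sketched roughly as follows. This is only a minimal illustration, assuming the identifiers follow the standard hyphen-delimited TCGA barcode layout described at the GDC link above; the helper names (`patient_id`, `sample_id`, `match_by_patient`) are made up for this example and are not part of this repository. Whichever file holds the longer barcode, truncating both sides to the patient-level prefix gives a common key.

```python
def patient_id(barcode: str) -> str:
    """Return the patient-level prefix (project-TSS-participant) of a TCGA barcode."""
    return "-".join(barcode.split("-")[:3])


def sample_id(barcode: str) -> str:
    """Return the sample-level prefix: patient prefix plus the two-digit sample type."""
    fields = barcode.split("-")
    if len(fields) < 4:
        raise ValueError(f"barcode has no sample field: {barcode!r}")
    # The fourth field is the sample type (two digits) followed by a vial letter.
    return "-".join(fields[:3] + [fields[3][:2]])


def match_by_patient(expression_ids, clinical_ids):
    """Map each expression identifier to the clinical identifier sharing its patient prefix."""
    clinical_by_patient = {patient_id(c): c for c in clinical_ids}
    return {
        e: clinical_by_patient[patient_id(e)]
        for e in expression_ids
        if patient_id(e) in clinical_by_patient
    }


# Example barcode taken from the GDC barcode documentation.
example = "TCGA-02-0001-01C-01D-0182-01"
print(patient_id(example))  # TCGA-02-0001
print(sample_id(example))   # TCGA-02-0001-01
```

Several samples can share one patient prefix, so the sample-to-patient mapping is many-to-one.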

GlastonburyC commented 4 years ago

Isn't it the opposite? pancan_scaled_rnaseq.tsv.gz looks like this:

[Screenshot 2019-11-12 at 16 57 02]

Whereas the clinical data contains the full barcode:

[Screenshot 2019-11-12 at 17 06 04]

gwaybio commented 4 years ago

In either direction, the mapping can be done the same way. I don't think I've used the portion_id column, though. Is there a sample_id column or something similar?

GlastonburyC commented 4 years ago

This file made it trivial: https://github.com/greenelab/pancancer/blob/master/data/sample_freeze.tsv Thanks.
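For anyone landing here later, subsetting with a sample-to-disease table might look something like the sketch below. The column names (`SAMPLE_BARCODE`, `DISEASE`), the toy rows, and the helper names are assumptions made for illustration only; check the header of the real file before relying on them.

```python
import csv
import io

# Toy stand-in for a tab-separated sample table. The barcodes and column
# names are invented for this example and are NOT copied from sample_freeze.tsv.
TOY_FREEZE = (
    "SAMPLE_BARCODE\tDISEASE\n"
    "TCGA-XX-0001-01\tBRCA\n"
    "TCGA-XX-0002-01\tLUAD\n"
    "TCGA-XX-0003-01\tBRCA\n"
)


def samples_for_disease(freeze_file, disease,
                        id_col="SAMPLE_BARCODE", disease_col="DISEASE"):
    """Collect the sample identifiers annotated with the given disease label."""
    reader = csv.DictReader(freeze_file, delimiter="\t")
    return {row[id_col] for row in reader if row[disease_col] == disease}


def subset_rows(expression_rows, keep_ids):
    """Keep only the expression rows whose identifier is in keep_ids."""
    return {k: v for k, v in expression_rows.items() if k in keep_ids}


brca_ids = samples_for_disease(io.StringIO(TOY_FREEZE), "BRCA")

# Hypothetical expression rows keyed by sample identifier.
toy_expression = {
    "TCGA-XX-0001-01": [0.1, 0.7],
    "TCGA-XX-0002-01": [0.4, 0.2],
}
print(subset_rows(toy_expression, brca_ids))
```

With the real files, the same filter would be applied to the index of the expression matrix instead of the toy dictionary.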