ucscXena / xena-GDC-ETL

Extract, transform and load GDC data onto UCSC Xena
Apache License 2.0

Remove blood normals from new segmented copy number datasets #60

Closed maryjgoldman closed 5 years ago

maryjgoldman commented 5 years ago

The new segmented copy number datasets (DNAcopy) have copy number data for samples that are blood normal. We only want copy number data for the tumor samples.

[Screenshot: Screen Shot 2019-07-01 at 4 16 42 PM]

The samples in black are the ones whose segmented copy number data we want to remove.

newCNV-bloodnormaltoberemoved.txt

This is a bookmark of the above data. It can be imported back into Xena via the bookmark menu.

maryjgoldman commented 5 years ago

The masked segmented copy number dataset already has these samples removed. Please follow the code that was already written for it.

yunhailuo commented 5 years ago

~CNV and Masked CNV are treated similarly.~

~Removal of blood normals was done for clinical data: https://github.com/yunhailuo/xena-GDC-ETL/blob/master/xena_gdc_etl/xena_dataset.py#L1622-L1629~

maryjgoldman commented 5 years ago

One of the samples in the screenshot and bookmark I provided that has segmented CNV data but no clinical data is TCGA-19-0955-10A. If you look this up in the GDC, you see it is a blood derived normal.

https://portal.gdc.cancer.gov/cases/f473a0d0-72b6-465e-93f7-122224960e80?bioId=92c6077b-35ab-403a-9f26-374b0702717d

[Screenshot: Screen Shot 2019-07-01 at 4 42 24 PM]
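For readers unfamiliar with TCGA barcodes, the sample type can also be read straight from the barcode: the first two characters of the fourth dash-separated field are the sample type code, and `10` denotes Blood Derived Normal. The snippet below is a minimal sketch of that lookup. The helper name `sample_type_code` and the abbreviated code table are illustrative assumptions, not part of xena-GDC-ETL.

```python
# Abbreviated table of TCGA sample type codes (illustrative subset only).
SAMPLE_TYPE_CODES = {
    "01": "Primary Tumor",
    "10": "Blood Derived Normal",
    "11": "Solid Tissue Normal",
}


def sample_type_code(barcode):
    """Return the two-digit sample type code of a TCGA sample barcode.

    Hypothetical helper: expects a barcode with at least four
    dash-separated fields, such as "TCGA-19-0955-10A".
    """
    fields = barcode.split("-")
    if len(fields) < 4 or len(fields[3]) < 2:
        raise ValueError("not a sample-level TCGA barcode: %r" % barcode)
    return fields[3][:2]


code = sample_type_code("TCGA-19-0955-10A")
print(code, SAMPLE_TYPE_CODES.get(code, "unknown"))
```

For the sample named above, the code is `10`, which agrees with what the GDC portal reports.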

maryjgoldman commented 5 years ago

If there is no code already written for removing blood derived normals from the masked segmented copy number data, we need to write it for the segmented CNV data.

yunhailuo commented 5 years ago

~These are all TCGA data on blood derived normal samples: https://portal.gdc.cancer.gov/repository?facetTab=files&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.program.name%22%2C%22value%22%3A%5B%22TCGA%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.samples.sample_type%22%2C%22value%22%3A%5B%22Blood%20Derived%20Normal%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.access%22%2C%22value%22%3A%5B%22open%22%5D%7D%7D%5D%7D&searchTableTab=files~

~You should see some of them in CNV, Masked CNV and 4 types of SNV data. Have you?~

yunhailuo commented 5 years ago

I'm really sorry. My bad. Masked CNV data are filtered for blood derived normal here: https://github.com/yunhailuo/xena-GDC-ETL/blob/master/xena_gdc_etl/xena_dataset.py#L874-L894

I might be getting really close to Alzheimer's or something...

yunhailuo commented 5 years ago

@ayan-b Please add that filter to CNV data (4 lines above). Thank you!
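As a rough illustration of what such a filter amounts to, here is a minimal sketch that drops blood derived normal rows from a segmented copy number table after download. It is not the code at the link above, which may well filter at the GDC query stage instead. The function name `drop_blood_normals`, the column names, and the example rows (including the tumor barcode and all values) are assumptions made up for the example.

```python
import csv
import io

BLOOD_DERIVED_NORMAL = "10"  # TCGA sample type code for blood derived normals


def drop_blood_normals(rows, sample_field="sample"):
    """Yield only rows whose sample barcode is not a blood derived normal.

    Hypothetical helper: `rows` is an iterable of dicts and `sample_field`
    names the column that holds the TCGA sample barcode.
    """
    for row in rows:
        fields = row[sample_field].split("-")
        # The first two characters of the fourth field are the sample type code.
        if len(fields) >= 4 and fields[3][:2] == BLOOD_DERIVED_NORMAL:
            continue
        yield row


# Made-up segment table; the column layout is an assumption.
segments = io.StringIO(
    "sample\tChrom\tStart\tEnd\tvalue\n"
    "TCGA-19-0955-01A\tchr1\t100\t200\t0.12\n"
    "TCGA-19-0955-10A\tchr1\t100\t200\t0.01\n"
)
kept = list(drop_blood_normals(csv.DictReader(segments, delimiter="\t")))
print([row["sample"] for row in kept])
```

Filtering on the barcode keeps the check independent of clinical data, which matters here because the samples in question have none.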

ayan-b commented 5 years ago

@maryjgoldman Added CNV data to the hub.

maryjgoldman commented 5 years ago

This looks good. Blood derived normals are removed.