ucscXena / xena-GDC-ETL

Extract, transform and load GDC data onto UCSC Xena
Apache License 2.0

Add CPTAC cohort #57

Closed maryjgoldman closed 5 years ago

maryjgoldman commented 5 years ago

Should be just RNAseq data and clinical data from the API (no xml files)

maryjgoldman commented 5 years ago

Take all API fields that have data; don't filter by TCGA standards. If you hit fields with multiple values, highlight these fields for Mary to review and leave them out as a first pass.

ayan-b commented 5 years ago

Added RNASeq data.

maryjgoldman commented 5 years ago

Clinical API endpoints to keep: https://docs.google.com/spreadsheets/d/11JcG3j0-RzT3PgyQz2XiEKeTqBSs_nnAdOwYnZRmTDk

ayan-b commented 5 years ago

I noticed that some cases.samples.submitter_id values are empty. Should I remove those rows, or set something else (cases.case_id) as the index?

ayan-b commented 5 years ago

Added phenotype matrix generation in https://github.com/ucscXena/xena-GDC-ETL/pull/34.

maryjgoldman commented 5 years ago

I'll check out cases.samples.submitter_id to see if we should keep it. It's not a particularly important field as far as clinical data goes. It's more for folks looking for batch effects among the submitters.

yunhailuo commented 5 years ago

cases.samples.submitter_id is the sample ID we use for all matrices in TCGA. To my understanding, that's the most important info. What is the sample ID for CPTAC? I couldn't find any sample data on the hub.

maryjgoldman commented 5 years ago

> cases.samples.submitter_id is the sample ID we use for all matrices in TCGA. To my understanding, that's the most important info. What is the sample ID for CPTAC? I couldn't find any sample data on the hub.

Oh, I see! That's what you were saying @ayan-b - it is the index. Sorry for not understanding.

@ayan-b can you provide a tsv file with cases.samples.submitter_id and cases.case_id for all samples in the CPTAC cohort? We'll need to look into this more closely. Also, is there a UUID associated with the samples too?
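A minimal sketch of how such a TSV could be produced, assuming the case records have already been fetched from the GDC `cases` endpoint and that each hit is a dict with a `case_id` and a `samples` list (the record shape and the function names here are assumptions for illustration, not the code in this repo). The `sample_id` column carries the sample UUID asked about above.

```python
import csv
import io


def case_hits_to_rows(hits):
    """Flatten case records into (submitter_id, sample_id, case_id) tuples.

    Missing values are kept as empty strings so blank IDs stay visible.
    """
    rows = []
    for hit in hits:
        case_id = hit.get("case_id") or ""
        for sample in hit.get("samples", []):
            rows.append((
                sample.get("submitter_id") or "",
                sample.get("sample_id") or "",  # sample UUID
                case_id,
            ))
    return rows


def rows_to_tsv(rows):
    """Serialize the flattened rows as tab-separated text with a header."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow([
        "cases.samples.submitter_id",
        "cases.samples.sample_id",
        "cases.case_id",
    ])
    writer.writerows(rows)
    return buf.getvalue()
```

Keeping blank submitter IDs as empty cells, rather than dropping the rows, makes it easier to see which cases are affected.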

ayan-b commented 5 years ago

Uploaded.

maryjgoldman commented 5 years ago

Also, as a note, I'm seeing 322 cases on the GDC portal: [screenshot: Screen Shot 2019-07-03 at 10 52 58 AM]

And 469 on ours: [screenshot: Screen Shot 2019-07-03 at 10 52 47 AM]

yunhailuo commented 5 years ago

This should be the link for all cases in CPTAC: https://api.gdc.cancer.gov/cases/?fields=samples.submitter_id,samples.sample_id&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.program.name%22%2C%22value%22%3A%5B%22CPTAC%22%5D%7D%7D%5D%7D&size=333&pretty=true

Every sample in it has both submitter_id and sample_id.
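For reference, the same query can be built programmatically rather than hand-encoding the filter. This is a sketch using only the standard library; the function names are made up for illustration, and the assumed response shape (a list of hits, each with a `samples` list) should be checked against the actual API response.

```python
import json
from urllib.parse import urlencode

GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def build_cptac_cases_url(size=333):
    """Build the cases query for the CPTAC program (same filter as the link above)."""
    filters = {
        "op": "and",
        "content": [{
            "op": "in",
            "content": {
                "field": "cases.project.program.name",
                "value": ["CPTAC"],
            },
        }],
    }
    params = {
        "fields": "samples.submitter_id,samples.sample_id",
        "filters": json.dumps(filters, separators=(",", ":")),
        "size": size,
        "pretty": "true",
    }
    return GDC_CASES_ENDPOINT + "?" + urlencode(params)


def samples_missing_ids(hits):
    """Return the sample records that lack a submitter_id or a sample_id."""
    missing = []
    for hit in hits:
        for sample in hit.get("samples", []):
            if not sample.get("submitter_id") or not sample.get("sample_id"):
                missing.append(sample)
    return missing
```

An empty list from `samples_missing_ids` would confirm the observation that every sample carries both IDs.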

yunhailuo commented 5 years ago

> Also, as a note I'm seeing 322 cases on the GDC portal: [screenshot: Screen Shot 2019-07-03 at 10 52 58 AM]
>
> And 469 on ours: [screenshot: Screen Shot 2019-07-03 at 10 49 16 AM]

Against TCGA-GBM?

maryjgoldman commented 5 years ago

Sorry, wrong screenshot. I edited my comment with the correct screenshot. I am also attaching it here:

[screenshot: Screen Shot 2019-07-03 at 10 52 47 AM]

yunhailuo commented 5 years ago

> Sorry, wrong screenshot. I edited my comment with the correct screenshot. I am also attaching it here
>
> [screenshot: Screen Shot 2019-07-03 at 10 52 47 AM]

Isn't that the number of samples, meaning 469 samples from 322 cases? TCGA-BRCA is similar: we have n=1,217 for HTSeq - Counts, corresponding to 1,092 cases on GDC.
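To illustrate the distinction, here is a small sketch that counts distinct cases and distinct samples from (case ID, sample ID) pairs. The input shape and function name are hypothetical; the point is only that one case can contribute several samples, so the two counts need not match.

```python
def count_cases_and_samples(pairs):
    """Return (number of distinct cases, number of distinct samples).

    `pairs` is an iterable of (case_id, sample_id) tuples.
    """
    cases = set()
    samples = set()
    for case_id, sample_id in pairs:
        cases.add(case_id)
        samples.add(sample_id)
    return len(cases), len(samples)


# Made-up IDs: two cases, one of which has two samples.
example_pairs = [
    ("case-A", "sample-A1"),
    ("case-A", "sample-A2"),
    ("case-B", "sample-B1"),
]
n_cases, n_samples = count_cases_and_samples(example_pairs)
```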

maryjgoldman commented 5 years ago

True. My comment was not helpful for debugging why cases.samples.submitter_id is blank for so many cases.

I have been googling around to try to figure out why cases.samples.submitter_id is blank for so many cases. I will report back when I know more, though I'm not seeing much information about this. It is likely we will need to ask the GDC. I do not want to use another column, like cases.demographic.submitter_id or cases.case_id, until we understand why this is happening.

yunhailuo commented 5 years ago

> why cases.samples.submitter_id is blank for so many cases

Are you talking about the missing cases.samples.submitter_id values in the sample TSV? I didn't check all of them, but some might be missing because the row (which is a file rather than a sample) is associated with more than one sample:

We need to look into both the code and the data, and I need some time later to check that. My quick take is:

  1. Each row in the phenotype data should be a sample, not a file (the id column, BS, at the end is the file ID; that's why I say/guess that the sample TSV matrix has one file per row). This is a mistake in our code. I'll check later.
  2. It's strange that a single aliquot belongs to two samples:
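A rough sketch of how both points could be checked, assuming file-level rows are available as dicts with a `file_id` and a list of `sample_submitter_ids`, and aliquot membership as (aliquot ID, sample ID) pairs. These shapes and names are hypothetical and do not reflect the repo's actual data structures.

```python
from collections import defaultdict


def files_with_multiple_samples(file_rows):
    """List the file IDs whose row is associated with more than one sample."""
    return [
        row["file_id"]
        for row in file_rows
        if len(set(row["sample_submitter_ids"])) > 1
    ]


def to_sample_rows(file_rows):
    """Re-key file-per-row data so that each entry is a sample.

    Returns a dict mapping sample submitter ID to the list of its file IDs.
    """
    files_by_sample = defaultdict(list)
    for row in file_rows:
        for sample_id in row["sample_submitter_ids"]:
            files_by_sample[sample_id].append(row["file_id"])
    return dict(files_by_sample)


def aliquots_in_multiple_samples(aliquot_sample_pairs):
    """Map each aliquot that appears under more than one sample to those samples."""
    samples_by_aliquot = defaultdict(set)
    for aliquot_id, sample_id in aliquot_sample_pairs:
        samples_by_aliquot[aliquot_id].add(sample_id)
    return {
        aliquot_id: sorted(sample_ids)
        for aliquot_id, sample_ids in samples_by_aliquot.items()
        if len(sample_ids) > 1
    }
```

Running the first and third checks on the real data would show how many rows are affected before any change to the index column is made.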