Closed maryjgoldman closed 5 years ago
Take all API fields that have data; don't filter by TCGA standards. If we hit fields with multiple values, highlight these fields for Mary to review and leave them out as a first pass.
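A minimal sketch of that first pass, assuming the API records have already been flattened into one dict per sample (the record shape and the name `split_fields` are hypothetical, not taken from the ETL code). Fields with no data anywhere are dropped, fields holding more than one value in any record are set aside for review, and everything else is kept:

```python
def is_empty(value):
    """True for None and for empty strings or containers."""
    if value is None:
        return True
    if isinstance(value, (str, list, tuple, set)):
        return len(value) == 0
    return False


def split_fields(records):
    """Partition field names into (keep, review).

    records: list of dicts mapping a field name to its value
    (assumed shape, one dict per sample).
    """
    keep, review = [], []
    fields = sorted({name for rec in records for name in rec})
    for name in fields:
        values = [rec.get(name) for rec in records]
        if all(is_empty(v) for v in values):
            continue  # no data in any record: drop the field
        if any(isinstance(v, (list, tuple, set)) and len(v) > 1 for v in values):
            review.append(name)  # multiple values somewhere: flag for manual review
        else:
            keep.append(name)
    return keep, review
```

The `review` list is what would be handed over for manual inspection; only `keep` would go into the first-pass matrix.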
Added RNASeq data.
Clinical API endpoints to keep: https://docs.google.com/spreadsheets/d/11JcG3j0-RzT3PgyQz2XiEKeTqBSs_nnAdOwYnZRmTDk
Noticed that some cases.samples.submitter_id values are empty. So should I remove those or set something else (cases.case_id) as an index?
Added phenotype matrix generation in https://github.com/ucscXena/xena-GDC-ETL/pull/34.
I'll check out cases.samples.submitter_id to see if we should keep it. It's not a particularly important field as far as clinical data goes; it's more for folks looking for batch effects among the submitters.
cases.samples.submitter_id is the sample ID we use for all matrices in TCGA. To my understanding, that's the most important info. What is the sample ID for CPTAC? I couldn't find any sample data on the hub.
Oh, I see! That's what you were saying @ayan-b - it is the index. Sorry for not understanding
@ayan-b can you provide a tsv file with cases.samples.submitter_id and cases.case_id for all samples in the CPTAC cohort? We'll need to look into this more closely. Also, is there a UUID associated with the samples too?
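A rough sketch of how such a TSV could be produced from a `cases` response. It assumes the hits have the shape `{"case_id": ..., "samples": [{"submitter_id": ..., "sample_id": ...}]}` and that `sample_id` is the sample UUID; both are assumptions here, and `cases_to_tsv` is a hypothetical helper, not part of xena-GDC-ETL:

```python
import csv
import io


def cases_to_tsv(hits):
    """Return TSV text with one row per (case, sample) pair."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([
        "cases.case_id",
        "cases.samples.submitter_id",
        "cases.samples.sample_id",
    ])
    for case in hits:
        # Keep cases without samples as a row with blank sample columns,
        # so missing submitter IDs stay visible instead of disappearing.
        samples = case.get("samples") or [{}]
        for sample in samples:
            writer.writerow([
                case.get("case_id", ""),
                sample.get("submitter_id", ""),
                sample.get("sample_id", ""),
            ])
    return out.getvalue()
```

Blank cells in the second column of the output would then point directly at the cases worth checking.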
Uploaded.
Also, as a note I'm seeing 322 cases on the GDC portal:
And 469 on ours:
This should be the link for all cases in CPTAC: https://api.gdc.cancer.gov/cases/?fields=samples.submitter_id,samples.sample_id&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.program.name%22%2C%22value%22%3A%5B%22CPTAC%22%5D%7D%7D%5D%7D&size=333&pretty=true
Every sample in it has both submitter_id and sample_id.
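For readability, the percent-encoded `filters` parameter in that link can be built and decoded with the standard library. This is only a sketch: the helper names are made up for illustration, and the query parameters are simply the ones that appear in the URL above.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def program_filter(program="CPTAC"):
    """Filter selecting all cases whose program name matches `program`."""
    return {
        "op": "and",
        "content": [{
            "op": "in",
            "content": {
                "field": "cases.project.program.name",
                "value": [program],
            },
        }],
    }


def build_cases_url(program="CPTAC", size=333):
    """Build a cases query URL with the same parameters as the link above."""
    params = {
        "fields": "samples.submitter_id,samples.sample_id",
        "filters": json.dumps(program_filter(program), separators=(",", ":")),
        "size": size,
        "pretty": "true",
    }
    return CASES_ENDPOINT + "?" + urlencode(params)


def decode_filters(url):
    """Return the `filters` query parameter of a URL as a Python dict."""
    query = parse_qs(urlsplit(url).query)
    return json.loads(query["filters"][0])
```

Calling `decode_filters` on the link above shows a single `in` condition on `cases.project.program.name` with the value `CPTAC`.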
Against TCGA-GBM?
Sorry, wrong screenshot. I edited my comment with the correct screenshot. I am also attaching it here
Doesn't that number mean 469 samples from 322 cases? It's similar for TCGA-BRCA, where we have n=1,217 for HTSeq - Counts, corresponding to 1,092 cases on GDC.
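The samples-versus-cases distinction can be made explicit with a small count. This is a sketch under the assumption that each case hit carries a `case_id` and an optional `samples` list; `count_cases_and_samples` is a hypothetical name:

```python
def count_cases_and_samples(hits):
    """Return (number of distinct cases, total number of samples)."""
    case_ids = {case["case_id"] for case in hits}
    n_samples = sum(len(case.get("samples") or []) for case in hits)
    return len(case_ids), n_samples
```

A case with a tumor sample and a normal sample contributes one case and two samples, which is how a sample count can exceed the case count shown on the portal.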
True. My comment was not helpful information for debugging why cases.samples.submitter_id is blank for so many cases.
I have been googling around to try to figure out why cases.samples.submitter_id is blank for so many cases. I'll report back when I know more, though I'm not finding much information about this. It is likely we will need to ask the GDC. I do not want to use another column, like cases.demographic.submitter_id or cases.case_id, until we understand why this is happening.
why cases.samples.submitter_id is blank for so many cases
Are you talking about the missing cases.samples.submitter_id in the sample TSV? I didn't check all of them, but some might be because the row (which is a file rather than a sample) is associated with more than one sample:
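One way to test that hypothesis, as a sketch: scan the file records and list those linked to more than one distinct sample. The nested shape `{"file_id": ..., "cases": [{"samples": [{"submitter_id": ...}]}]}` is an assumption about the file-level response, and `files_with_multiple_samples` is a hypothetical helper:

```python
def files_with_multiple_samples(file_hits):
    """Map file_id -> sorted sample submitter IDs, for files with 2+ samples."""
    flagged = {}
    for file_hit in file_hits:
        # Collect every distinct, non-empty sample submitter ID for this file.
        ids = sorted({
            sample["submitter_id"]
            for case in file_hit.get("cases", [])
            for sample in case.get("samples", [])
            if sample.get("submitter_id")
        })
        if len(ids) > 1:
            flagged[file_hit.get("file_id")] = ids
    return flagged
```

If the blank rows in the TSV line up with the files this returns, the cause would be in how multi-sample files are flattened rather than in the GDC data itself.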
We need to look into both the code and the data. I need some time later to check that. A quick get from me is:
Should be just RNAseq data and clinical data from the API (no xml files)