Closed by maryjgoldman 5 years ago
Overall, filtering clinical data against the genomic data we actually get is certainly the most solid approach. But it could be tedious, since it has to wait until all of the genomic data is ready. If we assume that every "Sequencing Reads" file will be analyzed and released as open genomic data we could have, we can potentially use this link to find all sample IDs associated with "genomic data" by our definition. We could then use this list of IDs to filter the clinical data after this: https://github.com/yunhailuo/xena-GDC-ETL/blob/master/xena_gdc_etl/xena_dataset.py#L1612-L1613 (Honestly, the separation between GDC and Xena is not quite clear here; this is a Xena-specific decision, but the field specifications in get_samples_clinical are also somewhat Xena-specific).
Potential dummy code could be:
sequenced_samples = gdc.search(
    endpoint='files',
    in_filter={
        'cases.project.project_id': self.project,
        'data_category': 'Sequencing Reads',
    },
    field=['cases.samples.submitter_id']
)['cases.samples.submitter_id'].tolist()
api_clin = api_clin[api_clin['submitter_id.samples'].isin(sequenced_samples)]
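To make the idea concrete without touching the real API, here is a minimal standard-library sketch of the same filtering step. Everything in it is an assumption for illustration: the nested shape of `file_hits`, the made-up `PROJ-...` sample IDs, and the helper name `sequenced_sample_ids` are hypothetical and are not taken from GDC or from xena-GDC-ETL.

```python
# Hypothetical stand-ins for "files" search hits; the real response
# shape may differ. IDs are made up for illustration.
file_hits = [
    {"cases": [{"samples": [{"submitter_id": "PROJ-0001-01A"}]}]},
    {"cases": [{"samples": [{"submitter_id": "PROJ-0002-01A"},
                            {"submitter_id": "PROJ-0002-11A"}]}]},
]

# Hypothetical clinical rows keyed the same way as api_clin above.
clinical_rows = [
    {"submitter_id.samples": "PROJ-0001-01A"},
    {"submitter_id.samples": "PROJ-0001-01Z"},  # no sequencing reads
    {"submitter_id.samples": "PROJ-0002-11A"},
]


def sequenced_sample_ids(hits):
    """Collect every sample submitter ID that appears in the file hits."""
    ids = set()
    for hit in hits:
        for case in hit.get("cases", []):
            for sample in case.get("samples", []):
                ids.add(sample["submitter_id"])
    return ids


sequenced = sequenced_sample_ids(file_hits)

# Keep only clinical rows whose sample has at least one sequencing file.
kept = [row for row in clinical_rows
        if row["submitter_id.samples"] in sequenced]
print([row["submitter_id.samples"] for row in kept])
```

Using a set for the ID list keeps the membership test cheap even when a project has many files per sample.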
As a note, this is also a problem for the TARGET-ALL-P3 STAR counts data.
Here are all sample_type values for the TARGET-ALL-P3 STAR counts data: https://api.gdc.cancer.gov/cases?facets=samples.sample_type&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.project_id%22%2C%22value%22%3A%5B%22TARGET-ALL-P3%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.access%22%2C%22value%22%3A%5B%22open%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.analysis.workflow_type%22%2C%22value%22%3A%5B%22STAR%20-%20Counts%22%5D%7D%7D%5D%7D&pretty=true
aggregations: {
  samples.sample_type: {
    buckets: [
      {
        key: "Primary Blood Derived Cancer - Bone Marrow",
        doc_count: 93
      },
      {
        key: "Bone Marrow Normal",
        doc_count: 59
      },
      {
        key: "Recurrent Blood Derived Cancer - Bone Marrow",
        doc_count: 15
      },
      {
        key: "Primary Blood Derived Cancer - Peripheral Blood",
        doc_count: 7
      },
      {
        key: "Fibroblasts from Bone Marrow Normal",
        doc_count: 7
      },
      {
        key: "Blood Derived Normal",
        doc_count: 4
      },
      {
        key: "Recurrent Blood Derived Cancer - Peripheral Blood",
        doc_count: 1
      }
    ]
  }
}
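For anyone who wants to rerun this facet query for another project or workflow, the URL above can be assembled rather than hand-encoded. This is only a sketch using the Python standard library: the field names and values are copied from the URL in this comment, while the helper name `in_clause` is hypothetical. It builds the URL and prints it without sending a request.

```python
import json
from urllib.parse import quote, urlencode


def in_clause(field, values):
    """Build one GDC-style "in" filter clause (hypothetical helper)."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


# Same three conditions as in the URL above, combined with "and".
filters = {
    "op": "and",
    "content": [
        in_clause("cases.project.project_id", ["TARGET-ALL-P3"]),
        in_clause("files.access", ["open"]),
        in_clause("files.analysis.workflow_type", ["STAR - Counts"]),
    ],
}

params = {
    "facets": "samples.sample_type",
    # Compact separators keep the encoded filter free of extra spaces.
    "filters": json.dumps(filters, separators=(",", ":")),
    "pretty": "true",
}

# quote_via=quote encodes spaces as %20 instead of "+".
url = "https://api.gdc.cancer.gov/cases?" + urlencode(params, quote_via=quote)
print(url)
```

Swapping the project ID or the workflow type in `filters` should give the equivalent query for other datasets, assuming the same field names apply there.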
What to keep and/or what to delete?
I need to see the TARGET-ALL-P3 phenotype data before I will know. I will create a separate issue for this.
EDIT: the issue is here: #69
@maryjgoldman Updated data in the hub.
These are samples that have slide image data and phenotype data but no genomic data. Their sample IDs tend to end in 'Z', rather than 'A' or 'B'.
Samples that need to be excluded are in a black rectangle
tooManySamples.txt Bookmark to see this for BRCA
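A quick way to list candidates for exclusion based on the suffix pattern noted above might look like the sketch below. It is a heuristic only, since these samples merely "tend to" end in 'Z'; filtering against the actual genomic sample list remains the safer check. The sample IDs and the function name `split_by_suffix` are made up for illustration.

```python
def split_by_suffix(sample_ids, genomic_suffixes=("A", "B")):
    """Split IDs into those ending in a genomic-looking suffix and the rest.

    Heuristic only: a 'Z' suffix suggests, but does not prove, that a
    sample has slide image data and no genomic data.
    """
    keep, review = [], []
    for sample_id in sample_ids:
        if sample_id.endswith(tuple(genomic_suffixes)):
            keep.append(sample_id)
        else:
            review.append(sample_id)
    return keep, review


# Hypothetical sample IDs, not real identifiers.
ids = ["PROJ-0001-01A", "PROJ-0001-01Z", "PROJ-0002-01B", "PROJ-0003-01Z"]
keep, review = split_by_suffix(ids)
print("keep:", keep)
print("review for exclusion:", review)
```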