Samples with phenotype data but no genomic data are still being included

maryjgoldman commented 5 years ago

The samples in question have slide image data and phenotype data but no genomic data. Their sample IDs tend to end in 'Z' rather than 'A' or 'B'.

[Screenshot: Screen Shot 2019-07-01 at 4 25 04 PM] Samples that need to be excluded are in a black rectangle.

[Attachment: tooManySamples.txt] Bookmark to see this for BRCA.

yunhailuo commented 5 years ago

Overall, filtering the clinical data against the genomic data we actually get is certainly the most solid approach. But it could be tedious, since it has to wait until all the genomic data are ready. If we assume that every "Sequencing Reads" file will eventually be analyzed and released as open genomic data, we can potentially use this link to find all sample IDs associated with "genomic data" under our definition. We could then use this list of IDs to filter the clinical data after this point: https://github.com/yunhailuo/xena-GDC-ETL/blob/master/xena_gdc_etl/xena_dataset.py#L1612-L1613 (Honestly, the separation between GDC and Xena is not quite clear here: this is a Xena-specific decision, but the field specifications in get_samples_clinical are also somewhat Xena-specific.)

A rough sketch of the code could be:

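# Collect the submitter IDs of all samples that have sequencing-read files
# in this project.
sequenced_samples = gdc.search(
    endpoint='files',
    in_filter={
        'cases.project.project_id': self.project,
        'data_category': 'Sequencing Reads',
    },
    fields=['cases.samples.submitter_id'],
)['cases.samples.submitter_id'].tolist()
# Keep only the clinical rows whose sample has sequencing data.
api_clin = api_clin[api_clin['submitter_id.samples'].isin(sequenced_samples)]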
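
To make the effect of that filter concrete, here is a minimal, dependency-free sketch of the same membership test on toy data. It stands in for the pandas `isin` step and is not the ETL code itself: the sample IDs are made up for illustration, and `filter_clinical` is a hypothetical helper name, not a function in xena-GDC-ETL.

```python
# Toy stand-in for the clinical table; these IDs are invented, not real GDC samples.
clinical_rows = [
    {"submitter_id.samples": "TCGA-XX-0001-01A", "sample_type": "Primary Tumor"},
    {"submitter_id.samples": "TCGA-XX-0001-01Z", "sample_type": "Primary Tumor"},
    {"submitter_id.samples": "TCGA-XX-0002-11B", "sample_type": "Solid Tissue Normal"},
]

# Toy stand-in for the IDs returned by the "Sequencing Reads" file search.
# Duplicates are expected because one sample can have several files.
sequenced_samples = ["TCGA-XX-0001-01A", "TCGA-XX-0002-11B", "TCGA-XX-0001-01A"]


def filter_clinical(rows, sequenced_ids, id_field="submitter_id.samples"):
    """Split rows into (kept, dropped) by membership of their ID in sequenced_ids."""
    sequenced = set(sequenced_ids)  # set lookup; also removes duplicate IDs
    kept = [row for row in rows if row[id_field] in sequenced]
    dropped = [row for row in rows if row[id_field] not in sequenced]
    return kept, dropped


kept, dropped = filter_clinical(clinical_rows, sequenced_samples)
kept_ids = [row["submitter_id.samples"] for row in kept]
dropped_ids = [row["submitter_id.samples"] for row in dropped]
```

Returning the dropped rows alongside the kept ones makes it easy to eyeball whether the excluded samples are the 'Z'-suffixed ones described above before committing to the filter.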

maryjgoldman commented 5 years ago

Note that this is also a problem for the TARGET-ALL-P3 STAR counts data.

yunhailuo commented 5 years ago

Here are all the sample_type values for the TARGET-ALL-P3 STAR counts data: https://api.gdc.cancer.gov/cases?facets=samples.sample_type&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.project_id%22%2C%22value%22%3A%5B%22TARGET-ALL-P3%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.access%22%2C%22value%22%3A%5B%22open%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.analysis.workflow_type%22%2C%22value%22%3A%5B%22STAR%20-%20Counts%22%5D%7D%7D%5D%7D&pretty=true

aggregations: {
    samples.sample_type: {
        buckets: [
            {
                key: "Primary Blood Derived Cancer - Bone Marrow",
                doc_count: 93
            },
            {
                key: "Bone Marrow Normal",
                doc_count: 59
            },
            {
                key: "Recurrent Blood Derived Cancer - Bone Marrow",
                doc_count: 15
            },
            {
                key: "Primary Blood Derived Cancer - Peripheral Blood",
                doc_count: 7
            },
            {
                key: "Fibroblasts from Bone Marrow Normal",
                doc_count: 7
            },
            {
                key: "Blood Derived Normal",
                doc_count: 4
            },
            {
                key: "Recurrent Blood Derived Cancer - Peripheral Blood",
                doc_count: 1
            }
        ]
    }
}

Which of these should we keep, and which should we delete?
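
For anyone who wants to rerun the facet query for another project or workflow, here is a small sketch that rebuilds the URL above from a plain filter description and tallies the returned buckets. It only constructs and decodes the URL and makes no network call. The helper names (`in_clause`, `facet_url`, `decode_filters`, `bucket_counts`) are hypothetical, and `bucket_counts` assumes it is handed the `aggregations` object exactly as excerpted above.

```python
import json
from urllib.parse import parse_qs, quote, urlencode, urlsplit

GDC_CASES = "https://api.gdc.cancer.gov/cases"


def in_clause(field, values):
    """One 'in' clause of a GDC filter, shaped like the clauses in the URL above."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def facet_url(project_id, workflow_type, facet="samples.sample_type"):
    """Build a /cases facet URL restricted to open files of one workflow type."""
    filters = {
        "op": "and",
        "content": [
            in_clause("cases.project.project_id", [project_id]),
            in_clause("files.access", ["open"]),
            in_clause("files.analysis.workflow_type", [workflow_type]),
        ],
    }
    query = {
        "facets": facet,
        "filters": json.dumps(filters, separators=(",", ":")),
        "pretty": "true",
    }
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return GDC_CASES + "?" + urlencode(query, quote_via=quote)


def decode_filters(url):
    """Recover the filter object from a URL, useful for checking what was sent."""
    return json.loads(parse_qs(urlsplit(url).query)["filters"][0])


def bucket_counts(aggregations, facet="samples.sample_type"):
    """Map each bucket key to its doc_count."""
    return {b["key"]: b["doc_count"] for b in aggregations[facet]["buckets"]}


url = facet_url("TARGET-ALL-P3", "STAR - Counts")

# Buckets copied from the response excerpt above.
pairs = [
    ("Primary Blood Derived Cancer - Bone Marrow", 93),
    ("Bone Marrow Normal", 59),
    ("Recurrent Blood Derived Cancer - Bone Marrow", 15),
    ("Primary Blood Derived Cancer - Peripheral Blood", 7),
    ("Fibroblasts from Bone Marrow Normal", 7),
    ("Blood Derived Normal", 4),
    ("Recurrent Blood Derived Cancer - Peripheral Blood", 1),
]
aggregations = {
    "samples.sample_type": {
        "buckets": [{"key": k, "doc_count": n} for k, n in pairs]
    }
}
counts = bucket_counts(aggregations)
total = sum(counts.values())
```

Keeping the filter as a Python structure and encoding it at the last step avoids hand-editing the percent-encoded string when the project or workflow changes.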

maryjgoldman commented 5 years ago

I need to see the TARGET-ALL-P3 phenotype data before I will know. I will create a separate issue for this.

EDIT: the issue is here: #69

ayan-b commented 5 years ago

@maryjgoldman Updated data in the hub.

ucscXena / xena-GDC-ETL #63