justaddcoffee closed 3 months ago
Looks like GDC has an API that we can use to pull in the data from directly: https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/#getting-started
We don't need to use the CDA library to bring this data in. We can write a python client/script that pulls data by interacting with the above API.
I can work on modifying the code here: https://github.com/monarch-initiative/oncoexporter/blob/develop/src/oncoexporter/cda/cda_mutation_factory.py to bring in mutation data directly from GDC.
great @sujaypatil96!
My recollection from what Brian & Matt said at the hackathon was that the CRDC-H model did not have mutation information included in it yet. If you look at the Appendix A for the GDC API, I do not see any of the fields that are available in the CDA mutation endpoint.
My view is that we do not need to be restricted to using CDA if it doesn't do what we want, but that we should not recreate existing capabilities unnecessarily. The CDA has some process for producing the mutation table, and it makes sense to me to at least try to understand how they did that before trying to build our own from scratch.
However, if Sujay can figure out an easy way to get the mutation data directly from GDC, I certainly have no objection to doing things that way.
Let's consider a subject/case (a8b1f6e7-2bcf-460d-b1c6-1792a9801119) browsable on GDC: https://portal.gdc.cancer.gov/cases/a8b1f6e7-2bcf-460d-b1c6-1792a9801119
My understanding is that the mutation information we want to pull for cases from GDC is what's shown in the "MOST FREQUENT SOMATIC MUTATIONS" table on the above webpage. To obtain this data we would need to query the SSM (Simple Somatic Mutation) endpoint, which GDC exposes at /ssms.
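A minimal sketch of what such a query could look like, using only the Python standard library. The filter field cases.case_id and the names in DEFAULT_FIELDS are assumptions that should be checked against the GDC field listing; the helper names are made up for illustration and are not part of oncoexporter.

```python
import json
import urllib.parse
import urllib.request

GDC_SSMS_URL = "https://api.gdc.cancer.gov/ssms"

# Assumption: these field names should be verified against the GDC docs.
DEFAULT_FIELDS = (
    "ssm_id",
    "genomic_dna_change",
    "chromosome",
    "start_position",
    "reference_allele",
    "tumor_allele",
)


def build_ssms_params(case_id, fields=DEFAULT_FIELDS, size=100):
    """Build query parameters that restrict /ssms to a single case."""
    filters = {
        "op": "in",
        # Assumption: "cases.case_id" is the filter field for the case UUID.
        "content": {"field": "cases.case_id", "value": [case_id]},
    }
    return {
        "filters": json.dumps(filters),
        "fields": ",".join(fields),
        "format": "JSON",
        "size": str(size),
    }


def fetch_ssms(case_id, timeout=30):
    """Send the request and return the list of hits (needs network access)."""
    query = urllib.parse.urlencode(build_ssms_params(case_id))
    with urllib.request.urlopen(GDC_SSMS_URL + "?" + query, timeout=timeout) as response:
        payload = json.load(response)
    # GDC wraps results as {"data": {"hits": [...], "pagination": {...}}}.
    return payload["data"]["hits"]
```

Calling fetch_ssms("a8b1f6e7-2bcf-460d-b1c6-1792a9801119") would then return the raw mutation records for the case above, if the assumed field names hold.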
thanks all
I'd suggest that Sujay does a first pass at collecting MOST FREQUENT SOMATIC MUTATIONS using the GDC API, and then we can have a closer look. Does that sound reasonable?
Ahh, my problem was I did not look at the "Data Analysis" page, which describes the mutation endpoints: https://docs.gdc.cancer.gov/API/Users_Guide/Data_Analysis/
I've experimented with pulling mutation data from GDC directly here: https://gist.github.com/sujaypatil96/5659f766abeed7adf52fb6ce771e5552
I was looking at the list of mutation fields to pull in (from CDA) in the oncoexporter code and saw this: https://github.com/monarch-initiative/oncoexporter/blob/develop/src/oncoexporter/cda/cda_mutation_factory.py#L18-L50
Based on that list (and the full list of fields that we can pull mutation information for here), I wrote a quick Python script to demo/illustrate how we can use the GDC API (specifically the /ssms endpoint) to pull in mutation information for a specific case (case_id).
There's more information available at the /ssm_occurrences endpoint, which we can also retrieve. See https://docs.gdc.cancer.gov/API/Users_Guide/Data_Analysis/ for examples.
Thanks @sujaypatil96! We'll take a look hopefully today
cc: @pnrobinson
@ielis could you have a go at incorporating Sujay's code into oncoexporter?
I think we just need to put Sujay's code in cda_mutation_factory and also write a bit of code to translate the mutation JSON into phenopacket items - glad to hack on this with you
@justaddcoffee @sujaypatil96 I added a draft PR with a class that builds heavily on @sujaypatil96's gist. The class can fetch variants for a subject ID.
The class is, however, not hooked up to the rest of the framework yet. Unfortunately, I cannot work on that this week, I'm taking 3 days off starting with Wed.
Do you guys think you can look into this? Probably use it instead of the CdaMutationFactory in CdaTableImporter, plus try to fill the VariationDescriptor with missing fields, if possible (e.g. tumor/normal depths, gene, ...)?
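To make the mapping concrete, a rough sketch of turning one GDC mutation record into a VariationDescriptor-shaped dict (Phenopacket JSON field names). The GDC keys used here (ssm_id, genomic_dna_change, and gene info nested under consequence, transcript, gene) are assumptions about the response layout, the function name is made up, and tumor/normal depths are left out entirely. This is not the code from the draft PR.

```python
def ssm_to_variation_descriptor(ssm: dict) -> dict:
    """Map a GDC SSM hit to a plain dict shaped like a VariationDescriptor."""
    descriptor = {
        "id": ssm["ssm_id"],
        # Assumption: genomic_dna_change holds a readable change string.
        "label": ssm.get("genomic_dna_change", ""),
        "moleculeContext": "genomic",
    }
    # Use the first consequence that names a gene (assumed nesting).
    for consequence in ssm.get("consequence", []):
        gene = consequence.get("transcript", {}).get("gene", {})
        if gene.get("symbol"):
            descriptor["geneContext"] = {
                "valueId": gene.get("gene_id", ""),
                "symbol": gene["symbol"],
            }
            break
    return descriptor
```

Records without any gene annotation simply come back without a geneContext entry, so the caller can decide how to handle them.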
great @ielis ! thanks
@sujaypatil96 do you have any time this week to hook up Daniel's code into CdaTableImporter in place of the CdaMutationFactory, and also try to get gene info from the GDC API?
@ielis thanks for working on https://github.com/monarch-initiative/oncoexporter/pull/81 it looks really good!
@justaddcoffee i'm mostly working on some high priority NMDC tasks for the rest of the week, but if I get done with them early I can take a look at hooking it up with the rest of the framework.
@sujaypatil96 okay, no worries at all - NMDC I think should take precedence
@justaddcoffee happy to take a look at hooking up the code from @ielis in https://github.com/monarch-initiative/oncoexporter/pull/81 with the rest of the framework tomorrow if no one else is working on it.
Okay great @sujaypatil96
I don't think anyone else is currently working on this
Sounds good! I'll work on it sometime today/tomorrow.
We'd like to pull mutation data from GDC directly if possible
@sujaypatil96 pointed us to some code that might help here - see cell 16: https://github.com/cancerDHC/example-data/tree/main/cptac2-subject-09CO022
This code might also be useful for extracting things from what the code above gets from GDC: https://github.com/cancerDHC/example-data/tree/main
cc @sujaypatil96 @msierk @ielis