monarch-initiative / oncoexporter

Cancer data to GA4GH phenopacket
https://monarch-initiative.github.io/oncoexporter
MIT License

pull mutation data from GDC #80

justaddcoffee closed 3 months ago

justaddcoffee commented 4 months ago

We'd like to pull mutation data from GDC directly if possible

@sujaypatil96 pointed us to some code that might help here - see cell 16: https://github.com/cancerDHC/example-data/tree/main/cptac2-subject-09CO022

This code might also be useful for extracting things from what the code above gets from GDC: https://github.com/cancerDHC/example-data/tree/main

cc @sujaypatil96 @msierk @ielis

sujaypatil96 commented 4 months ago

Looks like GDC has an API that we can use to pull in the data from directly: https://docs.gdc.cancer.gov/API/Users_Guide/Getting_Started/#getting-started

We don't need to use the CDA library to bring this data in. We can write a Python client/script that pulls the data by interacting with the above API.
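A minimal sketch of what such a client could look like, using only the standard library. The base URL is the public GDC API host; the helper names `build_url` and `gdc_get` are hypothetical and not part of oncoexporter or of any GDC library.

```python
import json
import urllib.parse
import urllib.request

GDC_API = "https://api.gdc.cancer.gov"  # public GDC API base URL


def build_url(endpoint: str, params: dict) -> str:
    """Build a GET URL for a GDC endpoint such as 'ssms' or 'cases'.

    `build_url` is a hypothetical helper written for this sketch.
    """
    query = urllib.parse.urlencode(params)
    return f"{GDC_API}/{endpoint.strip('/')}?{query}"


def gdc_get(endpoint: str, params: dict, timeout: float = 30.0) -> dict:
    """Send the GET request and decode the JSON response body."""
    url = build_url(endpoint, params)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `gdc_get("ssms", {"size": 1, "format": "json"})` would then return the decoded response as a Python dict; nothing is sent over the network until `gdc_get` is called.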

sujaypatil96 commented 4 months ago

I can work on modifying the code here: https://github.com/monarch-initiative/oncoexporter/blob/develop/src/oncoexporter/cda/cda_mutation_factory.py to bring in mutation data directly from GDC.

justaddcoffee commented 4 months ago

great @sujaypatil96!

msierk commented 4 months ago

My recollection from what Brian & Matt said at the hackathon was that the CRDC-H model did not have mutation information included in it yet. If you look at the Appendix A for the GDC API, I do not see any of the fields that are available in the CDA mutation endpoint.

My view is that we do not need to be restricted to using CDA if it doesn't do what we want, but that we should not recreate existing capabilities unnecessarily. The CDA has some process for producing the mutation table, and it makes sense to me to at least try to understand how they did that before trying to build our own from scratch.

However, if Sujay can figure out an easy way to get the mutation data directly from GDC, I certainly don't have any objections to doing things that way.

sujaypatil96 commented 4 months ago

Let's consider a subject/case (a8b1f6e7-2bcf-460d-b1c6-1792a9801119) browsable on GDC: https://portal.gdc.cancer.gov/cases/a8b1f6e7-2bcf-460d-b1c6-1792a9801119

My understanding is that the mutation information we want to pull for cases from GDC is what's shown in the "MOST FREQUENT SOMATIC MUTATIONS" table on the above page. To obtain this data, we would need to query the SSM (Simple Somatic Mutation) endpoint, /ssms.
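A rough sketch of the query parameters for such a request, using the GDC filter syntax (`op` / `content` / `field` / `value`). The filter field `cases.case_id` and the entries in `SSM_FIELDS` are assumptions made for illustration and should be checked against the GDC field listing for /ssms; `ssm_params_for_case` is a hypothetical helper name.

```python
import json

# Assumed field names for the /ssms endpoint; verify against the GDC docs.
SSM_FIELDS = [
    "ssm_id",
    "genomic_dna_change",
    "chromosome",
    "start_position",
    "reference_allele",
    "tumor_allele",
    "ncbi_build",
]


def ssm_params_for_case(case_id: str, size: int = 100) -> dict:
    """Build query parameters that restrict /ssms results to one case."""
    filters = {
        "op": "in",
        "content": {
            "field": "cases.case_id",  # assumed filter field name
            "value": [case_id],
        },
    }
    return {
        "filters": json.dumps(filters),   # GDC expects filters as a JSON string
        "fields": ",".join(SSM_FIELDS),   # comma-separated list of fields
        "format": "json",
        "size": str(size),
    }


params = ssm_params_for_case("a8b1f6e7-2bcf-460d-b1c6-1792a9801119")
```

The resulting `params` dict can be URL-encoded and sent as a GET request to the /ssms endpoint.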

justaddcoffee commented 4 months ago

thanks all

I'd suggest that Sujay does a first pass at collecting MOST FREQUENT SOMATIC MUTATIONS using the GDC API, and then we can have a closer look. Does that sound reasonable?

msierk commented 4 months ago

Ahh, my problem was I did not look at the "Data Analysis" page, which describes the mutation endpoints: https://docs.gdc.cancer.gov/API/Users_Guide/Data_Analysis/

sujaypatil96 commented 4 months ago

I've experimented with pulling mutation data from GDC directly here: https://gist.github.com/sujaypatil96/5659f766abeed7adf52fb6ce771e5552

sujaypatil96 commented 4 months ago

I was looking at the list of mutation fields to pull in (from CDA) in the oncoexporter code and saw this: https://github.com/monarch-initiative/oncoexporter/blob/develop/src/oncoexporter/cda/cda_mutation_factory.py#L18-L50

Based on that list (and the full list of fields that we can pull mutation information for here), I wrote a quick Python script to demo/illustrate how we can use the GDC API (specifically the /ssms endpoint) to pull in mutation information for a specific case (case_id).

sujaypatil96 commented 4 months ago

There's more information that we can retrieve from the /ssm_occurrences endpoint. See https://docs.gdc.cancer.gov/API/Users_Guide/Data_Analysis/ for examples.
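Since a case can have many occurrences, results may need to be paged through. A minimal sketch, assuming the usual GDC response layout (`data.hits` plus `data.pagination.total`) and an offset passed via the `from` parameter; `iter_hits` and the `fetch_page` callback are hypothetical names, and the callback is where the actual HTTP request would go.

```python
from typing import Callable, Dict, Iterator


def iter_hits(
    fetch_page: Callable[[int, int], Dict], page_size: int = 100
) -> Iterator[Dict]:
    """Yield every hit, requesting pages until `total` is reached.

    `fetch_page(offset, size)` must return a decoded GDC response, e.g. the
    result of a request with the `from` and `size` parameters set.
    """
    offset = 0
    while True:
        data = fetch_page(offset, page_size)["data"]
        hits = data["hits"]
        yield from hits
        offset += len(hits)
        # Stop on an empty page or once all records have been seen.
        if not hits or offset >= data["pagination"]["total"]:
            break
```

Keeping the HTTP call behind a callback means the paging logic can be exercised with canned responses, without hitting the GDC servers.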

justaddcoffee commented 4 months ago

Thanks @sujaypatil96! We'll take a look hopefully today

justaddcoffee commented 4 months ago

cc: @pnrobinson

justaddcoffee commented 4 months ago

@ielis could you have a go at incorporating Sujay's code into oncoexporter?

I think we just need to put Sujay's code in cda_mutation_factory and also write a bit of code to translate the mutation JSON into phenopacket items - glad to hack on this with you
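As a rough idea of the translation step, here is a sketch that maps one mutation record to a plain dict shaped like the JSON form of a Phenopacket v2 VariationDescriptor. The input keys (`ssm_id`, `chromosome`, and so on) are assumed GDC field names, the function name is hypothetical, and the real code would presumably build the protobuf messages instead of dicts. Indels are rejected because GDC marks the missing allele with "-", which cannot go into a VCF record without re-anchoring.

```python
def ssm_to_variation_descriptor(hit: dict) -> dict:
    """Map one GDC mutation record to a VariationDescriptor-like dict.

    Only simple substitutions are handled in this sketch.
    """
    ref = hit["reference_allele"]
    alt = hit["tumor_allele"]
    if "-" in (ref, alt):
        # GDC uses "-" for the empty allele of an insertion or deletion.
        raise ValueError("indels need VCF-style re-anchoring; not handled here")

    descriptor = {
        "id": hit["ssm_id"],
        "vcfRecord": {
            "genomeAssembly": hit["ncbi_build"],
            "chrom": hit["chromosome"],
            "pos": hit["start_position"],
            "ref": ref,
            "alt": alt,
        },
        "moleculeContext": "genomic",
    }
    # Keep GDC's own change string as a human-readable label, if present.
    label = hit.get("genomic_dna_change")
    if label:
        descriptor["label"] = label
    return descriptor
```

For example, a record with `chromosome` "chr7", `start_position` 140753336, `reference_allele` "A" and `tumor_allele` "T" would end up with those values in the `vcfRecord` part of the result.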

ielis commented 4 months ago

@justaddcoffee @sujaypatil96 I added a draft PR with a class that builds heavily on @sujaypatil96's gist. The class can fetch variants for a subject ID.

The class is, however, not hooked up to the rest of the framework yet. Unfortunately, I cannot work on that this week, as I'm taking 3 days off starting Wednesday.

Do you guys think you can look into this? Probably use it instead of the CdaMutationFactory in CdaTableImporter plus try to fill the VariationDescriptor with missing fields, if possible (e.g. tumor/normal depths, gene..)?

justaddcoffee commented 4 months ago

great @ielis ! thanks

@sujaypatil96 do you have any time this week to hook up Daniel's code into CdaTableImporter in place of the CdaMutationFactory, and also try to get gene info from the GDC API?

sujaypatil96 commented 4 months ago

@ielis thanks for working on https://github.com/monarch-initiative/oncoexporter/pull/81 it looks really good!

@justaddcoffee I'm mostly working on some high-priority NMDC tasks for the rest of the week, but if I get done with them early I can take a look at hooking it up with the rest of the framework.

justaddcoffee commented 4 months ago

@sujaypatil96 okay, no worries at all - NMDC I think should take precedence

sujaypatil96 commented 4 months ago

@justaddcoffee happy to take a look at hooking up the code from @ielis in https://github.com/monarch-initiative/oncoexporter/pull/81 with the rest of the framework tomorrow if no one else is working on it.

justaddcoffee commented 4 months ago

Okay great @sujaypatil96

I don't think anyone else is currently working on this

sujaypatil96 commented 3 months ago

Sounds good! I'll work on it sometime today/tomorrow.