monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

add CTD gene-disease (direct) associations #106

Closed nlwashington closed 9 years ago

nlwashington commented 9 years ago

We need the CTD gene-disease associations:

from http://ctdbase.org/downloads/CTD_genes_diseases.tsv.gz

these are relatively straightforward. we only want to pull the direct (not-inferred) associations. each row becomes a G2PAssociation. you'll need to follow the pattern of creating a BNode of "some variant of Gene X" to link to the disease. you can see how that was done in OMIM. (we can consider abstracting this as a method to the G2PAssociation.)

Field structure is: GeneSymbol, GeneID (NCBI Gene identifier), DiseaseName, DiseaseID (MeSH or OMIM identifier), DirectEvidence ('|'-delimited list), InferenceChemicalName, InferenceScore, OmimIDs ('|'-delimited list), PubMedIDs ('|'-delimited list)
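A minimal sketch of reading rows with the field structure above and keeping only the direct associations. The function name `direct_associations` is hypothetical and not part of dipper. It assumes that header/comment lines start with `#` and that inferred rows leave DirectEvidence empty.

```python
import csv
from typing import Iterable, Iterator

# Column order as listed in the field structure above.
FIELDS = [
    "GeneSymbol", "GeneID", "DiseaseName", "DiseaseID", "DirectEvidence",
    "InferenceChemicalName", "InferenceScore", "OmimIDs", "PubMedIDs",
]


def direct_associations(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per row that carries direct evidence (hypothetical helper)."""
    # Assumption: header/comment lines start with '#'; blank lines are ignored.
    data = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.reader(data, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # skip rows that do not match the expected layout
        rec = dict(zip(FIELDS, row))
        # Assumption: inferred associations have an empty DirectEvidence field.
        if rec["DirectEvidence"].strip():
            rec["DirectEvidence"] = rec["DirectEvidence"].split("|")
            yield rec
```

Because the function takes any iterable of lines, it can be fed a file opened with `gzip.open(path, "rt")` and will stream through the file instead of loading it all at once.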

Luckily, they use NCBIGene ids, so that's easy.

I would say that if there is only one OMIM id for the DiseaseID or in the OmimIDs list, use that preferentially over any MeSH id. @cmungall will comment on whether we should include or exclude annotations to MeSH for the first pass due to MONDO restrictions.
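One possible reading of that preference rule, as a sketch. The function name `choose_disease_id` is hypothetical. It assumes DiseaseID arrives as a prefixed id such as `MESH:...` or `OMIM:...`, and that OmimIDs is a '|'-delimited list of bare OMIM numbers. The ids in the usage line are placeholders.

```python
def choose_disease_id(disease_id: str, omim_ids: str) -> str:
    """Prefer a single OMIM id over a MeSH id (hypothetical helper)."""
    # Assumption: DiseaseID is already prefixed, e.g. 'OMIM:...' or 'MESH:...'.
    if disease_id.startswith("OMIM:"):
        return disease_id
    # Assumption: OmimIDs holds bare numbers separated by '|'.
    omims = [x.strip() for x in omim_ids.split("|") if x.strip()]
    if len(omims) == 1:
        return "OMIM:" + omims[0]
    # Zero or several OMIM ids: no unambiguous choice, so keep the original id.
    return disease_id
```

For example, `choose_disease_id("MESH:D000001", "123456")` returns `"OMIM:123456"`, while a row listing two OMIM ids keeps its MeSH id.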

nlwashington commented 9 years ago

these have the following as "evidence". but it's really the relationship:

so, there's actually different kinds of associations to make here; not just "has_phenotype".

so, because we don't know if it is causal, the ones for "marker/mechanism" should probably just have the relationship MONARCH:correlates_with (until @cmungall adds a correlation relationship to RO, and we'll switch out to the proper RO)

and for therapeutic, that's more complicated. let's also leave those out for now...these can be added after we finish the modeling related to variant -> "activity of gene X" -> disease.

bryanlaraway commented 9 years ago

@nlwashington On your last comment, what about rows where direct evidence = 'marker/mechanism|therapeutic'? Leave those out, or include them with MONARCH:correlates_with?

bryanlaraway commented 9 years ago

This is just about wrapped up, but I'm still testing the output. The disease2gene file has 40,148,312 rows. The run completes when I set a row limit, but I haven't had it complete on the full file. I might have to just push it to the Bamboo server and let it run there. Running locally on my laptop, it processes about 555 rows per second, which puts the run time at just over 20 hours. I'm not sure whether it will just die when it goes to write the turtle output; I wouldn't be surprised if writing all the nodes created from these rows exhausts the memory.
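The run-time estimate follows directly from the two figures quoted in this comment, and the arithmetic can be checked in a few lines:

```python
rows = 40_148_312        # row count quoted above
rows_per_second = 555    # local processing rate quoted above

seconds = rows / rows_per_second
hours = seconds / 3600
print(f"{seconds:,.0f} s = {hours:.1f} h")
```

That works out to roughly 72,000 seconds, consistent with "just over 20 hours".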

nlwashington commented 9 years ago

the disease processing should finish... we are only taking those that are direct associations, so there are way less than 40M rows of those. I'll take a look to see if we can speed that up.

and yes, we'd probably use that generic correlates_with for those that are the combo "evidence" because we just don't know what it actually is!
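The rules discussed so far in this thread could be sketched as a small lookup. This is an illustration only: `relationship_for` is a hypothetical name, and `MONARCH:correlates_with` is the placeholder relationship named above, kept in one constant so it is easy to swap.

```python
from typing import Optional

# Placeholder relationship named in this thread; kept in one place so it is easy to swap.
CORRELATES_WITH = "MONARCH:correlates_with"


def relationship_for(direct_evidence: str) -> Optional[str]:
    """Return a relationship id, or None if the row should be skipped (hypothetical helper)."""
    kinds = {k.strip() for k in direct_evidence.split("|") if k.strip()}
    if "marker/mechanism" in kinds:
        # Covers 'marker/mechanism' alone and the 'marker/mechanism|therapeutic' combo.
        return CORRELATES_WITH
    # Rows with no direct evidence (inferred) and therapeutic-only rows are left out for now.
    return None
```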

nlwashington commented 9 years ago

@cmungall do we have a final verdict on the proper relationship to use for something like "correlates with"?

nlwashington commented 9 years ago

ok, so the general pattern here so far is to use an OMIM id for a disease-gene association when an OMIM id is available (and to use it preferentially even if we have a MeSH id, if the MeSH:OMIM mapping is 1:1). since we don't know if the association is actually a marker or an etiological mechanism, we leave this as some kind of correlative relationship. if there is no OMIM id, then we can use the MeSH id. (many of the cancers are annotated to MeSH, as are more atomic phenotypes.) however, there are some oddities in MeSH. For example:

As with OMIM, we create an anonymous variant locus, and link that to the disease/phenotype.

nlwashington commented 9 years ago

we will use the is_marker_for relationship here for the "marker/mechanism" relationship.

nlwashington commented 9 years ago

[screenshot: 2015-06-22 at 1:30:26 pm]

nlwashington commented 9 years ago

suggestions for possible scrubbing, as well as pushing feedback to the source:

Dog Diseases MESH:D004283
Disease Models, Animal MESH:D004195
Genetic Diseases, Inborn MESH:D030342
Genetic Diseases, X-Linked MESH:D040181
Genetic Predisposition to Disease MESH:D020022

nlwashington commented 9 years ago

I've written to CTD personnel to ask for clarification about the odd "diseases".

nlwashington commented 9 years ago

i've temporarily added scrubbing of the above identifiers in commit d82f2d7...no associations will be added to our turtle files for those five ids.
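For illustration, scrubbing like this can be a simple membership check against the five identifiers listed earlier in the thread. This is a sketch, not the code from the commit mentioned above; the names `SCRUBBED_DISEASE_IDS` and `keep_association` are hypothetical.

```python
# The five MeSH ids flagged for scrubbing earlier in this thread.
SCRUBBED_DISEASE_IDS = frozenset({
    "MESH:D004283",  # Dog Diseases
    "MESH:D004195",  # Disease Models, Animal
    "MESH:D030342",  # Genetic Diseases, Inborn
    "MESH:D040181",  # Genetic Diseases, X-Linked
    "MESH:D020022",  # Genetic Predisposition to Disease
})


def keep_association(disease_id: str) -> bool:
    """Return False for associations whose disease id is on the scrub list."""
    return disease_id.strip() not in SCRUBBED_DISEASE_IDS
```

Keeping the ids in one constant makes the scrubbing easy to lift once the source clarifies what these "diseases" mean.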

nlwashington commented 9 years ago

@mbrush please review cmaps and close if satisfied.