monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

add CTD drug-phenotype relationships #37

Closed nlwashington closed 6 years ago

nlwashington commented 9 years ago

Add a new module to bring in CTD (http://ctdbase.org/downloads/): http://ctdbase.org/reports/CTD_chemicals_diseases.tsv.gz. The annotations in this file include both diseases and phenotypes (as they are mixed in MeSH).

at first, let's bring in only the asserted associations (rather than inferred).

this may require a new Drug2Phenotype association class.

this may also require bringing in the chemicals vocab: http://ctdbase.org/reports/CTD_chemicals.tsv.gz and the diseases vocab: http://ctdbase.org/reports/CTD_diseases.tsv.gz
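A minimal sketch of how the asserted rows might be selected from the associations file. It assumes the file is tab-separated, that comment lines start with `#`, and that inferred associations leave the DirectEvidence column empty. The column positions and function names below are assumptions for illustration and should be checked against the header of the downloaded file.

```python
import csv
import gzip

# Assumed column positions; check them against the header of the
# downloaded CTD_chemicals_diseases.tsv.gz before relying on them.
CHEMICAL_ID, DISEASE_ID, DIRECT_EVIDENCE = 1, 4, 5


def asserted_rows(lines):
    """Yield (chemical_id, disease_id, direct_evidence) for asserted rows only."""
    # Drop blank lines and '#' comment lines before handing rows to csv.
    data = (ln for ln in lines if ln.strip() and not ln.startswith('#'))
    for row in csv.reader(data, delimiter='\t', quoting=csv.QUOTE_NONE):
        if len(row) <= DIRECT_EVIDENCE:
            continue  # malformed or truncated row
        evidence = row[DIRECT_EVIDENCE].strip()
        if evidence:  # assumption: inferred rows have an empty evidence column
            yield row[CHEMICAL_ID], row[DISEASE_ID], evidence


def asserted_rows_from_gzip(path):
    """Convenience wrapper that reads the gzipped download directly."""
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        yield from asserted_rows(handle)
```

Keeping the filter separate from the file handling makes it easy to exercise on a few in-memory lines before running it over the full download.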

nlwashington commented 9 years ago

@cmungall, my guess is that we don't have all of the MeSH terms in our existing ontologies, so we would need to bring them in here for the minimum id/label mapping. do we want to deal with integrating the MEDIC hierarchy into our monarch.owl file as a different task?

cmungall commented 9 years ago

Yes, this should be separate from dipper.

We could then either leave it to SciGraph to reconcile, or we could do as a SPARQL pre-processing step

nlwashington commented 9 years ago

ok, will you be adding the medic file as an import into monarch.owl? or into the disease ontology? or how should we specify bringing it in together with its relationships (if desired)? should i make a ticket someplace?

kshefchek commented 9 years ago

So there are three types of direct evidence:

therapeutic - 27471 entries
marker/mechanism - 52861
marker/mechanism|therapeutic - 3286

Definitions:

Therapeutic: A chemical that has a known or potential therapeutic role in a disease (e.g., chemical X is used to treat leukemia).

Marker/mechanism: A chemical that correlates with a disease (e.g., increased abundance in the brain of chemical X correlates with Alzheimer disease) or may play a role in the etiology of a disease (e.g., exposure to chemical X causes lung cancer).

@mbrush @cmungall ideas for appropriate properties? Or should we make new ones?

cmungall commented 9 years ago

Evidence would go in the evidence ontology, but this seems to straddle evidence and the underlying biological relationship, which comes down to activation and inhibition

kshefchek commented 9 years ago

Does ECO define properties?

kshefchek commented 9 years ago

Worth noting, CTD doesn't contain any type of ID to represent these relationships, such as an experiment ID. How would we go about adding evidence codes without this?

cmungall commented 9 years ago

It can always be a blank node, but I think we would prefer to generate an IRI - either UUID or key concatenation
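The two options named here (UUID versus key concatenation) can be sketched as follows. This is only an illustration: the function names, the `MONARCH:` prefix default, and the choice of SHA-256 to shorten the concatenated keys are assumptions, not dipper's actual implementation.

```python
import hashlib
import uuid


def make_id_from_keys(*keys, prefix='MONARCH:'):
    """Key concatenation: the same keys always produce the same identifier."""
    joined = '-'.join(str(k) for k in keys)
    # Hashing keeps the identifier a fixed length; the hash choice is an assumption.
    digest = hashlib.sha256(joined.encode('utf-8')).hexdigest()
    return prefix + digest


def make_random_id(prefix='MONARCH:'):
    """UUID: unique, but a different value on every call."""
    return prefix + uuid.uuid4().hex
```

Key concatenation gives identifiers that are stable across pipeline runs as long as the chosen keys (for example chemical, disease, and evidence type) do not change, whereas a random UUID would mint a new IRI for the same association on every run.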

kshefchek commented 9 years ago

@cmungall would these be assigned to the monarch namespace, for example: http://monarchinitiative.com/experiment/CTD:12345

@nlwashington it looks like I should be using the Assoc class, although it sets the interacting IDs as classes rather than individuals:

    g.add((s, RDF['type'], self.OWLCLASS))
    g.add((o, RDF['type'], self.OWLCLASS))
    g.add((s, p, o))

Is this correct, or do we want to add a separate function for creating individuals when applicable?

nlwashington commented 9 years ago

so, i have a general method in the Source class to make identifiers. we can revisit whether the implementation is the best, but at least the way it's generated is uniform across the pipeline. here, you can decide what bits make up the IRI (keys). you'll see that this is what i've done for the HPOAnnotations, as well as some of the others. since we are making the identifier, it's really only relevant in the monarch system, even though it is about CTD data. so you can specify that in our namespace as either 'MONARCH:' or using the "base" prefix, like ":"

also, i think we only want to initially ingest those relationships that are based on direct evidence, so there would be no inference score in those rows.

as for the triple, it would be between the two classes here. you will follow the pattern like we've done for the other association types:

  1. create the relationships between the disease and drug. drug --(rel)--> disease
  2. then make the reified relationship/association, to attach the drug and disease:
       assoc hasSubject drug
       assoc hasObject disease
       assoc dc[evidence] ECO code
       assoc dc[source] PMIDs
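The two steps above can be sketched with plain tuples standing in for RDF triples. The function name is hypothetical, and the property strings (`hasSubject`, `hasObject`, `dc:evidence`, `dc:source`) simply mirror the shorthand in this comment rather than resolved IRIs; in the pipeline itself these would be added to an rdflib graph.

```python
def build_association(assoc_id, drug, rel, disease, eco_code, pmids):
    """Return the direct triple plus the reified association triples."""
    triples = [
        # Step 1: the direct relationship, drug --(rel)--> disease.
        (drug, rel, disease),
        # Step 2: the reified association that carries evidence and sources.
        (assoc_id, 'hasSubject', drug),
        (assoc_id, 'hasObject', disease),
        (assoc_id, 'dc:evidence', eco_code),
    ]
    # One source triple per supporting publication.
    triples.extend((assoc_id, 'dc:source', pmid) for pmid in pmids)
    return triples
```

The association node is what lets evidence and publications attach to one specific drug-disease statement rather than to the drug or the disease in general.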

now the tricky part is the "rel" between the drug and the disease. i don't know if there's a "treats" relationship that could be used between these entities? and @cmungall it's not as easy as activates/deactivates because there's no way for us to know a priori what the mechanism of therapeutic action is. there is "partner in" (RO:0002461), but that doesn't feel right either.

kshefchek commented 9 years ago

I've contacted CTD to clarify the meaning of marker/mechanism|therapeutic. From what I can gather, in these cases some publications give evidence that a chemical is a marker/mechanism, and some give evidence that it is therapeutic. One issue is that the CTD tables do not specify which references support which evidence, although this information is available on CTD's site. The only way I can figure out how to get this information is to use their batch query service. The issue there is that it is a web form rather than a REST service, there is a 4k ID limit, and we would need to generate this additional information for 22351 Pubmed IDs. I could try to hit the webform with python or we could generate this table manually.
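Whichever route is taken, the IDs have to be split into groups that fit under the batch limit. A minimal sketch, assuming "4k" means 4000 IDs per submission; the function name is hypothetical and nothing here talks to CTD's web form.

```python
def batch_ids(ids, batch_size=4000):
    """Split IDs into de-duplicated, sorted batches of at most batch_size."""
    unique = sorted(set(ids))
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]
```

With 22351 distinct IDs and a limit of 4000 this gives six batches: five full ones and a final batch of 2351.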

kshefchek commented 9 years ago

@nlwashington should we also be adding this triple: assoc hasPredicate RO:12345

nlwashington commented 9 years ago

We didn't before, but does seem important to disambiguate potential conflicts. @cmungall, any issues with this?

kshefchek commented 9 years ago

This is implemented for non-ambiguous associations. @cmungall I still need to figure out the correct RO properties to make these linkages (therapeutic and marker/mechanism) as well as the correct ECO class for the association.

nlwashington commented 9 years ago

@mellybelly, i've placed relationship requests here: 'treats': https://code.google.com/p/obo-relations/issues/detail?id=46 'correlated with': https://code.google.com/p/obo-relations/issues/detail?id=47

nlwashington commented 9 years ago

@cmungall please update with the relevant relationships to use here and reassign back to @kshefchek.

nlwashington commented 9 years ago

@kshefchek there seems to be an accessory file that includes publications that you've manually acquired and is part of the pipeline. can you document how you got this file in the dipper module? how can we keep it updated?

also, did you contact them about if they have any way of disambiguating "marker" from "mechanism"? those seem to be conflated, and really mean very different things. without disambiguation, i think we can at best only say that things related by "marker/mechanism" are just correlated in some way.

kshefchek commented 9 years ago

I'll write a script to generate the files needed for their batch load service. It still requires manually uploading the files, or successive copy/pastes. I'll bug them again about having this available as a download.

For the marker/mechanism, if I recall correctly from previous correspondence, this is the granularity in their system, i.e., I don't think they store separate marker and mechanism attributes.

cmungall commented 9 years ago

For the marker case, how about a relation that states exactly that, e.g. 'is marker for' or 'is compound marker for' (that sounds awkward).

This is stronger than correlation, but weaker than causation. Not everything that correlates is an effective marker.

nlwashington commented 9 years ago

the problem with CTD is that they merge together those associations that are either (or both) a marker of and mechanism for the disease, and we can't disambiguate them.

mellybelly commented 9 years ago

Have we requested that they change this, at least in the long term?

nlwashington commented 9 years ago

for treats, we will use 'substance_that_treats' RO:0002606. will update code accordingly.

cmungall commented 9 years ago

for markers, we will use 'is marker for' RO:0002607 from next RO release.

Def: c is marker for d iff the presence or occurrence of d is correlated with the presence or occurrence of c, and the observation of c is used to infer the presence or occurrence of d. Note that this does not imply that c and d are in a direct causal relationship, as it may be the case that there is a third entity e that stands in a direct causal relationship with c and d.
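The relation choices reached so far in this thread could be captured as a simple lookup. This is a sketch, not the dipper code: the names are hypothetical, and applying 'is marker for' to the whole `marker/mechanism` category is an assumption, since CTD does not separate markers from mechanisms. The combined `marker/mechanism|therapeutic` value is deliberately left unmapped because the table does not say which publication supports which evidence type.

```python
# Evidence type (as it appears in CTD) -> relation CURIE chosen in this thread.
PREDICATE_FOR_EVIDENCE = {
    'therapeutic': 'RO:0002606',       # 'substance_that_treats'
    'marker/mechanism': 'RO:0002607',  # 'is marker for' (assumed to cover the whole category)
}


def predicate_for(direct_evidence):
    """Return the relation CURIE, or None for ambiguous or unknown evidence."""
    return PREDICATE_FOR_EVIDENCE.get(direct_evidence.strip())
```

Returning `None` for the ambiguous value lets the caller skip those rows until the per-publication evidence is available.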

nlwashington commented 9 years ago

moving drug item to parking lot milestone until we have drug pages to best QC this data.

mbrush commented 9 years ago

@nlwashington re: "for treats, we will use 'substance_that_treats' RO:0002606. will update code accordingly."

The RO 'substance_that_treats' is a relationship that holds between a chemical 'c' and a disease 'd' - where c is capable of some activity that negatively regulates or decreases the magnitude of d.

This is not the relationship described in the CTD, which is between a gene and disease. The relationship we need would be something like 'is therapeutic target for' - a relationship between a gene and a disease where the gene is targeted by therapies for the disease.

kshefchek commented 6 years ago

closing as the initial ingest is stable, more data incoming from @balhoff's work for translator.