monarch-initiative / monarch-ingest

Data ingest application for Monarch Initiative knowledge graph using Koza
https://monarchinitiative.org

Orphanet gene to disease ingest proposal #378

Closed kevinschaper closed 5 months ago

kevinschaper commented 1 year ago

We haven't yet added Orphanet gene to disease associations to Monarch KG.

@caufieldjh and I whipped up a quick initial transform in a standalone repo, using yq to convert the Orphanet XML into JSON, and then Koza to transform it to KGX.

We'll likely want to pay special attention to mapping Orphanet association types to biolink predicates, and possibly qualifiers. (@sabrinatoro @sierra-moxon, what do you think?)

Here is our starting place for the predicates

RELATION_TYPE_MAP = {
    "Disease-causing germline mutation(s) in": "biolink:condition_associated_with_gene",
    "Disease-causing germline mutation(s) (loss of function) in": "biolink:condition_associated_with_gene",
    "Disease-causing germline mutation(s) (gain of function) in": "biolink:condition_associated_with_gene",
    "Role in the phenotype of": "biolink:condition_associated_with_gene",
    "Major susceptibility factor in": "biolink:condition_associated_with_gene",
    "Disease-causing somatic mutation(s) in": "biolink:condition_associated_with_gene",
    "Candidate gene tested in": "biolink:related_to",
    "Part of a fusion gene in": "biolink:condition_associated_with_gene",
    "Biomarker tested in": "biolink:has_biomarker",
}
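
For illustration, here is a minimal sketch of how a lookup against this map might behave when Orphanet introduces an association type we haven't mapped yet. The `map_relation` helper and the `FALLBACK_PREDICATE` choice are assumptions made for this example, not what the proof-of-concept transform actually does.

```python
from typing import Dict, List

# Starting-point mapping copied from this issue.
RELATION_TYPE_MAP: Dict[str, str] = {
    "Disease-causing germline mutation(s) in": "biolink:condition_associated_with_gene",
    "Disease-causing germline mutation(s) (loss of function) in": "biolink:condition_associated_with_gene",
    "Disease-causing germline mutation(s) (gain of function) in": "biolink:condition_associated_with_gene",
    "Role in the phenotype of": "biolink:condition_associated_with_gene",
    "Major susceptibility factor in": "biolink:condition_associated_with_gene",
    "Disease-causing somatic mutation(s) in": "biolink:condition_associated_with_gene",
    "Candidate gene tested in": "biolink:related_to",
    "Part of a fusion gene in": "biolink:condition_associated_with_gene",
    "Biomarker tested in": "biolink:has_biomarker",
}

# Assumption: unmapped types fall back to the most generic predicate.
FALLBACK_PREDICATE = "biolink:related_to"


def map_relation(relation_type: str, unmapped: List[str]) -> str:
    """Return the Biolink predicate for an Orphanet association type.

    Unknown types are recorded in `unmapped` so they can be reviewed
    instead of being dropped silently.
    """
    key = relation_type.strip()
    predicate = RELATION_TYPE_MAP.get(key)
    if predicate is None:
        unmapped.append(key)
        return FALLBACK_PREDICATE
    return predicate


unmapped_types: List[str] = []
known = map_relation("Biomarker tested in", unmapped_types)
unknown = map_relation("Some new association type", unmapped_types)
```

Collecting unmapped types rather than raising keeps the ingest running while still surfacing new Orphanet vocabulary for review.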
sagehrke commented 1 year ago

Hey team! Please add your planning poker estimate with Zenhub @putmantime @monicacecilia @cmungall @kevinschaper @amc-corey-cox

sagehrke commented 1 year ago

@monarch-initiative/monarch-internal Please take a moment to review this proposed ingest, make comments, and ask questions.

Once reviewed, follow these directions to vote:

If you approve of moving forward with this ingest, mark this comment with a 👍.

If you reject the proposal, mark this comment with a 👎.

Voting is open to all Monarch Internal and ends on December 12th, 2022.

pnrobinson commented 1 year ago

Please note that the HPO team has a lot of code for doing this and ideally we will use this code and/or extend and adapt it rather than reinventing this wheel. Depending on what kinds of data we want, it may already be in the phenotype.hpoa file.

RichardBruskiewich commented 1 year ago

Thanks @kevinschaper for these initial mappings.

@sierra-moxon, @cmungall, I guess we haven't yet emphasized sequence variants in Monarch, but conflate such variants with 'genes'? Maybe that suffices for Monarch given its scope and remit, but just posing the question, since the Biolink Model does have suitable category and association classes for more precision in this regard.

Otherwise, I wonder how lossy some of the predicate mappings are below with respect to biological interpretation, although "biolink:condition_associated_with_gene" is a decent 'catch all' predicate (do we have to be careful about the direction of the association, via scrutiny of the subject and object categories?).

In one or two cases below, I also wonder if adding a new predicate to Biolink might add greater power to the model, e.g. the "Candidate gene tested in": "biolink:related_to" mapping seems pretty generic. Maybe we need something like "biolink:candidate_gene_for" to flag such statements, along with associated assay evidence, which would be welcomed by researchers using the knowledge base?


RELATION_TYPE_MAP = {
    "Disease-causing germline mutation(s) in": "biolink:condition_associated_with_gene",
    "Disease-causing germline mutation(s) (loss of function) in": "biolink:condition_associated_with_gene",
    "Disease-causing germline mutation(s) (gain of function) in": "biolink:condition_associated_with_gene",
    "Role in the phenotype of": "biolink:condition_associated_with_gene",
    "Major susceptibility factor in": "biolink:condition_associated_with_gene",
    "Disease-causing somatic mutation(s) in": "biolink:condition_associated_with_gene",
    "Candidate gene tested in": "biolink:related_to",
    "Part of a fusion gene in": "biolink:condition_associated_with_gene",
    "Biomarker tested in": "biolink:has_biomarker",
}
cmungall commented 1 year ago

I will take a closer look later, but I think it would help if there was more scoping metadata at the top of this issue. There is a lot of info in the Orphanet XML files. As Peter says, some of this is already captured as d2p associations, so there is no need to do this twice.

It looks like the focus here is on g2d? Note these are also ingested in the HPO site and also in Exomiser.

@RichardBruskiewich regarding lossiness, we should preserve the original predicate in original_predicate, and the other info can be captured in qualifiers.

RichardBruskiewich commented 1 year ago

@cmungall good point about original_predicate. I'm not sure we're yet making a habit of doing that. @kevinschaper might wish to check.

That said, part of my point above was whether or not there are more precise Biolink predicates mapping onto some of the original Orphanet predicates above?

kevinschaper commented 1 year ago

We haven't done original_predicate anywhere yet. I'm not sure if we have other ingests with a proper CURIE predicate that we can capture, but it looks like we can here. To make sure that detail doesn't get lost, I added it to our proof of concept ingest.
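
As a rough sketch of what preserving the source relation could look like, the edge below carries the mapped Biolink predicate alongside the original Orphanet association type in `original_predicate`. The `make_edge` helper, the plain-dict edge shape, and the placeholder identifiers are assumptions for illustration only, and I'm assuming the disease is the subject and the gene is the object for `biolink:condition_associated_with_gene`. The actual proof of concept goes through Koza and may differ.

```python
import uuid
from typing import Dict


def make_edge(disease_id: str, gene_id: str,
              relation_type: str, predicate: str) -> Dict[str, str]:
    """Build a KGX-style edge record that keeps the source relation.

    `predicate` is the mapped Biolink predicate; `relation_type` is the
    Orphanet association type exactly as it appeared in the source.
    """
    return {
        "id": "uuid:" + str(uuid.uuid4()),
        "subject": disease_id,        # assumption: disease is the subject
        "predicate": predicate,
        "object": gene_id,            # assumption: gene is the object
        "original_predicate": relation_type,
    }


# Placeholder identifiers, not real Orphanet or HGNC records.
edge = make_edge(
    disease_id="Orphanet:0000",
    gene_id="HGNC:0000",
    relation_type="Disease-causing germline mutation(s) (loss of function) in",
    predicate="biolink:condition_associated_with_gene",
)
```

With the original relation kept on the edge, a lossy mapping such as the loss-of-function and gain-of-function types collapsing onto one predicate can still be recovered downstream.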

kevinschaper commented 1 year ago

Sorry for not replying earlier @pnrobinson. Assuming we do want to bring in g2d edges from Orphanet, would it make sense to produce some kind of TSV output of their g2d content from your pipeline so that we don't need to touch the original XML? I think having the predicate mapping occur in just one piece of code would be ideal.

sagehrke commented 1 year ago

We will talk about this in a data call in January 2023.

kevinschaper commented 5 months ago

We bring in Orphanet via the HPOA pipeline.