monarch-initiative / monarch-ingest

Data ingest application for Monarch Initiative knowledge graph using Koza
https://monarchinitiative.org

Gene Orthology (Panther) ingest #158

Closed RichardBruskiewich closed 2 years ago

RichardBruskiewich commented 2 years ago

This is an initial resolution of issue https://github.com/monarch-initiative/monarch-ingest/issues/156.

This PR adds an ingest of gene-to-gene orthology relationships. Gene nodes and association edges are extracted by parsing the RefGenomeOrthologs dataset from the PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System.

This release limits taxonomic coverage of orthology to human, mouse, rat, zebrafish, fruit fly, nematode, fission yeast and budding yeast. The list can be extended in the future by editing the taxon mapping table in the ~/monarch-ingest/monarch_ingest/orthology/orthology_utils.py file.
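To illustrate how a taxon mapping table can restrict coverage, here is a minimal sketch. It is not the actual content of orthology_utils.py: the names `TAXON_MAP`, `lookup_taxon` and `is_covered` are hypothetical, and the species codes and NCBI Taxon IDs are assumptions chosen to match the eight species listed above.

```python
from typing import Optional

# Hypothetical taxon mapping table: species code -> NCBI Taxon CURIE.
# Codes and taxon IDs are illustrative assumptions, not copied from the repo.
TAXON_MAP = {
    "HUMAN": "NCBITaxon:9606",   # human
    "MOUSE": "NCBITaxon:10090",  # mouse
    "RAT": "NCBITaxon:10116",    # rat
    "DANRE": "NCBITaxon:7955",   # zebrafish
    "DROME": "NCBITaxon:7227",   # fruit fly
    "CAEEL": "NCBITaxon:6239",   # nematode
    "SCHPO": "NCBITaxon:4896",   # fission yeast
    "YEAST": "NCBITaxon:4932",   # budding yeast
}


def lookup_taxon(species_code: str) -> Optional[str]:
    """Return the taxon CURIE for a species code, or None if not covered."""
    return TAXON_MAP.get(species_code)


def is_covered(species_a: str, species_b: str) -> bool:
    """An orthology pair is kept only if both species are in the table."""
    return species_a in TAXON_MAP and species_b in TAXON_MAP
```

With a table like this, extending coverage to another species would only require adding one entry to the dictionary.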

Further iterations of the ingest will likely cover:

RichardBruskiewich commented 2 years ago

@kevinschaper I propagated your commit de462ce decision to download.yaml

kevinschaper commented 2 years ago

I noticed a lot of map lookup failures on this and fairly small output (only ~2100 associations), it seems like the code might be always looking for a gene id via the uniprot_2_gene map, even when the gene id is given in the file? I can dig in deeper on that next week.
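A rough sketch of the lookup order being suggested: use the gene id given in the file when there is one, and fall back to the uniprot_2_gene map only otherwise. This is not the ingest's actual code. The function name `resolve_gene_id`, the `DB_PREFIX` table, and the assumed field layout (`SPECIES|DB=ID|UniProtKB=ACCESSION`) are all assumptions for illustration.

```python
from typing import Dict, Optional

# Hypothetical mapping from database tags in the file to CURIE prefixes.
DB_PREFIX = {
    "HGNC": "HGNC",
    "MGI": "MGI",
    "RGD": "RGD",
    "ZFIN": "ZFIN",
    "FlyBase": "FB",
    "WormBase": "WB",
    "PomBase": "PomBase",
    "SGD": "SGD",
}


def resolve_gene_id(field: str, uniprot_2_gene: Dict[str, str]) -> Optional[str]:
    """Resolve a gene CURIE from a field like 'HUMAN|HGNC=11477|UniProtKB=Q6GZX4'."""
    parts = field.split("|")
    if len(parts) != 3:
        return None
    _species, gene_part, protein_part = parts

    # Step 1: use the gene id given in the file if its database is recognized.
    db, _, local_id = gene_part.partition("=")
    # Tolerate a repeated tag such as 'MGI=MGI=1915571'.
    local_id = local_id.rsplit("=", 1)[-1]
    prefix = DB_PREFIX.get(db)
    if prefix and local_id:
        return f"{prefix}:{local_id}"

    # Step 2: otherwise fall back to the UniProt accession lookup.
    uniprot_id = protein_part.partition("=")[2]
    return uniprot_2_gene.get(uniprot_id)
```

Under this ordering, a missing entry in the uniprot_2_gene map would only cause a lookup failure for rows whose gene id database is not recognized.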

RichardBruskiewich commented 2 years ago

Hmm... yeah, sorry. I blissfully assumed that the uniprot_2_gene map is reasonably complete, but that may be a false assumption. The Panther gene ids, though, are a mixed bag of species-specific mappings, which is what prompted me to try resolving the UniProt IDs directly via the uniprot_2_gene map.

OK.. let me know if and how I can help sort this issue out. Sorry about that!

RichardBruskiewich commented 2 years ago

The CI run failure (as of 4:30 pm on 21 Feb 2022) is caused by the CURIE-parsing regex pattern in the Pydantic model generation code, which does not accept CURIE prefixes containing periods (e.g. PANTHER.FAMILY). Tests pass once this is fixed.
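For illustration, the failure mode can be reproduced with two toy patterns. These are not the actual regexes from the Pydantic model generation code: `STRICT_CURIE`, `RELAXED_CURIE` and `is_curie` are hypothetical names, and the identifier `PANTHER.FAMILY:PTHR10000` is only a sample value.

```python
import re

# Illustrative patterns only, not the real model-generation regexes.
# The strict pattern allows letters and underscores in the prefix, so a
# period in the prefix causes the match to fail.
STRICT_CURIE = re.compile(r"^[A-Za-z_]+:\S+$")

# The relaxed pattern additionally allows digits and periods after the
# first character of the prefix.
RELAXED_CURIE = re.compile(r"^[A-Za-z_][A-Za-z0-9_.]*:\S+$")


def is_curie(value: str, pattern: "re.Pattern[str]" = RELAXED_CURIE) -> bool:
    """Return True if the value matches the given CURIE pattern."""
    return pattern.match(value) is not None
```

A prefix such as PANTHER.FAMILY is rejected by the strict pattern and accepted by the relaxed one, while an ordinary CURIE such as HGNC:11477 passes both.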

kevinschaper commented 2 years ago

It's fixed in biolink-pydantic-model already, but we have to catch the other ingests up with model changes (mostly gene to pub becoming pub to gene) so that we can get this fix. @glass-ships is almost done with that in #170, so we need to get that PR in next, and this one should follow.

RichardBruskiewich commented 2 years ago

Good morning @putmantime, as @kevinschaper says, we patched the regex issue in the PydanticGen code, so the PANTHER.FAMILY CURIEs should soon pass muster (once the CI code is updated with the latest releases of the Pydantic model).

In addition to the Pydantic 'bug', following discussions with Kevin, I posted a checklist of a few other aspects of the ingest that could use some clarification or action (see https://github.com/monarch-initiative/monarch-ingest/issues/156#issuecomment-1048118380): a unit test (mock koza) glitch, Biolink Model provenance tagging compliance, and ingest species selection.

The other open issue is whether or not to ingest any additional PANTHER datasets (e.g. gene-to-gene-family associations, or gene-family-to-function associations for pathways and GO). Chris Bizon mentioned, though, that the RENCI Automat may already pull in some of this knowledge (?). I am unsure where to draw the line here.

This PR is otherwise ready to merge.

RichardBruskiewich commented 2 years ago

@putmantime @kevinschaper the Panther PR is ready for your final review, approval and merging.