monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

robot and arq pipeline to infer bl categories #950

Closed kshefchek closed 4 years ago

kshefchek commented 4 years ago

Infer biolink categories in two steps:

  1. Generate biolink categories for all classes in monarch.owl with a robot merge -> reason -> filter -> SPARQL CONSTRUCT pipeline.
  2. Generate biolink categories for all Monarch RDF using Jena ARQ and a SPARQL CONSTRUCT query.
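
To illustrate the kind of inference both steps perform, here is a minimal Python sketch that walks type and subclass edges from a node and collects a category whenever it reaches a mapped root class. It is not the SPARQL used in this PR: the `CATEGORY_ROOTS` table, the `EX:` identifiers, and the `infer_categories` function are hypothetical, and the path is read here as one or more hops over either rdf:type or rdfs:subClassOf, which is an assumption.

```python
from collections import deque

# Hypothetical root-class -> category table. The real rules live in the
# SPARQL CONSTRUCT queries, which are not shown in this thread.
CATEGORY_ROOTS = {
    "SO:0000704": "biolink:Gene",
}


def infer_categories(node, edges, category_roots=CATEGORY_ROOTS):
    """Collect categories reachable from `node` over one or more hops.

    `edges` maps a subject to the set of objects it points at via
    rdf:type or rdfs:subClassOf (both predicates are merged here).
    """
    seen = set()
    categories = set()
    queue = deque(edges.get(node, ()))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # guard against cycles in the class hierarchy
        seen.add(current)
        if current in category_roots:
            categories.add(category_roots[current])
        queue.extend(edges.get(current, ()))
    return categories


# Made-up example graph: an instance typed with a subclass of SO:0000704.
edges = {
    "EX:gene_1": {"EX:protein_coding_gene"},     # rdf:type
    "EX:protein_coding_gene": {"SO:0000704"},    # rdfs:subClassOf
}
print(infer_categories("EX:gene_1", edges))
```

In the actual pipeline this traversal is done by the reasoner and the SPARQL property path rather than by hand-written code.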

This runs in 38 minutes on monarch4 with make -j 5. However, it requires 60G of memory for ClinVar, and 30G for robot and some of the larger turtle files (Panther, MGI). Since each triple is evaluated on its own, with no evaluation across multiple triples, we could split the ntriples into chunks before inferring categories and then merge the results.
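
The chunking idea could look roughly like the sketch below. It relies on N-Triples being line-based (one triple per line), so a file can be cut at any line boundary. The function names and the stand-in `infer_chunk` callback are hypothetical; in practice the callback would be a run of the ARQ query over one chunk.

```python
from itertools import islice


def chunk_lines(lines, chunk_size):
    """Yield successive lists of at most `chunk_size` lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def infer_in_chunks(lines, infer_chunk, chunk_size=100_000):
    """Run `infer_chunk` on each chunk and merge the outputs in order."""
    merged = []
    for chunk in chunk_lines(lines, chunk_size):
        merged.extend(infer_chunk(chunk))
    return merged


def fake_infer(chunk):
    """Stand-in for the real inference step: echoes rdf:type triples."""
    return [line for line in chunk if " rdf:type " in line]


# Made-up triples in a simplified, prefixed notation for readability.
triples = [
    "EX:a rdf:type EX:Gene .",
    "EX:a rdfs:label \"a\" .",
    "EX:b rdf:type EX:Gene .",
]
print(infer_in_chunks(triples, fake_infer, chunk_size=2))
```

Peak memory is then bounded by the chunk size rather than by the largest source file, at the cost of one query invocation per chunk.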

Output from this makefile on our beta dataset is here - https://data.monarchinitiative.org/biolink-model/

This outputs a directory of shim files (just biolink:category triples), and concatenated versions of our dataset (raw output + shim).
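
As a rough picture of the two outputs, the sketch below builds a shim (category triples only) and appends it to the raw triples. The IRIs use the `https://w3id.org/biolink/vocab/` namespace as an assumption about how biolink:category is expanded, and the function names and example subject are hypothetical.

```python
# Assumed expansion of the biolink prefix; adjust to the prefix map in use.
BIOLINK = "https://w3id.org/biolink/vocab/"


def shim_lines(categories):
    """Turn {subject IRI: category name} into N-Triples category lines."""
    for subject, category in sorted(categories.items()):
        yield f"<{subject}> <{BIOLINK}category> <{BIOLINK}{category}> ."


def concatenate(raw_lines, shim):
    """Return the raw output followed by the shim triples."""
    return list(raw_lines) + list(shim)


# Made-up raw triple and inferred category.
raw = ['<http://example.org/gene1> <http://www.w3.org/2000/01/rdf-schema#label> "gene1" .']
inferred = {"http://example.org/gene1": "Gene"}

shim = list(shim_lines(inferred))
for line in concatenate(raw, shim):
    print(line)
```

Keeping the shim as a separate file means the category triples can be regenerated or dropped without touching the raw ingest output.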

I see this as complementary to the hardcoding approach, where we choose inference in the following scenarios:

  1. We're already making an rdf:type triple that can be inferred over a rdf:type|subClassOf+ path (stay DRY)
  2. We don't know the type at ingest time (CTD and MeSH)
  3. There are important subtypes that we don't know at ingest time (ChEBI Drug, Metabolite, etc.)
  4. The type for an identifier is ambiguous and may be applied differently for each source (OMIM percent-sign IDs, see Coriell)
  5. When hardcoding sequence feature categories is non-trivial and error-prone, such as in MGI and ZFIN

Where we should hardcode:

  1. Association categories, where inference is possible but messier and would require sparql generation in biolink ml
  2. Where we have application-driven hacks and want to override category rules (EFO phenotypes in gwascatalog, MPATH and OBA phenotypes in OMIA)
  3. Places where we skip rdf:type, such as gene-centric ingests where we assume a more granular gene type will come from another ingest (although we could type these as SO:0000704 and it would work just as well)
  4. Any other area where rdf:type|subClassOf|partOf+ entailment is not possible

TomConlin commented 4 years ago

+1