robot and arq pipeline to infer bl categories

Infer biolink categories in two steps:

Generate biolink categories for all classes in monarch.owl with a robot merge -> reason -> filter -> SPARQL CONSTRUCT pipeline.
Generate biolink categories for all Monarch RDF using Jena ARQ and a SPARQL CONSTRUCT query.
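To make the two steps concrete, here is a minimal Python sketch of the underlying idea rather than the actual robot/ARQ pipeline. It assumes a toy subClassOf table and a hypothetical mapping from ontology classes to biolink categories (the `EX:` identifiers and the `SO:0000704` -> `biolink:Gene` anchor are illustrative only), and it assumes every category reachable along the rdf:type/subClassOf path is emitted.

```python
from collections import deque

# Hypothetical toy ontology: child class -> direct superclasses.
SUBCLASS_OF = {
    "EX:protein_coding_gene": ["SO:0000704"],
    "SO:0000704": ["EX:sequence_feature"],
}

# Hypothetical anchors: ontology class -> biolink category.
ANCHORS = {
    "SO:0000704": "biolink:Gene",
}


def class_categories(cls):
    """Step 1: collect categories for a class by walking subClassOf upward."""
    found, seen, queue = set(), {cls}, deque([cls])
    while queue:
        current = queue.popleft()
        if current in ANCHORS:
            found.add(ANCHORS[current])
        for parent in SUBCLASS_OF.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return found


def instance_categories(type_triples):
    """Step 2: turn (subject, class) rdf:type pairs into category triples."""
    shim = []
    for subject, cls in type_triples:
        for category in sorted(class_categories(cls)):
            shim.append((subject, "biolink:category", category))
    return shim
```

For example, `instance_categories([("EX:gene1", "EX:protein_coding_gene")])` yields a single `biolink:category` triple for `EX:gene1`, because the only anchored ancestor in this toy table is `SO:0000704`.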

This runs in 38 minutes on monarch4 with make -j 5. However, it requires 60G of memory for ClinVar, and 30G for robot and some of the larger Turtle files (Panther, MGI). Since the category inference never evaluates across multiple triples, we could split the N-Triples into chunks before inferring categories and then merge the results.
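The chunking idea could look roughly like the sketch below. It is not part of the makefile; it assumes the class-to-category lookup has already been computed, that the input is N-Triples with one triple per line, and that categories are written as IRIs under the biolink vocab namespace. The function names and the chunk size are hypothetical.

```python
from itertools import islice

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
# Assumed predicate IRI for biolink:category.
BL_CATEGORY = "<https://w3id.org/biolink/vocab/category>"


def chunked(lines, size):
    """Yield successive lists of at most `size` lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def infer_chunk(chunk, class_to_category):
    """Emit a category triple for each rdf:type triple with a known class."""
    shim = []
    for line in chunk:
        # Drop the trailing " ." and split into subject, predicate, object.
        parts = line.strip().rstrip(".").strip().split(None, 2)
        if len(parts) != 3:
            continue
        subject, predicate, obj = parts
        if predicate == RDF_TYPE and obj in class_to_category:
            shim.append(f"{subject} {BL_CATEGORY} {class_to_category[obj]} .")
    return shim


def infer_in_chunks(lines, class_to_category, size=100000):
    """Process each chunk independently, then merge the results in order."""
    merged = []
    for chunk in chunked(lines, size):
        merged.extend(infer_chunk(chunk, class_to_category))
    return merged
```

Because each triple is handled on its own, the merged result does not depend on the chunk size, which is what would let the chunks run as separate low-memory jobs.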

Output from this makefile on our beta dataset is here: https://data.monarchinitiative.org/biolink-model/

This outputs a directory of shim files (containing only biolink:category triples) and concatenated versions of our dataset (raw output + shim).
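As an illustration of that output layout, the following sketch writes a shim file and a concatenated file for one source. The `shim` and `merged` directory names and the `write_outputs` function are hypothetical, and the sketch assumes the raw file is N-Triples so that plain concatenation stays valid.

```python
from pathlib import Path


def write_outputs(raw_path, shim_lines, out_dir):
    """Write the shim triples alone, and the raw file with the shim appended."""
    raw_path = Path(raw_path)
    shim_dir = Path(out_dir) / "shim"      # hypothetical directory name
    merged_dir = Path(out_dir) / "merged"  # hypothetical directory name
    shim_dir.mkdir(parents=True, exist_ok=True)
    merged_dir.mkdir(parents=True, exist_ok=True)

    shim_text = "".join(line + "\n" for line in shim_lines)
    shim_path = shim_dir / raw_path.name
    shim_path.write_text(shim_text)

    raw_text = raw_path.read_text()
    # Make sure the raw output ends with a newline before appending.
    if raw_text and not raw_text.endswith("\n"):
        raw_text += "\n"
    merged_path = merged_dir / raw_path.name
    merged_path.write_text(raw_text + shim_text)
    return shim_path, merged_path
```

Keeping the shim separate means the category triples can be regenerated or dropped without touching the raw ingest output.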

I see this as complementary to the hardcoding approach, where we choose inference in the following scenarios:

We're already making an rdf:type triple from which the category can be inferred over an rdf:type|subClassOf+ path (stay DRY)
We don't know the type at ingest time (CTD and MeSH)
There are important subtypes that we don't know at ingest time (ChEBI Drug, Metabolite, etc.)
The type for an identifier is ambiguous and may be applied differently for each source (OMIM percent-sign IDs; see Coriell)
Hardcoding sequence feature categories is non-trivial and error-prone, such as in MGI and ZFIN

Where we should hardcode:

Association categories, where inference is possible but messier and would require sparql generation in biolink ml
Where we have application-driven hacks and want to override category rules (EFO phenotypes in gwascatalog, MPATH and OBA phenotypes in OMIA)
Places where we skip rdf:type, such as gene-centric ingests where we assume a more granular gene type will come from another ingest (although we could type these as SO:0000704 and it would work just as well)
Any other area where rdf:type|subClassOf|partOf+ entailment is not possible

monarch-initiative / dipper #950