ontodev / robot

ROBOT is an OBO Tool
http://robot.obolibrary.org
BSD 3-Clause "New" or "Revised" License

Boost efficiency of CHEBI/NCBITaxon imports? #552

Open pbuttigieg opened 5 years ago

pbuttigieg commented 5 years ago

Hi all,

In the day-to-day of mirroring and importing minimal subgraphs, CHEBI and NCBITaxon stick out as outliers due to their relatively monstrous size.

Is there any magic that can be done to reduce the memory requirements for handling these?

matentzn commented 5 years ago

@pbuttigieg I 100% feel your pain, and share it on a daily basis. PR, CHEBI, NCBITaxon. OK, let me put it like this: you can reduce time by reducing network traffic, and you do that by tapping the gzipped version of an ontology; many, such as CHEBI, have one. There is a patch for ROBOT in the pipeline to handle gzipped IRIs correctly (it's fixed but not yet published); until then, wget the gzipped ontology and just open the file with ROBOT the usual way.
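A minimal sketch of that workaround (the local and output filenames are just placeholders; the chebi.owl.gz PURL is the one used later in this thread):

# fetch the gzipped release once, then have ROBOT read the local copy
wget http://purl.obolibrary.org/obo/chebi.owl.gz
robot merge --input chebi.owl.gz --output mirror/chebi.owl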

The other way to handle this would be to lobby the OBO Foundry to define "useful subsets" of these ontologies; this would be best, but I can't see it happening soon.

Unfortunately, this is not a ROBOT issue... :/

I personally excluded these huge ontologies from the automatic update pipelines à la ODK, and only refresh them explicitly overnight when I feel like it. Not good.

jamesaoverton commented 5 years ago

ROBOT has been able to handle gzipped files from disk for a while now. There was a bug with --import-iri for gzipped files that has been fixed by #537 and will be included in the next release.

The NCBI Taxonomy is quite simple. It's just driven by some tables, so it's easy enough to write custom code for working with it, along the lines of https://github.com/obophenotype/ncbitaxon
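For a sense of how simple those tables are, here is a rough sketch of fetching them (the NCBI download URL and the pipe-delimited nodes.dmp/names.dmp layout are assumptions about the standard taxdump distribution, not something specified in this thread):

# fetch the raw taxonomy dump that tools like obophenotype/ncbitaxon build from
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz nodes.dmp names.dmp
# names.dmp rows: tax_id | name | unique name | name class
grep 'scientific name' names.dmp | head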

For the other large ontologies that take advantage of OWL, sometimes you can get away with just using SPARQL. If you need to do proper OWL work, you need OWLAPI, and large ontologies just require a lot of memory. ROBOT tries to be a lightweight layer on top of OWLAPI -- if anybody sees inefficiencies in ROBOT, let us know and we'll try to optimize.

The main exception is ROBOT query, which by default loads the ontology with OWLAPI and then copies it into Apache Jena. For large ontologies, try the new --tdb option, which loads with Jena directly and stores the triples on disk: http://robot.obolibrary.org/query#executing-on-disk
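A sketch of the on-disk mode (the ontology and query filenames are placeholders; see the linked docs for exact usage):

# build a Jena TDB dataset on disk instead of loading everything into memory
robot query --tdb true --input chebi.owl --query my_query.sparql results.tsv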

@matentzn is right that we could work toward providing useful subsets of the large ontologies.

cmungall commented 5 years ago

I agree with all the above, re gz and tdb.

An overly complicated solution would be to put all ontologies in S3 buckets and make it easier for people to run things colocated on EC2, perhaps via some build-as-a-service [future grant idea].

We could also implement SLME over SPARQL. Or just MIREOT (usually the formal guarantees of SLME are not required for something like NCBITaxon). Or we could simply have ROBOT call the OntoFox API. Or you could simply call OntoFox from your own pipeline. (I think that is in decreasing order of work for ROBOT developers.)
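The discussion above is about doing this without loading the full file, but for reference, ROBOT's existing extract command already covers both approaches locally; a sketch with placeholder filenames:

# SLME (STAR) module around a seed set of terms
robot extract --method STAR --input chebi.owl --term-file seed_terms.txt --output chebi_module.owl
# or a plain MIREOT extraction
robot extract --method MIREOT --input ncbitaxon.owl --lower-terms seed_terms.txt --output taxon_module.owl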

There are various dependency issues here. If we are to depend on an external SPARQL endpoint, I would rather have it depend on one that implements standard patterns for organizing ontologies into named graphs (another hobby horse of mine).

Note that for ncbitaxon we do have a slim (which is honestly still quite chonky): http://obofoundry.org/ontology/ncbitaxon. I need to add a description to this...
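If the slim is enough for a given pipeline, the mirror step can point at it directly; a sketch, assuming the taxslim subset PURL listed on that registry page (check the page for the exact product IRI):

robot merge -I http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim.owl -o mirror/ncbitaxon.owl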

I thought we had something in ODK to make it easy to specify a slim rather than a full product, but it looks like that is not the case...

It's hard to say what the right subset of CHEBI would be. Sometimes the 'naturally occurring' subset is most useful, but maybe not for Pier's use case, where we might want X-contaminated soil, where X is an anthropogenic product.

pbuttigieg commented 5 years ago

Thanks @cmungall @matentzn @jamesaoverton for the perspective and guidance

Strong +1 for going for a Data as a Service model (push the code to a remote data hosting space and pull back only results).

Until then, and because we have the luxury of servers plugged into beast-mode internet/network resources at our institute, I'm just spinning up the Docker container there to do any heavy lifting.

Would be cool to see this issue punted to OBO Operations if you feel it's better there.

matentzn commented 5 years ago

Just as a side note: with the new ROBOT (1.4.3) you can now do this:

robot merge -I http://purl.obolibrary.org/obo/chebi.owl.gz -o mirror/chebi.owl

This will save you a great deal of time when refreshing CHEBI.