hubmapconsortium / ontology-api

The HuBMAP Ontology Service

Ontology: Automate Knowledge Graph Release #152

Closed AlanSimmons closed 1 year ago

AlanSimmons commented 1 year ago

Questions:

  1. What can be automated in this workflow (Slide 2)?

  2. How long would it take to spin up a Docker container for the neo4j ontology database from scratch?

  3. What is the benefit of a Docker deployment?

What can be automated?

To start the process, we first need to obtain 12 CSV files of UMLS concept and relationship data. We currently obtain these as exports from Neptune (Slide 5). The CSV files are themselves simple conversions of the UMLS RRF files downloaded from UMLS MetamorphoSys and the Semantic Network. The current process leverages this work, since DBMI needs the RRF files to refresh Neptune anyway.

Once we have the 12 UMLS CSV files, we successively enhance them, adding nodes and relationships from the specified ontologies. For each ontology, this entails converting its OWL file to OWLNETS format (the PheKnowLator conversion) and then appending the resulting nodes and relationships to the current set of CSVs (see the sketch below).

In other words, we do this for each ontology that we add. If we add 2 ontologies, A and B, then we wind up with 3 sets of CSVs:

  1. The seed UMLS set.
  2. The UMLS set plus A.
  3. The UMLS set plus A plus B.

After running the process for all ontologies, we have 12 CSV files that can be used to generate the neo4j database, as well as a lot of intermediate copies of the CSV files.
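As a rough illustration, here is a minimal Python sketch of that successive-enhancement loop. The directory layout and the `owlnets_converter` invocation are placeholders for illustration, not the project's real build script:

```python
# Minimal sketch of the successive-enhancement loop. Paths and the
# "owlnets_converter" invocation are illustrative assumptions, not the
# project's actual build script.
import shutil
import subprocess
from pathlib import Path

ONTOLOGIES = ["A", "B"]  # each added ontology yields a new CSV set

def build_csv_sets(seed_dir: Path) -> Path:
    current = seed_dir  # the 12 UMLS "seed" CSVs
    for i, onto in enumerate(ONTOLOGIES, start=1):
        out_dir = Path(f"csv/step{i}_{onto}")
        out_dir.mkdir(parents=True, exist_ok=True)
        # Start from a copy of the previous set (hence the pile of
        # intermediate copies mentioned above).
        for f in current.glob("*.csv"):
            shutil.copy(f, out_dir / f.name)
        # Hypothetical converter call: OWL -> OWLNETS (the PheKnowLator
        # step), appending the new nodes/relationships to this CSV set.
        subprocess.run(
            ["python", "-m", "owlnets_converter", onto, "--out", str(out_dir)],
            check=True,
        )
        current = out_dir
    return current  # the final set feeds the neo4j import

if __name__ == "__main__":
    print(build_csv_sets(Path("csv/seed")))
```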

How long would it take to spin up a Docker container from scratch?

The rate-limiting steps are:

  1. Getting the "seed" set of CSV files.
  2. Running the OWLNETS converters to generate the OWLNETS files.

It may be possible to automate the first step using PyMedTermino; however, it may not be worth the effort compared with simply downloading the files manually.
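For reference, if we did go that route, PyMedTermino2 (bundled with Owlready2) can load a UMLS Metathesaurus release directly from the zip downloaded from NLM. A hedged sketch; the zip filename and terminology list are placeholders:

```python
# Sketch: loading UMLS into a local quadstore with PyMedTermino2 (Owlready2).
# The zip filename and terminology list below are placeholders.
from owlready2 import default_world, get_ontology
from owlready2.pymedtermino2.umls import import_umls

default_world.set_backend(filename="pym.sqlite3")   # persistent local store
import_umls("umls-2023AA-metathesaurus-full.zip",
            terminologies=["CUI", "SNOMEDCT_US"])   # subset to what we need
default_world.save()

PYM = get_ontology("http://PYM/").load()
# We would still have to walk the loaded concepts and export them into the
# 12-CSV layout ourselves, which is the part that may not be worth the effort.
```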

The second rate-limiting step is interesting. A complete end-to-end serial run of the build script takes nearly a day to complete (22 hours on September 15); however, the bulk of this time is taken up by the conversion of just two ontologies: CHEBI (8 hours) and PR (11 hours). The other ontologies each require anywhere from 1 to 8 minutes. This is a case of the tail wagging the dog.

My laptop (32 GB Apple M1, 2.7 GHz, 10-core) grinds for a day on CHEBI and PR because it processes them serially. The conversions could be parallelized by instantiating one Docker container per ontology; however, it may be worth asking whether we actually need CHEBI and PR. They're pretty expensive.
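Even without one container per ontology, the conversions are independent of one another, so a worker pool would let CHEBI and PR run alongside the small ontologies. A rough sketch; the converter command is again a placeholder:

```python
# Sketch: run the per-ontology OWLNETS conversions in parallel instead of
# serially, so CHEBI and PR no longer gate the whole build. The converter
# command is a placeholder for the real build-script invocation.
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

ONTOLOGIES = ["CHEBI", "PR", "UBERON", "CL"]  # illustrative list

def convert(onto: str) -> str:
    # Each conversion is an independent external process, so threads suffice.
    subprocess.run(["python", "-m", "owlnets_converter", onto], check=True)
    return onto

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(convert, o) for o in ONTOLOGIES]
    for fut in as_completed(futures):
        print(f"finished {fut.result()}")
```

The wall-clock time would then approach the longest single conversion (PR's 11 hours) rather than the 22-hour serial total.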

It is possible to run the build scripts without doing the PheKnowLator conversion. However, this means that we'd need to archive the OWLNETS files somewhere.
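Concretely, the build script could look for an archived copy of each ontology's OWLNETS files and skip the conversion when one exists. A minimal sketch, assuming a local archive directory; the file names are illustrative:

```python
# Sketch: skip the PheKnowLator conversion when archived OWLNETS files exist.
# The archive location and file names are assumptions for illustration.
import shutil
import subprocess
from pathlib import Path

ARCHIVE = Path("owlnets_archive")
WORKDIR = Path("owlnets_output")
OWLNETS_FILES = ["OWLNETS_edgelist.txt", "OWLNETS_node_metadata.txt"]

def get_owlnets(onto: str) -> Path:
    dest = WORKDIR / onto
    dest.mkdir(parents=True, exist_ok=True)
    cached = ARCHIVE / onto
    if all((cached / f).exists() for f in OWLNETS_FILES):
        for f in OWLNETS_FILES:                 # reuse the archived files
            shutil.copy(cached / f, dest / f)
    else:                                       # fall back to the slow conversion
        subprocess.run(["python", "-m", "owlnets_converter", onto], check=True)
    return dest
```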

What is the benefit of a Docker deployment?

The Dockerfile does a lot of setup work.

That argues for deploying as a container. However, the fact that the neo4j database itself contains licensed content means that we might have to mount the database files external to the container. In that case, the Dockerfile would essentially generate an empty neo4j instance that points to an external set of data files. Is there value in doing this? It just seems to complicate things.
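For the record, that shape of deployment would look something like the following: the licensed database stays on the host and is mounted into the container at run time, so the image itself ships empty. The image tag and host path are placeholder assumptions; the official neo4j image keeps its database under /data:

```python
# Sketch: launch neo4j with the database mounted from the host, so the
# licensed content stays outside the distributed image. Image tag and host
# path are placeholders; the official neo4j image stores its database in /data.
import subprocess
from pathlib import Path

data_dir = Path("/opt/ubkg/neo4j-data").resolve()  # licensed, CSV-built database

subprocess.run([
    "docker", "run", "-d",
    "--name", "ontology-neo4j",
    "-p", "7474:7474", "-p", "7687:7687",      # HTTP and Bolt ports
    "-v", f"{data_dir}:/data",                 # external, non-distributed data
    "neo4j:4.4",
], check=True)
```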

The question may be more "What is the benefit of pushing this to Docker Hub?"