hubmapconsortium / ontology-api

The HuBMAP Ontology Service

Ontology: Automate Knowledge Graph Release #152

Closed AlanSimmons closed 1 year ago

AlanSimmons commented 1 year ago

Questions:

  1. What can be automated in this workflow (Slide 2)?

  2. How long would it take to spin up a Docker container for the neo4j ontology database from scratch?

  3. What is the benefit of a Docker deployment?

What can be automated?

To start the process, we first need to obtain 12 CSV files of UMLS concept and relationship data. We currently obtain these as exports from Neptune (Slide 5). The CSV files are themselves simple conversions of the UMLS RRF files downloaded from UMLS MetamorphoSys and the Semantic Network. The current process leverages this work, since DBMI needs the RRF files to refresh Neptune anyway.

Once we have the 12 UMLS CSV files, we successively enhance them, adding nodes and relationships from the specified ontologies. For each ontology, this entails converting its OWL file to OWLNETS format (the PheKnowLator conversion) and then appending the resulting nodes and relationships to the current set of CSVs (see the sketch below).

In other words, we do this for each ontology that we add. If we add 2 ontologies, A and B, then we wind up with 3 sets of CSVs:

  1. The seed UMLS set.
  2. The UMLS set plus A.
  3. The UMLS set plus A plus B.

After running the process for all ontologies, we have 12 CSV files that can be used to generate the neo4j database, as well as a lot of intermediate copies of the CSV files.
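As a rough illustration, here is a minimal Python sketch of that successive-enhancement loop. The directory layout and the `owlnets_converter` invocation are placeholders for illustration, not the project's real build script:

```python
# Minimal sketch of the successive-enhancement loop. Paths and the
# "owlnets_converter" invocation are illustrative assumptions, not the
# project's actual build script.
import shutil
import subprocess
from pathlib import Path

ONTOLOGIES = ["A", "B"]  # each added ontology yields a new CSV set

def build_csv_sets(seed_dir: Path) -> Path:
    current = seed_dir  # the 12 UMLS "seed" CSVs
    for i, onto in enumerate(ONTOLOGIES, start=1):
        out_dir = Path(f"csv/step{i}_{onto}")
        out_dir.mkdir(parents=True, exist_ok=True)
        # Start from a copy of the previous set (hence the pile of
        # intermediate copies mentioned above).
        for f in current.glob("*.csv"):
            shutil.copy(f, out_dir / f.name)
        # Hypothetical converter call: OWL -> OWLNETS (the PheKnowLator
        # step), appending the new nodes/relationships to this CSV set.
        subprocess.run(
            ["python", "-m", "owlnets_converter", onto, "--out", str(out_dir)],
            check=True,
        )
        current = out_dir
    return current  # the final set feeds the neo4j import

if __name__ == "__main__":
    print(build_csv_sets(Path("csv/seed")))
```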

How long would it take to spin up a Docker container from scratch?

The rate-limiting steps are:

  1. Getting the "seed" set of CSV files.
  2. Running the OWLNETS converters to generate the OWLNETS files.

It may be possible to automate the first step using PyMedTermino; however, it may not be worth the effort compared with simply downloading the files manually.
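For reference, if we did go that route, PyMedTermino2 (bundled with Owlready2) can load a UMLS Metathesaurus release directly from the zip downloaded from NLM. A hedged sketch; the zip filename and terminology list are placeholders:

```python
# Sketch: loading UMLS into a local quadstore with PyMedTermino2 (Owlready2).
# The zip filename and terminology list below are placeholders.
from owlready2 import default_world, get_ontology
from owlready2.pymedtermino2.umls import import_umls

default_world.set_backend(filename="pym.sqlite3")   # persistent local store
import_umls("umls-2023AA-metathesaurus-full.zip",
            terminologies=["CUI", "SNOMEDCT_US"])   # subset to what we need
default_world.save()

PYM = get_ontology("http://PYM/").load()
# We would still have to walk the loaded concepts and export them into the
# 12-CSV layout ourselves, which is the part that may not be worth the effort.
```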

The second rate-limiting step is interesting. A complete end-to-end serial run of the build script takes nearly a day to complete (22 hours on September 15); however, the bulk of this time is taken up by the conversion of just two ontologies: CHEBI (8 hours) and PR (11 hours). The other ontologies each require anywhere from 1 to 8 minutes. This is a case of the tail wagging the dog.

My laptop (32 GB Apple M1, 2.7 GHz, 10-core) grinds for a day on CHEBI and PR because it processes them serially. The conversions could be parallelized by instantiating one Docker container per ontology; however, it may be worth asking whether we actually need CHEBI and PR. They're pretty expensive.
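Even without one container per ontology, the conversions are independent of one another, so a worker pool would let CHEBI and PR run alongside the small ontologies. A rough sketch; the converter command is again a placeholder:

```python
# Sketch: run the per-ontology OWLNETS conversions in parallel instead of
# serially, so CHEBI and PR no longer gate the whole build. The converter
# command is a placeholder for the real build-script invocation.
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

ONTOLOGIES = ["CHEBI", "PR", "UBERON", "CL"]  # illustrative list

def convert(onto: str) -> str:
    # Each conversion is an independent external process, so threads suffice.
    subprocess.run(["python", "-m", "owlnets_converter", onto], check=True)
    return onto

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(convert, o) for o in ONTOLOGIES]
    for fut in as_completed(futures):
        print(f"finished {fut.result()}")
```

The wall-clock time would then approach the longest single conversion (PR's 11 hours) rather than the 22-hour serial total.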

It is possible to run the build scripts without doing the PheKnowLator conversion. However, this means that we'd need to archive the OWLNETS files somewhere.
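Concretely, the build script could look for an archived copy of each ontology's OWLNETS files and skip the conversion when one exists. A minimal sketch, assuming a local archive directory; the file names are illustrative:

```python
# Sketch: skip the PheKnowLator conversion when archived OWLNETS files exist.
# The archive location and file names are assumptions for illustration.
import shutil
import subprocess
from pathlib import Path

ARCHIVE = Path("owlnets_archive")
WORKDIR = Path("owlnets_output")
OWLNETS_FILES = ["OWLNETS_edgelist.txt", "OWLNETS_node_metadata.txt"]

def get_owlnets(onto: str) -> Path:
    dest = WORKDIR / onto
    dest.mkdir(parents=True, exist_ok=True)
    cached = ARCHIVE / onto
    if all((cached / f).exists() for f in OWLNETS_FILES):
        for f in OWLNETS_FILES:                 # reuse the archived files
            shutil.copy(cached / f, dest / f)
    else:                                       # fall back to the slow conversion
        subprocess.run(["python", "-m", "owlnets_converter", onto], check=True)
    return dest
```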

What is the benefit of a Docker deployment?

The Dockerfile does a lot of setup work.

That argues for deploying as a container. However, the fact that the neo4j database itself contains licensed content means that we might have to mount the database files external to the container. In that case, the Dockerfile would essentially generate an empty neo4j instance that points to an external set of data files. Is there value in doing this? It just seems to complicate things.
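For the record, that shape of deployment would look something like the following: the licensed database stays on the host and is mounted into the container at run time, so the image itself ships empty. The image tag and host path are placeholder assumptions; the official neo4j image keeps its database under /data:

```python
# Sketch: launch neo4j with the database mounted from the host, so the
# licensed content stays outside the distributed image. Image tag and host
# path are placeholders; the official neo4j image stores its database in /data.
import subprocess
from pathlib import Path

data_dir = Path("/opt/ubkg/neo4j-data").resolve()  # licensed, CSV-built database

subprocess.run([
    "docker", "run", "-d",
    "--name", "ontology-neo4j",
    "-p", "7474:7474", "-p", "7687:7687",      # HTTP and Bolt ports
    "-v", f"{data_dir}:/data",                 # external, non-distributed data
    "neo4j:4.4",
], check=True)
```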

The question may be more "What is the benefit of pushing this to Docker Hub?"