hubmapconsortium / ontology-api

The HuBMAP Ontology Service
MIT License

Ontology: Orchestration of UMLS CSV exports for ontology neo4j #148

Closed AlanSimmons closed 1 year ago

AlanSimmons commented 2 years ago

ISSUE The ontology neo4j database depends on a set of CSV files that represent UMLS concepts and relationships. The CSV files must be available to the Dockerfile for the ontology database at startup.

To date, these files have been generated manually by executing a series of Python scripts in a Jupyter notebook. The scripts connect to the Neptune database and work with tables exported from the UMLS via the application MetamorphoSys. The CSV files are then copied to a staging directory.

The scripts are static and decoupled--i.e., no script depends on the output of another script.

Issues with the manual process:

  1. Some of the CSV file exports are so large (CUI-CUIs.CSV has 25+M rows) that the Pulse VPN times out and closes the Jupyter session (after around 3 hours) before the export completes.
  2. There appears to be a branching problem in the script. I have two different versions of the export notebook, which appear to differ in the final step, in which NDC concepts are appended to the CODES.csv export file. The version in GitHub (https://github.com/dbmi-pitt/UMLS-Graph/blob/master/UMLS-Graph-Extracts.ipynb) uses pandas to perform the final append; however, it appears possible to do this with CTEs in the query that generates CODES.csv. In other words, CODES.csv can be generated with one script instead of two.
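A minimal sketch of folding the NDC append into the CODES.csv query with a CTE, instead of a second pandas step. The table and column names (mrconso, SAB/CODE/STR) are hypothetical simplifications, and sqlite3 stands in for the Oracle-hosted UMLS schema so the sketch is self-contained:

```python
import csv
import sqlite3

# Stand-in data: sqlite3 replaces the Oracle/Neptune UMLS schema here, and
# the mrconso table below is a hypothetical simplification.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mrconso (sab TEXT, code TEXT, str TEXT);
    INSERT INTO mrconso VALUES
        ('SNOMEDCT_US', '12345', 'Example concept'),
        ('NDC', '0002-1433-80', 'Example NDC code');
""")

# One query instead of two scripts: the base CODES export and the
# NDC rows are combined with CTEs and a UNION ALL.
query = """
WITH base_codes AS (
    SELECT sab, code, str FROM mrconso WHERE sab <> 'NDC'
),
ndc_codes AS (
    SELECT sab, code, str FROM mrconso WHERE sab = 'NDC'
)
SELECT * FROM base_codes
UNION ALL
SELECT * FROM ndc_codes
"""

with open("CODES.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["SAB", "CODE", "STR"])
    writer.writerows(conn.execute(query))
```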

PROPOSED SOLUTION I believe that the export of UMLS data should be automated and executed from within Neptune instead of manually in a Jupyter session.

STEPS

  1. Resolve the differences between the two versions of the scripts. I think that the tasks related to NDC concepts can be incorporated into the earlier step that generates the CODES.CSV file.
  2. Package the scripts into an Oracle process. The process would be essentially "execute these (11 or 12) queries, exporting results to CSV files."
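The "execute these queries, export results to CSV" process in step 2 could be driven by a small loop like the sketch below. The query map and table are hypothetical examples; in production the connection would target the Oracle (Neptune) UMLS schema (e.g. via a DB-API driver such as cx_Oracle), while sqlite3 is used here only so the sketch runs on its own:

```python
import csv
import sqlite3

def export_queries_to_csv(conn, exports):
    """Run each query and write its result set to the named CSV file.

    `exports` maps output filenames to SQL text. In production there would
    be ~11-12 entries, one per export file (CODES.csv, CUI-CUIs.csv, ...).
    """
    for filename, sql in exports.items():
        cur = conn.execute(sql)
        with open(filename, "w", newline="") as f:
            writer = csv.writer(f)
            # Header row from the cursor's column metadata.
            writer.writerow([col[0] for col in cur.description])
            writer.writerows(cur)

# Hypothetical example with a single export query.
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE cui (id TEXT); INSERT INTO cui VALUES ('C0000005');"
)
export_queries_to_csv(conn, {"CUIs.csv": "SELECT id AS CUI FROM cui"})
```

Running the full export server-side this way would also avoid the Pulse VPN timeout, since no long-lived Jupyter session is involved.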
AlanSimmons commented 2 years ago

The proposed direction is to convert the existing Jupyter notebook script into a pure Python script.

AlanSimmons commented 2 years ago

Converted the Jupyter notebook to a Python script named UMLS-Graph-Extracts.py that accepts the ID of the UMLS schema in Neptune--e.g.,

python UMLS-Graph-Extracts.py UMLS2021AB

Currently, I do not have permissions to push code to the dbmi-pitt/UMLS-Graph repo.
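The command-line entry point above can be sketched with argparse; the argument handling shown here is an assumption about how UMLS-Graph-Extracts.py reads the schema ID, not the script's actual code:

```python
import argparse

def parse_args(argv=None):
    # Sketch of the entry point: a single positional argument carries the
    # UMLS schema ID in Neptune (e.g. UMLS2021AB).
    parser = argparse.ArgumentParser(
        description="Export UMLS ontology CSV files from a Neptune schema."
    )
    parser.add_argument("schema", help="UMLS schema ID, e.g. UMLS2021AB")
    return parser.parse_args(argv)

args = parse_args(["UMLS2021AB"])
# args.schema would then be interpolated into the export queries.
```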

AlanSimmons commented 2 years ago

Added the Python script to the UMLS-Graph repo.

AlanSimmons commented 2 years ago

Note: I did not change the original script workflow. The steps to append NDC data to the exports of CODEs.CSV and CUI-CODEs.CSV are at the end of the new script, too.
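Since the original workflow was kept, the final NDC step still appends rows to the already-written exports. A minimal sketch of that append, assuming hypothetical row shapes for the CODEs.CSV columns:

```python
import csv

def append_rows(filename, rows):
    # The NDC step at the end of the script: open the existing export in
    # append mode and add the NDC-derived rows. The column layout below
    # is a hypothetical simplification of the real CODEs.CSV file.
    with open(filename, "a", newline="") as f:
        csv.writer(f).writerows(rows)

# Create a small stand-in export, then append NDC rows to it.
with open("CODEs.CSV", "w", newline="") as f:
    csv.writer(f).writerows([
        ["CodeID", "SAB", "CODE"],
        ["SNOMEDCT_US 12345", "SNOMEDCT_US", "12345"],
    ])
append_rows("CODEs.CSV", [["NDC 0002-1433-80", "NDC", "0002-1433-80"]])
```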

AlanSimmons commented 2 years ago

After discussions with @computationdoc, I understand that the code in the notebook script may not execute the most up-to-date algorithm. If that is the case, I'll need to update the Python script. Any code differences are likely to be minor; the CSV output, on the other hand, might differ.

AlanSimmons commented 1 year ago

@shirey I recommend that we either move this to Backlog or close. We're likely to rely on the manual process for some time.

AlanSimmons commented 1 year ago

This will be moved to the new UBKG repo.