monarch-initiative / mondo

Mondo Disease Ontology
http://obofoundry.org/ontology/mondo
Creative Commons Attribution 4.0 International

Running `refresh-merged` returns out of memory error #8308

Open twhetzel opened 2 days ago

twhetzel commented 2 days ago

I ran into this in October and was able to solve it by setting the memory to 60GB. However, this time running `sh run.sh make refresh-merged` (following these docs: https://mondo.readthedocs.io/en/latest/editors-guide/import-terms-for-logical-axioms/), I am still getting this error:

```
    annotate --ontology-iri http://purl.obolibrary.org/obo/mondo/imports/merged_import.owl annotate -V http://purl.obolibrary.org/obo/mondo/releases/2024-11-06/imports/merged_import.owl --annotation owl:versionInfo 2024-11-06 convert -f ofn --output imports/merged_import.owl.tmp.owl && mv imports/merged_import.owl.tmp.owl imports/merged_import.owl; fi
Killed
make[1]: *** [Makefile:448: imports/merged_import.owl] Error 137
rm imports/foodon_terms.txt imports/omo_terms.txt imports/ncbigene_terms.txt imports/envo_terms.txt imports/ncit_terms.txt
make[1]: Leaving directory '/work/src/ontology'
make: *** [Makefile:481: refresh-merged] Error 2
```

We need this process to work as part of the general SOP for the Mondo release cycle, as well as for regular curation tasks such as adding genes and other entities into Mondo.
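
For reference, the way I set the memory looks roughly like this (a sketch; it assumes the ODK `run.sh` wrapper picks up `MEMORY_GB` from the environment and translates it into the JVM heap for robot, which may vary by ODK version):

```sh
# Sketch: raise the memory available to robot inside the ODK container.
# Assumption: the ODK wrapper turns MEMORY_GB into ROBOT_JAVA_ARGS (-Xmx);
# older setups may need ROBOT_JAVA_ARGS exported directly instead.
export MEMORY_GB=60
sh run.sh make refresh-merged
```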

matentzn commented 2 days ago

"Killed" often does not mean that the process was out of memory - it usually means that the process was trying to allocate more memory than was available to docker. This could have different reasons, for example, if docker has already allocated some memory. For example:

  1. docker is assigned 80G
  2. Your process is assigned 70G (Ie. ROBOT_JAVA_ARGS etc)
  3. 10GB + is already occupied in docker for some reason (there are some, including another running process)

I would restart docker, make sure your docker has 80GB assigned, your process 60GB, and no other process is running.
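
Before re-running, it is worth confirming what Docker actually has and that nothing else is holding memory (a quick sketch using standard Docker commands):

```sh
# Total memory available to the Docker VM, in bytes
# (on Docker Desktop this is the "Memory" value under Resources).
docker info --format '{{.MemTotal}}'

# Anything else running that could be holding memory?
docker ps
docker stats --no-stream
```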

If this does not work, we will have to work with Kevin to create an NCBIgene slim during the ncbigene ingest pipeline that is suitable for ontologies (easy enough, but let's first try the above).

twhetzel commented 1 day ago

I re-ran the refresh-merged goal after restarting Docker and setting it to 80GB, ran with `export MEMORY_GB=60`, and had no other processes running. After about 40 minutes it failed again with the same error as before:

```
remove  --term rdfs:label  --term IAO:0000115  --term IAO:0000116  --term IAO:0100001  --term owl:deprecated -T imports/merged_terms_combined.txt --select complement --select "annotation-properties" \
    query --update ../sparql/inject-subset-declaration.ru --update ../sparql/inject-synonymtype-declaration.ru --update ../sparql/postprocess-module.ru \
    annotate --ontology-iri http://purl.obolibrary.org/obo/mondo/imports/merged_import.owl annotate -V http://purl.obolibrary.org/obo/mondo/releases/2024-11-06/imports/merged_import.owl --annotation owl:versionInfo 2024-11-06 convert -f ofn --output imports/merged_import.owl.tmp.owl && mv imports/merged_import.owl.tmp.owl imports/merged_import.owl; fi
Killed
make[1]: *** [Makefile:448: imports/merged_import.owl] Error 137
rm imports/foodon_terms.txt imports/omo_terms.txt imports/ncbigene_terms.txt imports/envo_terms.txt imports/ncit_terms.txt
make[1]: Leaving directory '/work/src/ontology'
make: *** [Makefile:481: refresh-merged] Error 2
```

My Docker settings: [screenshot of Docker Desktop's Resources settings]

@matentzn any other suggestions? If not, how do we get the alternative underway ("work with Kevin to create an NCBIgene slim during the ncbigene ingest pipeline that is suitable for ontologies")? It would be great if this could be done by Week 3 (Nov. 18) of the Mondo Release Cycle SOP, so I can refresh the imports as part of the SOP on Friday, Nov. 22.

matentzn commented 1 day ago

@kevinschaper can you help with this? Would it be possible, given a set of taxon ids (and gene ids), to efficiently subset the ncbigene ingest before it makes its way into the Mondo pipeline?

kevinschaper commented 1 day ago

Ooh, we already filter by taxon, but we could absolutely make an additional rdf file that starts from the original and filters down to a subset of genes. I think we’d subset the primary tsv output to just the genes, then use kgx to produce rdf from that.
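
Roughly along these lines, assuming a pre-filtered node TSV already exists (file names are placeholders, and the `kgx transform` flags are from memory, so double-check them against `kgx transform --help`):

```sh
# Convert a (hypothetical) pre-filtered KGX node TSV to RDF N-Triples.
kgx transform --input-format tsv \
    --output ncbigene_subset.nt --output-format nt \
    ncbigene_nodes_subset.tsv
```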

twhetzel commented 1 day ago

Awesome, thanks @kevinschaper! Let me know what I need to do here once that is ready.

kevinschaper commented 1 day ago

@twhetzel how should I get the gene list out of mondo?

twhetzel commented 1 day ago

Hmm, I can point you to the properties that are used in Mondo for the genes, but I'm confused about the overall process here, since we run the refresh-merged goal in order to get the genes into Mondo. Will what you're thinking still work to get new genes etc. into Mondo? FWIW, this is the process we use: https://mondo.readthedocs.io/en/latest/editors-guide/import-terms-for-logical-axioms/

matentzn commented 1 day ago

Hmmm, is there no other way than to do the module extraction so early? No @twhetzel, this won't work... I guess the problem is that we have requested all genes for all taxa for which even a single disease is mentioned... maybe that is not needed, and we can provide a much smaller list of genes?

Alternatively, we have to do some preprocessing of the file with something other than ROBOT that is more memory-efficient...

kevinschaper commented 1 day ago

What about pulling the tsv from the ncbi ingest, filtering it to a subset of rows, and then using kgx to convert that little kgx tsv to rdf?

matentzn commented 1 day ago

That could work, yes. I assume there is no KGX filter command I can use? Would we need to provide a small custom Python script?
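
For the filter itself, even a one-liner might do in place of a Python script (a sketch; it assumes the node TSV's first column is the CURIE id and that we have a one-id-per-line gene list, here called `mondo_gene_ids.txt`):

```sh
# Keep the header row plus every node whose id (column 1) is in the gene list.
awk -F'\t' 'NR==FNR {keep[$1]; next} FNR==1 || ($1 in keep)' \
    mondo_gene_ids.txt ncbigene_nodes.tsv > ncbigene_nodes_subset.tsv
```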

twhetzel commented 17 hours ago

To help me track this, who will try this next option of "pulling the tsv from the ncbi ingest, filtering it to a subset of rows, and then using kgx to convert that little kgx tsv to rdf"? @matentzn, @kevinschaper, or me (given a few more pointers here on what to do)?