monarch-initiative / mondo-ingest

Coordinating the mondo-ingest with external sources
https://monarch-initiative.github.io/mondo-ingest/
6 stars 3 forks source link

Grouping issue: Automated testing #318

Open joeflack4 opened 1 year ago

joeflack4 commented 1 year ago

Overview

I felt it would be a good idea to create a general issue around automated testing (unit tests, integration tests, etc). Right now we don't have any tests at all, I believe.

Modules & specific cases to test

Details

These are some of the workflows that I think could most use some tests. For "whole pipeline itself", we would just download sources and run; dynamic inputs. For each of these other "modules", we probably should create some static input files. Integration tests (checking that outputs match expectations) are higher priority than unit tests (e.g. at the function level). - [ ] 1. E2E: Whole pipeline itself - [ ] 2. Migration pipeline (slurp) - [ ] 3. "Unmapped" artefact pipelines - [ ] 4. Exclusions pipeline - [ ] 5. prefix maps E.g. the bimaps, i.e. `mondo.sssom.config.yaml`'s `curie_map` (moot if fully replaced by the `extended_prefix_map` in there) and `config/prefixes.csv` (moot once it is dynamically generated from `mondo.sssom.config.yaml`). Basically we want to ensure it is bijective; i.e. no duplciate URI prefixes.

1. Whole pipeline

Details

[Nico wrote](https://github.com/monarch-initiative/mondo-ingest/pull/319#discussion_r1208355796): > Rather than creating many such tests, a better general strategy could be to run the whole odk test pipeline with a subset of the configured ontologies. I have never tried this, maybe something like: > ```shell > sh run.sh make build-mondo-ingest OTHER_SRC=components/ordo.owl components/doid.owl > ```

2. Migration pipeline (slurp)

Details

**Possible tests**: I. Mock terms Can create some static input files where we can create some mock terms that meet various conditions, and we can assert what we expect after running the workflow (e.g. whether or not the term appears in `slurp/%.tsv`). - excluded / not - deprecated /not - parents deprecated (0, 1, or all) - parents not meet narrow/exact match conditions? II. Output file should never have more than 1k rows

3. "Unmapped" artefact pipelines

Details

**Possible tests**: I. Every ingest should have no more than 50% unmapped classes The easiest way to check this is to use `scripts/unmapped_docs.py`. There is a function that creates the table in `unmapped.md` in the `docs/` which has these %'s calculated.

Creating test input files

GitHub action

Details

Should run: - Integration & unit tests - ODK `Makefile` `test` goal Should not run: - The whole pipeline: Not enough memory, and we'd want a separate action for that anyway. When: - `schedule`?: Doesn't make sense unless dynamic inputs - `push`: To `main` - `pull_request`: To `main`

Makefile set up

Details

We have an existing Makefile test goal, but this is for common cases for ODK projects. To incorporate integration tests and unit tests, we probably want something like: ```make test-all: test mondo-ingest-tests mondo-ingest-tests: python -m unittest discover ``` Nico thought about integration-test but we might have unit tests as well. I don't know if it makes sense to create a 2nd goal for unit-test; I don't see that we'll really want to run them separately.

Discussion

Module tests in isolation with static inputs

Details

### Possible future changes to static test files **Adding ontology files**: Although, there is some rationale to maybe creating a shortened, static version of ontology files, e.g. `ordo.owl` which will allow SPARQL queries, etc, to run faster. Will also prevent failures in cases where the ontology changes.

Caveats / more possible future changes

Details

**Makefile test setup**: I didn't make any dependencies for the tests goal, so if some of those files don't exist, it'll error out. Maybe this is a design flaw. I up the tests goal to conveniently automatically find and run all Python unit tests in the project. Perhaps a different way would be to split things up, like a `test-slurp` that has goal dependencies, and then a tests goal with `test-slurp` as its dependency.

joeflack4 commented 1 year ago

@matentzn We discussed testing today at our meeting. Rather than making a new issue, I updated this existing one with our thoughts.

The reason it's called a "grouping issue" is because I don't expect it to be something where we specifically pick a time to tackle the whole issue at once.

joeflack4 commented 1 year ago

@matentzn Btw, I ran make test today in mondo-ingest and the issue here is that NCIT fails because of memory.

Log

``` make test mondo-ingest.Makefile:39: warning: overriding commands for target `components/omim.owl' Makefile:447: warning: ignoring old commands for target `components/omim.owl' mondo-ingest.Makefile:54: warning: overriding commands for target `components/ordo.owl' Makefile:458: warning: ignoring old commands for target `components/ordo.owl' mondo-ingest.Makefile:73: warning: overriding commands for target `components/ncit.owl' Makefile:436: warning: ignoring old commands for target `components/ncit.owl' mondo-ingest.Makefile:84: warning: overriding commands for target `components/doid.owl' Makefile:392: warning: ignoring old commands for target `components/doid.owl' mondo-ingest.Makefile:106: warning: overriding commands for target `component-download-icd10cm.owl' Makefile:410: warning: ignoring old commands for target `component-download-icd10cm.owl' mondo-ingest.Makefile:112: warning: overriding commands for target `components/icd10cm.owl' Makefile:414: warning: ignoring old commands for target `components/icd10cm.owl' mondo-ingest.Makefile:127: warning: overriding commands for target `components/icd10who.owl' Makefile:425: warning: ignoring old commands for target `components/icd10who.owl' mondo-ingest.Makefile:142: warning: overriding commands for target `component-download-gard.owl' Makefile:399: warning: ignoring old commands for target `component-download-gard.owl' mondo-ingest.Makefile:146: warning: overriding commands for target `components/gard.owl' Makefile:403: warning: ignoring old commands for target `components/gard.owl' mondo-ingest.Makefile:151: warning: overriding commands for target `mondo-ingest-full.owl' Makefile:588: warning: ignoring old commands for target `mondo-ingest-full.owl' mondo-ingest.Makefile:545: warning: overriding commands for target `help' Makefile:659: warning: ignoring old commands for target `help' echo "ODK Makefile version: v1.5 (this is the version of the ODK with which this Makefile was generated, \ not the version of the ODK you are running)" &&\ echo "ROBOT version (ODK): " && robot --catalog catalog-v001.xml --version ODK Makefile version: v1.5 (this is the version of the ODK with which this Makefile was generated, not the version of the ODK you are running) ROBOT version (ODK): ROBOT version 1.9.1 robot --catalog catalog-v001.xml reason --input tmp/mondo-ingest-preprocess.owl --reasoner ELK --equivalent-classes-allowed asserted-only \ --exclude-tautologies structural --output test.owl && rm test.owl robot --catalog catalog-v001.xml verify -i tmp/mondo-ingest-preprocess.owl --queries ../sparql/properties-as-annotation-and-object-violation.sparql ../sparql/ordo-mapping-annotations-violation.sparql -O reports bPASS Rule ../sparql/properties-as-annotation-and-object-violation.sparql: 0 violation(s) PASS Rule ../sparql/ordo-mapping-annotations-violation.sparql: 0 violation(s) if [ true = true ] && [ true = true ]; then robot --catalog catalog-v001.xml merge -I http://purl.obolibrary.org/obo/doid.owl \ annotate --ontology-iri http://purl.obolibrary.org/obo/mondo-ingest/component-download-doid.owl annotate -V http://purl.obolibrary.org/obo/mondo-ingest/releases/2023-08-31/component-download-doid.owl --annotation owl:versionInfo 2023-08-31 -o tmp/component-download-doid.owl.owl; fi ^R beep if [ true = true ]; then robot --catalog catalog-v001.xml query -i "tmp/component-download-doid.owl.owl" -q "../sparql/doid-relevant-signature.sparql" tmp/doid_relevant_signature.txt; fi if [ true = true ]; then robot --catalog catalog-v001.xml remove -i tmp/component-download-doid.owl.owl --select imports \ rename --mappings config/property-map-1.sssom.tsv --allow-missing-entities true \ rename --mappings config/property-map-2.sssom.tsv --allow-missing-entities true \ remove -T tmp/doid_relevant_signature.txt --select complement --select "classes individuals" --trim false \ query \ --update ../sparql/[fix_omimps.ru](http://fix_omimps.ru/) \ --update ../sparql/[fix_hgnc_mappings.ru](http://fix_hgnc_mappings.ru/) \ --update ../sparql/[fix-labels-with-brackets.ru](http://fix-labels-with-brackets.ru/) \ --update ../sparql/[fix_complex_reification.ru](http://fix_complex_reification.ru/) \ --update ../sparql/[rm_xref_by_prefix.ru](http://rm_xref_by_prefix.ru/) \ --update ../sparql/[fix_make_omim_exact.ru](http://fix_make_omim_exact.ru/) \ remove -T config/properties.txt --select complement --select properties --trim true \ annotate --ontology-iri http://purl.obolibrary.org/obo/mondo/sources/doid.owl --version-iri http://purl.obolibrary.org/obo/mondo/sources/2023-08-31/doid.owl -o components/doid.owl; fi if [ true = true ] && [ true = true ]; then robot --catalog catalog-v001.xml merge -I https://github.com/monarch-initiative/gard/releases/latest/download/gard.owl \ annotate --ontology-iri http://purl.obolibrary.org/obo/mondo-ingest/component-download-gard.owl annotate -V http://purl.obolibrary.org/obo/mondo-ingest/releases/2023-08-31/component-download-gard.owl --annotation owl:versionInfo 2023-08-31 -o tmp/component-download-gard.owl.owl; fi if [ true = true ]; then robot --catalog catalog-v001.xml query -i "tmp/component-download-gard.owl.owl" -q "../sparql/gard-relevant-signature.sparql" tmp/gard_relevant_signature.txt; fi if [ true = true ]; then robot --catalog catalog-v001.xml remove -i tmp/component-download-gard.owl.owl --select imports \ remove -T tmp/gard_relevant_signature.txt --select complement --select "classes individuals" --trim false \ annotate --ontology-iri http://purl.obolibrary.org/obo/mondo/sources/gard.owl --version-iri http://purl.obolibrary.org/obo/mondo/sources/2023-08-31/gard.owl -o components/gard.owl; fi if [ true = true ]; then wget "https://data.bioontology.org/ontologies/ICD10CM/submissions/22/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb" -O tmp/icd10cm.tmp.owl && robot --catalog catalog-v001.xml remove -i tmp/icd10cm.tmp.owl --select imports \ remove -T config/remove_properties.txt \ annotate --ontology-iri http://purl.obolibrary.org/obo/mondo-ingest/component-download-icd10cm.owl annotate -V http://purl.obolibrary.org/obo/mondo-ingest/releases/2023-08-31/component-download-icd10cm.owl --annotation owl:versionInfo 2023-08-31 -o tmp/component-download-icd10cm.owl.owl; fi --2023-08-31 19:10:09-- https://data.bioontology.org/ontologies/ICD10CM/submissions/22/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb Resolving [data.bioontology.org](http://data.bioontology.org/) ([data.bioontology.org](http://data.bioontology.org/))... 171.64.13.9 Connecting to [data.bioontology.org](http://data.bioontology.org/) ([data.bioontology.org](http://data.bioontology.org/))|171.64.13.9|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 50870565 (49M) [application/octet-stream] Saving to: ‘tmp/icd10cm.tmp.owl’ tmp/icd10cm.tmp 100%[=====>] 48.51M 1.34MB/s in 37s 2023-08-31 19:10:51 (1.30 MB/s) - ‘tmp/icd10cm.tmp.owl’ saved [50870565/50870565] if [ true = true ]; then robot --catalog catalog-v001.xml query -i "tmp/component-download-icd10cm.owl.owl" -q "../sparql/icd10cm-relevant-signature.sparql" tmp/icd10cm_relevant_signature.txt; fi if [ true = true ]; then robot --catalog catalog-v001.xml merge -i tmp/component-download-icd10cm.owl.owl \ rename --mappings config/property-map-1.sssom.tsv --allow-missing-entities true \ rename --mappings config/property-map-2.sssom.tsv --allow-missing-entities true \ remove -T tmp/icd10cm_relevant_signature.txt --select complement --select "classes individuals" --trim false \ remove -T tmp/icd10cm_relevant_signature.txt --select individuals \ query \ --update ../sparql/[fix_omimps.ru](http://fix_omimps.ru/) \ --update ../sparql/[fix-labels-with-brackets.ru](http://fix-labels-with-brackets.ru/) \ --update ../sparql/[fix_hgnc_mappings.ru](http://fix_hgnc_mappings.ru/) \ --update ../sparql/[fix_complex_reification.ru](http://fix_complex_reification.ru/) \ remove -T config/properties.txt --select complement --select properties --trim true \ annotate --ontology-iri http://purl.obolibrary.org/obo/mondo/sources/icd10cm.owl --version-iri http://purl.obolibrary.org/obo/mondo/sources/2023-08-31/icd10cm.owl -o components/icd10cm.owl; fi if [ true = true ] && [ true = true ]; then robot --catalog catalog-v001.xml merge -I https://github.com/monarch-initiative/icd10who/releases/latest/download/icd10who.ttl \ annotate --ontology-iri http://purl.obolibrary.org/obo/mondo-ingest/component-download-icd10who.owl annotate -V http://purl.obolibrary.org/obo/mondo-ingest/releases/2023-08-31/component-download-icd10who.owl --annotation owl:versionInfo 2023-08-31 -o tmp/component-download-icd10who.owl.owl; fi if [ true = true ]; then robot --catalog catalog-v001.xml query -i "tmp/component-download-icd10who.owl.owl" -q "../sparql/icd10who-relevant-signature.sparql" tmp/icd10who_relevant_signature.txt; fi if [ true = true ] ; then robot --catalog catalog-v001.xml remove -i tmp/component-download-icd10who.owl.owl --select imports \ rename --mappings config/property-map-1.sssom.tsv --allow-missing-entities true \ rename --mappings config/property-map-2.sssom.tsv --allow-missing-entities true \ remove -T tmp/icd10who_relevant_signature.txt --select complement --select "classes individuals" --trim false \ remove -T tmp/icd10who_relevant_signature.txt --select individuals \ query \ --update ../sparql/[fix_omimps.ru](http://fix_omimps.ru/) \ --update ../sparql/[fix-labels-with-brackets.ru](http://fix-labels-with-brackets.ru/) \ --update ../sparql/[fix_hgnc_mappings.ru](http://fix_hgnc_mappings.ru/) \ --update ../sparql/[fix_complex_reification.ru](http://fix_complex_reification.ru/) \ remove -T config/properties.txt --select complement --select properties --trim true \ annotate --ontology-iri http://purl.obolibrary.org/obo/mondo/sources/icd10who.owl --version-iri http://purl.obolibrary.org/obo/mondo/sources/2023-08-31/icd10who.owl -o components/icd10who.owl; fi if [ true = true ] && [ true = true ]; then robot --catalog catalog-v001.xml merge -I http://purl.obolibrary.org/obo/ncit.owl \ annotate --ontology-iri http://purl.obolibrary.org/obo/mondo-ingest/component-download-ncit.owl annotate -V http://purl.obolibrary.org/obo/mondo-ingest/releases/2023-08-31/component-download-ncit.owl --annotation owl:versionInfo 2023-08-31 -o tmp/component-download-ncit.owl.owl; fi if [ true = true ]; then robot --catalog catalog-v001.xml query -i "tmp/component-download-ncit.owl.owl" -q "../sparql/ncit-relevant-signature.sparql" tmp/ncit_relevant_signature.txt; fi if [ false = false ] && [ true = true ]; then robot --catalog catalog-v001.xml remove -i tmp/component-download-ncit.owl.owl --select imports \ rename --mappings config/property-map-1.sssom.tsv --allow-missing-entities true \ rename --mappings config/property-map-2.sssom.tsv --allow-missing-entities true \ query --update ../sparql/[rm_xref_by_prefix.ru](http://rm_xref_by_prefix.ru/) \ remove -T tmp/ncit_relevant_signature.txt --select complement --select "classes individuals" --trim false \ remove -T config/properties.txt --select complement --select properties --trim true \ remove --term "http://purl.obolibrary.org/obo/NCIT_C179199" --axioms "equivalent" \ annotate --ontology-iri http://purl.obolibrary.org/obo/mondo/sources/ncit.owl --version-iri http://purl.obolibrary.org/obo/mondo/sources/2023-08-31/ncit.owl -o components/ncit.owl; fi Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.base/java.util.Arrays.copyOf(Arrays.java:3537) at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:100) at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:130) at java.base/sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:234) at java.base/sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:304) at java.base/sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282) at java.base/sun.nio.cs.StreamEncoder.write(StreamEncoder.java:132) at java.base/java.io.OutputStreamWriter.write(OutputStreamWriter.java:205) at org.apache.jena.atlas.io.BufferingWriter.flushBuffer(BufferingWriter.java:132) at org.apache.jena.atlas.io.BufferingWriter.output(BufferingWriter.java:126) at org.apache.jena.atlas.io.BufferingWriter.write(BufferingWriter.java:175) at org.apache.jena.atlas.io.IndentedWriter.write$(IndentedWriter.java:187) at org.apache.jena.atlas.io.IndentedWriter.printOneChar(IndentedWriter.java:182) at org.apache.jena.atlas.io.IndentedWriter.print(IndentedWriter.java:123) at org.apache.jena.riot.out.quoted.QuotedURI.writeURI(QuotedURI.java:46) at org.apache.jena.riot.out.NodeFormatterNT.formatURI(NodeFormatterNT.java:46) at org.apache.jena.riot.out.NodeFormatterTTL.formatURI(NodeFormatterTTL.java:88) at org.apache.jena.riot.out.NodeFormatterBase.formatURI(NodeFormatterBase.java:70) at org.apache.jena.riot.out.NodeFormatterBase.format(NodeFormatterBase.java:43) at org.apache.jena.riot.writer.TurtleShell.writeNode(TurtleShell.java:982) at org.apache.jena.riot.writer.TurtleShell$ShellGraph.writePredicate(TurtleShell.java:750) at org.apache.jena.riot.writer.TurtleShell$ShellGraph.writePredicateObjectList(TurtleShell.java:715) at org.apache.jena.riot.writer.TurtleShell$ShellGraph.writePredicateObjectList(TurtleShell.java:693) at org.apache.jena.riot.writer.TurtleShell$ShellGraph.writeNestedObject(TurtleShell.java:828) at org.apache.jena.riot.writer.TurtleShell$ShellGraph.writeNestedObjectTopLevel(TurtleShell.java:796) at org.apache.jena.riot.writer.TurtleShell$ShellGraph.writeBySubject(TurtleShell.java:618) at org.apache.jena.riot.writer.TurtleShell$ShellGraph.writeGraph(TurtleShell.java:498) at org.apache.jena.riot.writer.TurtleShell$ShellGraph.access$100(TurtleShell.java:111) at org.apache.jena.riot.writer.TurtleShell.writeGraphTTL(TurtleShell.java:97) at org.apache.jena.riot.writer.TurtleWriter$TurtleWriter$.write(TurtleWriter.java:44) at org.apache.jena.riot.writer.TurtleWriter$TurtleWriter$.access$000(TurtleWriter.java:34) at org.apache.jena.riot.writer.TurtleWriter.output(TurtleWriter.java:31) make: *** [components/ncit.owl] Error 1 ```

I tried this flag even though I didn't expect it to work for make test. It doesn't. Still running out of memory. Maybe we need to implement a similar flag? IMP_LARGE=true # Global parameter to bypass handling of large imports

matentzn commented 1 year ago

You need I think ca 18 GB memory in Java and 21 GB memory in docker to run this locally I think.. Do you have that configured?

Else, cant if you don't need to update sources, you could use COMP=false to bypass all the component generations?

joeflack4 commented 1 year ago

Yeah I have 28GB allocated in Docker but I didn't pass -Xmx 18g or whatever Java flag. If I'm running make test, I'm not sure how to enable that; maybe an edit to run.sh? We can look into that later.

I will try COMP=false and report back, thanks.

joeflack4 commented 1 year ago

Darn, looks like the COMP=false didn't work. Again, not the biggest priority; we can worry about this when you have time.

matentzn commented 1 year ago

Why not? If you look at for example:

https://github.com/monarch-initiative/mondo-ingest/blob/main/src/ontology/mondo-ingest.Makefile#L39

COMP=false should prevent components to be rebuild. Or are you saying a non-component goals still needs too much memory?

joeflack4 commented 1 year ago

Hmm, maybe I was running it the wrong way. I tried: COMP=false sh run.sh make test

I was thinking that the order of the flag shouldn't matter, but I'll try the below one now. sh run.sh make COMP=false test

Here's all that remains from the log from yesterday. You can see it's doing something with tmp/component-download-ncit.owl.owl.

Log

``` --update ../sparql/fix_complex_reification.ru \ remove -T config/properties.txt --select complement --select properties --trim true \ annotate --ontology-iri http://purl.obolibrary.org/obo/mondo/sources/icd10who.owl --version-iri http://purl.obolibrary.org/obo/mondo/sources/2023-09-11/icd10who.owl -o components/icd10who.owl; fi if [ true = true ] && [ true = true ]; then robot --catalog catalog-v001.xml merge -I http://purl.obolibrary.org/obo/ncit.owl \ annotate --ontology-iri http://purl.obolibrary.org/obo/mondo-ingest/component-download-ncit.owl annotate -V http://purl.obolibrary.org/obo/mondo-ingest/releases/2023-09-11/component-download-ncit.owl --annotation owl:versionInfo 2023-09-11 -o tmp/component-download-ncit.owl.owl; fi if [ true = true ]; then robot --catalog catalog-v001.xml query -i "tmp/component-download-ncit.owl.owl" -q "../sparql/ncit-relevant-signature.sparql" tmp/ncit_relevant_signature.txt; fi if [ false = false ] && [ true = true ]; then robot --catalog catalog-v001.xml remove -i tmp/component-download-ncit.owl.owl --select imports \ rename --mappings config/property-map-1.sssom.tsv --allow-missing-entities true \ rename --mappings config/property-map-2.sssom.tsv --allow-missing-entities true \ query --update ../sparql/rm_xref_by_prefix.ru \ remove -T tmp/ncit_relevant_signature.txt --select complement --select "classes individuals" --trim false \ remove -T config/properties.txt --select complement --select properties --trim true \ remove --term "http://purl.obolibrary.org/obo/NCIT_C179199" --axioms "equivalent" \ annotate --ontology-iri http://purl.obolibrary.org/obo/mondo/sources/ncit.owl --version-iri http://purl.obolibrary.org/obo/mondo/sources/2023-09-11/ncit.owl -o components/ncit.owl; fi Killed make: *** [mondo-ingest.Makefile:73: components/ncit.owl] Error 137 Command exited with non-zero status 2 ### DEBUG STATS ### Elapsed time: 39:55.68 Peak memory: 15397432 kb ```

I believe I also tried SKIP_HUGE=true earlier as well, but I'm not 100% certain.

matentzn commented 1 year ago

COMP=false sh run.sh make test

If you do this, you are passing the COMP parameter to the shell script, not make. The second solution is the right one.

"Killed" means it ran out of memory, or rather, not often than not, that you gave more memory to the robot process (check run.sh) then you allocated to docker. Always make sure you are 15-20% below docker Max.

joeflack4 commented 1 year ago

Yeah I should have tried that earlier. It passed:

Finished running all tests successfully.
Elapsed time: 14:30.81
Peak memory: 8562068 kb

I know about Killed. I'll jot down a note about that robot memory mismatch.. good to remember if it ever comes up.