Correct label and biological process issues for model annotation loader, possible other improvements

kltm commented 6 years ago

Looking at the output of a loader run: http://tomodachi.berkeleybop.org/amigo/search/model_annotation

The values in the return documents that drive this interface are like:

      {
        "document_category":"model_annotation",
        "id":"gomodel:581e072c00000062|http://model.geneontology.org/581e072c00000062/581e072c00000098|FB:FBgn0005672|GO:0048019|",
        "annotation_unit":"gomodel:581e072c00000062|http://model.geneontology.org/581e072c00000062/581e072c00000098|FB:FBgn0005672|GO:0048019|",
        "annotation_unit_label":"gomodel:581e072c00000062 spi Dmel GO:0048019",
        "annotation_unit_label_searchable":"gomodel:581e072c00000062 spi Dmel GO:0048019",
        "model":"gomodel:581e072c00000062",
        "model_label":"EGFR_SPI_DMEL",
        "model_label_searchable":"EGFR_SPI_DMEL",
        "model_label_searchable":"EGFR_SPI_DMEL",
        "model_url":"http://noctua.berkeleybop.org/editor/graph/581e072c00000062.ttl",
        "model_state":"development",
        "model_date":"2016-11-07",
        "model_date_searchable":"2016-11-07",
        "model_date_searchable":"2016-11-07",
        "enabled_by":"FB:FBgn0005672",
        "enabled_by_searchable":"FB:FBgn0005672",
        "enabled_by_label":"spi Dmel",
        "enabled_by_label_searchable":"spi Dmel",
        "enabled_by_label_searchable":"spi Dmel",
        "function_class":"GO:0048019",
        "function_class_label":"GO:0048019",
        "function_class_label_searchable":"GO:0048019",
        "function_class_label_searchable":"GO:0048019",
        "function_class_closure_map":"{GO:0048019=null}",
        "location_list_closure_map":"{GO:0005615=null}",
        "owl_blob_json":"JSON_BLOB_REMOVED_FOR_BREVITY",
        "evidence_type":"ECO:0000315",
        "evidence_type_label":"ECO:0000315",
        "evidence_type_label_searchable":"ECO:0000315",
        "evidence_type_label_searchable":"ECO:0000315",
        "evidence_type_closure_map":"{ECO:0000315=null}",
        "reference":["PMID:7833286","PMID:7833286\t"],
        "contributor":["http://orcid.org/0000-0003-3212-6364"],
        "evidence_type_closure":["ECO:0000315"],
        "reference_searchable":[
          "PMID:7833286",
          "PMID:7833286\t",
          "PMID:7833286",
          "PMID:7833286\t"],
        "contributor_searchable":[
          "http://orcid.org/0000-0003-3212-6364",
          "http://orcid.org/0000-0003-3212-6364"],
        "location_list_closure":["GO:0005615"],
        "function_class_closure":["GO:0048019"],
        "location_list":["GO:0005615"],
        "location_list_label":["GO:0005615"],
        "score":1.0
}

There are several oddities in this load:

The id and annotation_unit are generated as a pipe-separated combination of model id, M, P, and C; it seems that one of those is missing as there is a trailing pipe
all *_searchable fields seem to be repeats; this may not be a "real" issue, practically speaking
reference and reference_searchable seem to contain tabs; they should not
all *_closure_map fields contain a "null", which would seem to indicate some issue with JSON deserialization or similar; the form should be id:label; note that this could merely be that there is something wrong with labels (see below)
for *_label*, in almost all cases, there are no labels, just ids; it could be that the ontology information to get label is not loading properly as part of the command or something deeper
there should be a field set process_class, just as there is function_class; no process fields exist

There are also a couple of issues that should be coordinated with other development, such as either switching to names for contributor (or making an id/label map), adding groups, etc.

While these may all be separate tickets, the tooling to get at the issues is similar for starters.

kltm commented 6 years ago

The used command line is:

This requires a checkout of geneontology/noctua-models; I suspect that --solr-url should be set to mock to prevent actually attempting a load against a server.

kltm commented 6 years ago

Tagging @yy20716

yy20716 commented 6 years ago

I began checking one of the issues, i.e. the field process_class is not generated. I speculate that this issue is also related to the namespace issue. Here is the trace:

The main problem is that that the process map here is always empty.
This process map is generated by checking whether bpSet.contains(bpType) from this part but the bpSet is always empty.
The bpSet is (supposed to be) filled up by the method getAspect which uses graph.getNamespace, but it always return the null right now.

@cmungall I think that this happens because the input graph (e.g. go-gaf.owl?) does not include namespace tags anymore. This would work again if we could disable or revise the logic that checks namespace, but please let me know if you have any other suggestions. Thank you.

cmungall commented 6 years ago

This is familiar, I think you found the exact issue earlier when you were debugging the prediction code.

See https://github.com/owlcollab/owltools/pull/239

cmungall commented 6 years ago

bpSet = reasoner.getSubClassesOf(IRI.create("http://purl.obolibrary.org/obo/GO_0008150"))

yy20716 commented 6 years ago

@cmungall I tried your approach but that one does not work because getSubClass does not accept IRI object.

reasoner.getSubClassesOf(IRI.create("http://purl.obolibrary.org/obo/GO_0008150"))

So I tried this way:

OWLClass c = graph.getManager().getOWLDataFactory().getOWLClass(IRI.create("http://purl.obolibrary.org/obo/GO_0008150"));

The problem is that the ontologies do not have the term GO_0008150 inside, e.g.,

System.out.println("c: " + c + " num: " + model.getDeclarationAxioms(c).size());

The above code prints

c: <http://purl.obolibrary.org/obo/GO_0008150> num: 0

i.e., no such instances in ontology. Because of that, the reasoner just returns owl:Nothing, thus the bpSet is still null. I guess that's why the previous codes refers to the namespace tag (instead of depending on the reasoner). Can you please advise what to do in this situation? I first thought that we could make and use the separate reasoner instance that loaded the full gene ontology, but maybe this is a way overkill.

cmungall commented 6 years ago

On 9 May 2018, at 18:15, HyeongSik Kim wrote:

Chris, I tried your approach but that one does not work because getSubClass does not accept IRI object.
reasoner.getSubClassesOf(IRI.create("http://purl.obolibrary.org/obo/GO_0008150"))
So I tried this way:
OWLClass c = 
graph.getManager().getOWLDataFactory().getOWLClass(IRI.create("http://purl.obolibrary.org/obo/GO_0008150"));

looks good

The problem is that the ontologies do not have the term GO_0008150 inside, e.g.,
System.out.println("c: " + c + " num: " + 
model.getDeclarationAxioms(c).size());
The above code prints
c: <http://purl.obolibrary.org/obo/GO_0008150> num: 0

we don't expect the declarations to be in the model - they should be in the import chain, and the reasoner uses the full import chain

i.e., no such instances in ontology. Because of that, the reasoner just returns owl:Nothing, thus the bpSet is still null. I guess that's why the previous codes refers to the namespace tag (instead of depending on the reasoner). Can you please advise what to do in this situation? I first thought that we could make and use the separate reasoner instance that loaded the full gene ontology, but maybe this is a way overkill.

is this using a test ontology?

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/owlcollab/owltools/issues/253#issuecomment-387921119

yy20716 commented 6 years ago

Uh, so it turns out that I didn't deep-copy all ontologies but only did the ones in noctua-model, which is why the reasoning was not happening properly. I forgot the fact that the single OWLOntologyManager can contains multiple ontologies inside.
I revised the codes so that all loaded ontologies can be deep-copied. I now see that labels are properly generated but we do still not experience memory leaks. I will soon make a pull request that contains revised codes. Thank you so much for your patience.

yy20716 commented 6 years ago

FYI. @cmungall, unfortunately I was not able to find the post I was talking about (it's possible that my memory is not correct or it was just my imagination). However, I got to know the usage of the copy method from this post, i.e. OWLOntologyManager.copyOntology does not copy imports which complains that deep-copying ontology does not process import statements (it is now solved). It's possible that that user would use the copyOntology method for a different purpose, but I thought that this method could be used to address our issue here.
Let me summarize the problem and the solution for you (and possibly others):
- Using reasoners via owlapi (particularly elk) sometimes causes inferred or referred axioms remain in the ontology manager even after we call flush and dispose methods in owlapi, i.e. GC does or cannot collect memories used for such axioms even if they are not used anymore for some reasons. This is not noticeable or not a significant issue if we run the reasoner only one time or a few times, i.e. once the ontology manager is destroyed, all linked axioms and ontologies are automatically removed by GC (and therefore no memory leaks happen).
- However, the situation here may be a bit unique, i.e., we do not destroy the ontology manager but keep using it to avoid repeated loading of tboxes. In other words, we import tbox only once (e.g., go-gaf.owl) and do reasoning over ttl files in noctua-models for generating solr indexes.
  - Essentially what happens here is that we repeat reasoning over abox + tbox, where (i) tbox is fixed and (ii) a different set of aboxes are used each time. The number of abox sets is often pretty large, e.g., approx 2k ttl files in the noctua's case. We observed that unreleased axioms are piling up in the ontology manager for each reasoning session, even if we create and use the reasoner instance for each reasoning step. This issue later causes OOM and halts the execution of owltools.
- It's possible that there are different ways to clean up such axioms, but my observation was that such axioms are only released once the connected ontology manager object is destroyed. However, the challenge here is that we don't want to destroy this manager because we then have to import all tbox statements again, which takes a lot of time. It's possible that this bug comes from the elk but I was not fully sure about this.
- One workaround I found is to deep-copy ontology instances into a temporary manager and destory the copied one once we finish using it. An overall flow can be summarized as follow:
- As usual, we first import all axioms in tbox to the ontology manager
- for each reasoning session, we do not directly create a reasoner over the tbox but
  - Create a temporary manager and deep-copy ontology instances from the initial manager to this new manager,
  - Create a new, temporary reasoner instance over the copied instance and do reasoning.
  - Dispose of all these temporary manager and reasoner.
- Please note that this deep-copying method is available at OWLAPI, i.e., the ontology copy will create a new OWLOntology instance with same ontology annotations, same ID, and same axioms. Format, document IRI will also be copied over. This process does not take much time and memory because it is in-memory operation. This cannot be achieved using other external libraries such as Kyro or Google's cloning library, i.e. reasoners are not serializable.

@balhoff it's a bit late to ask this question to you, but we wonder whether you have experienced any similar issues before. If you have, could you please share your experience and approach? We thought that similar issues could arise in the minerva but maybe it was not that obvious like this case. Thank you.

p.s. as requested, I will see if I can reproduce this issue using simple, dummy ontologies.

cmungall commented 6 years ago

@yy20716 thank you for the detailed analysis and write-up. @ignazio1977 is this something to bring up on the owlapi tracker (ignore the bulk of this ticket, just look at the comment above this one)

ignazio1977 commented 6 years ago

@cmungall Yes it sounds like an issue for OWLAPI. Not sure how this happens, as managers do not refer axioms directly - I'd guess there must be an ontology that doesn't get cleared when emptying the manager. An example, or a heap dump of a manager with the noctua-models loaded, would help.

yy20716 commented 6 years ago

I began to think that this might not be the issue from OWLAPI but something deep in owltools causes them. I was working on making an example case with the same loader codes but I was not able to reproduce this issue, i.e. no memory leaks, while I still see the memory leak from the old owltools' codes. I believe that the cultprits would be OWLGraphWrapper* classes but I am still not sure about the root of the causes yet. But anyway I will continue trying before today and upload the test codes soon.
In my opinion, deep-copying ontologies with a temporary manager would still be a good workaround to avoid the memory leaks. If any other memory leaks would be observed again in the owltools (or other ones such as minerva), the same workaround could be applied for a while. However, my suggestion is that eventually we would need to simplify codes and build a minimal, dedicated loader, so that we can easily track the issue and prevent any similar issues in the future.

Thank you so much for your patience and understanding.

yy20716 commented 6 years ago

Here is the repo that contains the testcodes I built: https://github.com/yy20716/owlapi-repeat-model-load

It loads four different noctua-model files using the same manager but I was not able to reproduce the issue. I hope that someone else could continue investigating this issue and nicely refactor owltools in the future. Thank you so much.

cmungall commented 6 years ago

Thanks @yy20716! We can continue it from here.

@ignazio1977 we'll let you know if we can reproduce this purely in the context of the OWLAPI

ignazio1977 commented 6 years ago

Reading the whole issue, one thing comes to mind: some reasoners register themselves as ontology change listeners with the ontology manager for their root ontology. Letting the reasoner variable go out of scope does not remove all references to the reasoner (the reasoner is still referenced by the ontology manager in its listener collection). In turn, a reasoner would hold references to ontologies and possibly axiom lists.

Usually a reasoner should unregister itself when the dispose() method is called. You might want to verify if this is the issue here by checking the list of ontology change listeners associated with the manager that appears to be leaking memory.

What I'd do is to ensure dispose() is called when I'm done with a reasoner and ensure the list of listeners does not increase in size once I've unloaded the abox that I no longer need to use. If the list is increasing in size, there must be extra listeners not being removed - that would provide a clue as to what is holding onto objects that are no longer needed.

Dropping the manager altogether and working with a new manager - i.e., the workaround you already have in place - would fit with this analysis, as this would effectively also drop all the listeners attached to the old manager. Cleaning the listeners would be much faster than copying the ontologies, though.

fanavarro commented 4 years ago

Hi everyone. I found an strange behavior when copying ontologies with imports and I think it could be related with the issues you are discussing here. I've open an issue in owlapi if you are interested in.

Greetings!

owlcollab / owltools

Correct label and biological process issues for model annotation loader, possible other improvements #253