Thanks for the report! Finally I was able to reproduce the problem.
I wonder if you could reduce your TBox? Currently it is pushing the limits of
the test machine we have (which has just 8GB of RAM). Judging by the different
prefixes, your TBox looks like it was combined from several parts. Do you have
a version in which these ontologies are imported rather than merged? Perhaps
removing some of the unnecessary parts would still exhibit the problem without
requiring lots of memory.
Yevgeny
Original comment by ykazako...@gmail.com
on 30 Sep 2014 at 6:07
Hi Yevgeny, I will see if I can reproduce this with a smaller version. When I've
tested the same actions with smaller ontologies I had no problem, but I don't
know if it's the size or something about the content of this big ontology. I
don't really have a version with imports, because a lot of the content is
generated and then everything is saved together with the inputs into one file.
Original comment by balh...@gmail.com
on 1 Oct 2014 at 2:16
Hi, I ran the test on a machine with more memory, and it managed to finish the
second query. It just requires more memory (I allocated 20GB to Java, of which
it consumed about 16). The second query is rather slow, though. The deletion
stage takes about 30 seconds, followed by pruning and addition stages that take
a couple more minutes. Quite a lot of rules are applied during these stages,
which is the main problem. I can try to explain what happens.
So, your ontology seems to be a difficult one :-). The main difficulties are
caused by the object property chain axioms; you use 122 of them. I guess this
is how you integrate your databases together. If you imagine your ontology as a
kind of graph where classes are nodes and object properties are edges, then
these object property chains create new edges in this graph. The procedure in
ELK (roughly) iterates through all edges that match the chains and creates the
new edges. That is, if class C is connected via property R to class D (using an
ObjectSomeValuesFrom axiom), class D is connected via property S to class E,
and there is a role chain axiom SubObjectPropertyOf(ObjectPropertyChain(R S) H),
then this is a match for the property chain, and ELK will infer that C is
connected via property H to E. It turns out that for your ontology there can be
quite a lot of matches: over 3 billion. However, the number of new edges
produced this way is relatively small: only about 80 million. So the same edges
are produced many times by different matches. The problem is that ELK does not
see immediately that many of the edges are the same: it inserts all new edges
into a queue and processes them one by one recursively. This queue (full of
duplicate elements) can become quite large during the process, and it consumes
the majority of the memory.
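To make this concrete, here is a minimal sketch of the chain inference using
the OWL API with ELK (all class and property names are invented for
illustration; this is of course not your ontology):

    import java.util.Arrays;

    import org.semanticweb.elk.owlapi.ElkReasonerFactory;
    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.model.*;
    import org.semanticweb.owlapi.reasoner.OWLReasoner;

    public class ChainExample {
      public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = manager.getOWLDataFactory();
        OWLClass c = df.getOWLClass(IRI.create("http://example.org/C"));
        OWLClass d = df.getOWLClass(IRI.create("http://example.org/D"));
        OWLClass e = df.getOWLClass(IRI.create("http://example.org/E"));
        OWLObjectProperty r = df.getOWLObjectProperty(IRI.create("http://example.org/R"));
        OWLObjectProperty s = df.getOWLObjectProperty(IRI.create("http://example.org/S"));
        OWLObjectProperty h = df.getOWLObjectProperty(IRI.create("http://example.org/H"));
        OWLOntology ontology = manager.createOntology();
        // C SubClassOf (R some D): the edge C -R-> D
        manager.addAxiom(ontology,
            df.getOWLSubClassOfAxiom(c, df.getOWLObjectSomeValuesFrom(r, d)));
        // D SubClassOf (S some E): the edge D -S-> E
        manager.addAxiom(ontology,
            df.getOWLSubClassOfAxiom(d, df.getOWLObjectSomeValuesFrom(s, e)));
        // SubObjectPropertyOf(ObjectPropertyChain(R S) H)
        manager.addAxiom(ontology, df.getOWLSubPropertyChainOfAxiom(Arrays.asList(r, s), h));

        OWLReasoner reasoner = new ElkReasonerFactory().createReasoner(ontology);
        // The chain match produces the new edge C -H-> E, so C is returned
        // as a subclass of the complex expression (H some E):
        System.out.println(reasoner
            .getSubClasses(df.getOWLObjectSomeValuesFrom(h, e), false)
            .containsEntity(c)); // true
        reasoner.dispose();
      }
    }

In your ontology this matching happens for all 122 chain axioms over a much
larger graph, which is where the billions of matches come from.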
A similar thing happens during query answering. You can think of the query
answering process as if ELK temporarily adds a new node to the (completed)
graph, connects it to some other nodes in this graph according to the
ObjectSomeValuesFrom relations in the query, and continues to apply the rules,
in particular adding new edges according to the new matches of the object
property chains. Only the new matches need to be considered, and therefore this
process is typically much faster than applying the rules to the extended graph
from scratch.
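For example, such a query would be posed through the OWL API roughly as follows
(continuing the sketch above; someClass stands for some class from your
ontology, and the part_of IRI is my assumption):

    // Hypothetical sketch: BFO_0000050 is assumed to be the IRI of part_of in RO.
    OWLObjectProperty partOf =
        df.getOWLObjectProperty(IRI.create("http://purl.obolibrary.org/obo/BFO_0000050"));
    OWLClassExpression query = df.getOWLObjectSomeValuesFrom(partOf, someClass);
    // ELK adds a temporary node for this expression, re-saturates
    // incrementally, and returns the matching subclasses:
    NodeSet<OWLClass> result = reasoner.getSubClasses(query, true);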
What happens when the next query is asked is that ELK must first remove the
node created by the previous query and revert all edges added after it. The
main problem is that ELK does not remember which edges were added for this node
(remembering for each edge why it was added would consume way more memory).
Instead, ELK applies the rules for this node from the beginning, but deletes
the produced edges instead of adding them. This is exactly the deletion stage
that did not terminate in your example. Why did it not terminate? It happened
because ELK may have to delete more edges during the deletion stage than were
added before. The reason is that if, during the addition stage, an edge was
added that was already present in the graph, ELK did not have to apply further
rules to this edge (since the graph was already complete). During the deletion
stage, however, ELK does not know whether the produced edge was in the graph in
the first place or was added by this query. To be on the safe side, ELK deletes
this edge and everything derived from it, and later has to re-derive (repair)
the edges that were unnecessarily deleted. Here lies the main problem: because
there are many rules producing the same edges in many different ways, this
situation is very likely, and a large portion of the graph may be deleted to
revert the results of the query. In your example, you almost ran out of memory
because the queue for the deletion becomes too large, and the Java garbage
collector spends most of the time trying to free memory (I guess). I hope my
explanation makes sense to you :-)
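To make the deletion and repair stages more concrete, here is a toy sketch in
plain Java (not ELK's actual code) with a single chain rule
SubObjectPropertyOf(ObjectPropertyChain(R S) H): deleting the temporary query
edge over-deletes an edge that still has an independent derivation, and the
repair stage restores it:

    import java.util.*;

    public class DeleteAndRederive {

      record Edge(String from, String prop, String to) {}

      // Conclusions of the chain rule in which 'e' occurs as a premise,
      // pairing it with the other premises found in 'graph'.
      static Set<Edge> conclusionsUsing(Edge e, Set<Edge> graph) {
        Set<Edge> out = new HashSet<>();
        if (e.prop().equals("R"))
          for (Edge b : graph)
            if (b.prop().equals("S") && b.from().equals(e.to()))
              out.add(new Edge(e.from(), "H", b.to()));
        if (e.prop().equals("S"))
          for (Edge a : graph)
            if (a.prop().equals("R") && a.to().equals(e.from()))
              out.add(new Edge(a.from(), "H", e.to()));
        return out;
      }

      // Apply the rule until no new edges appear (saturation).
      static void saturate(Set<Edge> graph) {
        boolean changed = true;
        while (changed) {
          changed = false;
          for (Edge e : new ArrayList<>(graph))
            changed |= graph.addAll(conclusionsUsing(e, graph));
        }
      }

      public static void main(String[] args) {
        Set<Edge> graph = new HashSet<>(List.of(
            new Edge("C", "R", "D"), new Edge("D", "S", "E"),
            new Edge("D2", "S", "E")));
        saturate(graph); // derives (C, H, E) via D

        // The "query" temporarily adds an edge; (C, H, E) now also has a
        // second derivation via D2, but it is not a new edge.
        Edge queryEdge = new Edge("C", "R", "D2");
        graph.add(queryEdge);
        saturate(graph);

        // Deletion stage: delete the query edge and, recursively, every
        // conclusion it participates in, without checking for alternative
        // derivations.
        Deque<Edge> toDelete = new ArrayDeque<>(List.of(queryEdge));
        while (!toDelete.isEmpty()) {
          Edge e = toDelete.pop();
          if (!graph.contains(e)) continue;
          toDelete.addAll(conclusionsUsing(e, graph));
          graph.remove(e);
        }
        System.out.println(graph.contains(new Edge("C", "H", "E"))); // false: over-deleted

        // Repair stage: re-saturate what is left, restoring the edges that
        // were deleted unnecessarily.
        saturate(graph);
        System.out.println(graph.contains(new Edge("C", "H", "E"))); // true again
      }
    }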
At the moment, I think, there is no easy fix for this problem. I do not think
it is a bug. Currently DL queries are answered using the ELK incremental
procedure by adding and removing temporary axioms, which trigger the rules for
additions and deletions. In most cases, incremental changes can be recomputed
very fast (as you noted, you could not reproduce the problem in other cases),
but in some cases deletion can trigger a "chain reaction" in which a large part
of the saturation is deleted. It is difficult to understand why this happens in
this particular instance. Perhaps the object property "part_of" used in the
query is very common and participates in many chain axioms. I might take
another look at the log traces, but it is difficult to understand what is
happening for such a large ontology.
As a possible remedy, we may consider another scheme of query answering in
which we store the (temporary) conclusions for the query separately and discard
them after each query. But I am afraid we have to postpone this feature until
after the next release. I will leave this issue open in the meantime.
Best regards,
Yevgeny
Original comment by ykazako...@gmail.com
on 1 Oct 2014 at 8:26
Thank you for this very informative and understandable response. I am currently
working around the issue by creating a named class for each query and adding an
equivalence axiom before querying. Then the query returns quickly, and I never
delete the added class. This is okay as long as I periodically restart the
application (it is a long-running web application that answers queries using
ELK).
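For reference, the workaround looks roughly like this (a sketch with invented
names; df, manager, ontology, and reasoner are the usual OWL API objects, and
queryExpression is the class expression I want to query):

    // Define a fresh named class and make it equivalent to the query expression.
    OWLClass queryClass = df.getOWLClass(IRI.create("http://example.org/Query1"));
    manager.addAxiom(ontology,
        df.getOWLEquivalentClassesAxiom(queryClass, queryExpression));
    reasoner.flush(); // applies the pending addition incrementally;
                      // the costly deletion stage is never triggered
    // Query the named class; the definition is never removed afterwards.
    NodeSet<OWLClass> results = reasoner.getSubClasses(queryClass, true);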
I will also see if I can exclude some property axioms. Currently I just include
all of <http://purl.obolibrary.org/obo/ro.owl>, but it is not all necessary.
Original comment by balh...@gmail.com
on 2 Oct 2014 at 2:35
Hi, just a small update: I have experimented with a new strategy of rule
application, and I think it might solve the performance issue with the query (I
think I understood more or less what is happening). I will try to produce a
working prototype over this weekend. Maybe I will ask you to test it.
Best regards,
Yevgeny
Original comment by ykazako...@gmail.com
on 3 Oct 2014 at 10:56
Hi, could you please try the attached version of ELK and see whether you have
any problems with any of your queries, or any other issues? I tested it on your
example and it seems to work fine.
Best regards,
Yevgeny
Original comment by ykazako...@gmail.com
on 7 Oct 2014 at 9:13
Attachments:
Great, I will give it a try as soon as I get a chance.
Original comment by balh...@gmail.com
on 8 Oct 2014 at 2:02
I have just had a chance to give it a try. I find that this version uses a
little more memory than 0.4.1: I allocate 12 GB, and our dataset has usually
used up to 8 or 9 GB after classifying with ELK, while this one used over 10 GB.
It performed better for 3 queries, returning in under 1 second, but on the next
one it used quite a bit of CPU and the memory usage increased. It then hit the
12 GB limit and I presume got bogged down in garbage collection (I've been
running the query for several minutes now).
So it seems better, but so far I don't think it is a complete fix. The version
of our application that adds class expressions for queries and never removes
them gives great performance, with each query returning immediately.
Original comment by balh...@gmail.com
on 27 Oct 2014 at 4:32
Hi, please try the new version attached. The memory consumption should now be
improved.
Regarding the performance with your new queries, could you paste them here? No
need to provide the full code, just the class expressions for the queries. Can
you see whether it gets stuck at the same stage (Incremental Deletion)?
Original comment by ykazako...@gmail.com
on 27 Oct 2014 at 6:55
Attachments:
Original issue reported on code.google.com by
balh...@gmail.com
on 23 Sep 2014 at 4:20