metafacture / metafacture-documentation

The central place for documentation about metafacture
http://metafacture.github.io/metafacture-documentation/
Apache License 2.0

use of collect / count and sort triple commands #3

Closed guenterh closed 4 years ago

guenterh commented 8 years ago

Problem

We ran into trouble because of insufficient memory and had to split the data into smaller sets for processing. This is not only cumbersome but also gives wrong results, because the analysis has to be done on the complete data set.

Question

I have seen that there is a mechanism triggered by a flag called 'memorylow' https://github.com/culturegraph/metafacture-core/blob/master/src/main/java/org/culturegraph/mf/stream/pipe/sort/AbstractTripleSort.java#L99 which makes it possible to swap triples to the file system as a temporary store.

Thanks for any hints - Günter!

cboehme commented 8 years ago

Hi Günter,

the mechanism for memory management in TripleSort and TripleCollect is fully implemented. The memoryLow() method is not called by the user; it is invoked by the MemoryWarningSystem every time more than 80% of the memory available in the JVM is used. The next time process() is called, it checks whether memoryLow() was invoked, and if so, the current list of triples is saved to disk. This mechanism works automatically and does not need to be configured explicitly.
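To illustrate the deferred behaviour described above, here is a minimal sketch. It is not the actual AbstractTripleSort code: the class SpillingBuffer, the use of plain strings instead of triples, and the temp-file naming are assumptions made for the example. Only the method names memoryLow() and process() are taken from the description.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of "flag now, spill on the next process() call".
final class SpillingBuffer {

    private final List<String> buffer = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();

    // Written by the memory-warning thread, read by the processing thread.
    private volatile boolean memoryLow;

    // Invoked by the warning system, not by the user.
    // It only records the warning; no memory is freed here.
    void memoryLow() {
        memoryLow = true;
    }

    // Invoked for every incoming item. The buffer is spilled only here,
    // so memory stays occupied until the next call arrives.
    void process(String item) {
        if (memoryLow) {
            spill();
            memoryLow = false;
        }
        buffer.add(item);
    }

    // Sorts the in-memory items and writes them to a temporary file.
    private void spill() {
        if (buffer.isEmpty()) {
            return;
        }
        Collections.sort(buffer);
        try {
            Path file = Files.createTempFile("triples-", ".tmp");
            file.toFile().deleteOnExit();
            Files.write(file, buffer, StandardCharsets.UTF_8);
            spillFiles.add(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buffer.clear();
    }

    int bufferedCount() {
        return buffer.size();
    }

    int spillFileCount() {
        return spillFiles.size();
    }

    public static void main(String[] args) {
        SpillingBuffer sorter = new SpillingBuffer();
        sorter.process("b");
        sorter.process("a");
        sorter.memoryLow();   // warning arrives, nothing is freed yet
        sorter.process("c");  // the two buffered items are spilled here
        System.out.println(sorter.bufferedCount() + " buffered, "
                + sorter.spillFileCount() + " spill file(s)");
    }
}
```

In this sketch the buffer still holds both items after memoryLow() returns; they are only written out when the third item is processed.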

The idea behind this logic was to enable the user to sort arbitrarily large data sets without having to tinker with the memory settings of the JVM. However, it has turned out that the mechanism is not as tinker-free as we had hoped.

The main issue seems to be that AbstractTripleSort does not free its memory immediately when the 80% threshold is reached, but only once the process() method is called. If the remaining 20% of memory is not enough to fulfill all memory allocations before process() is called the next time, the JVM will throw an OutOfMemoryError. We have encountered this problem as well and have not yet found a good solution for it. What helps as a workaround is to increase the JVM memory so that, once the 80% threshold is reached, more memory in absolute numbers is still available.
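The arithmetic behind this workaround can be sketched as follows. The class HeapHeadroom and its method are hypothetical helpers written for this example; the only figure taken from the description above is the 80% threshold.

```java
// Hypothetical helper: how many bytes remain once the warning threshold is hit.
final class HeapHeadroom {

    // Returns the bytes still free when thresholdPercent of the heap is in use.
    // Dividing before multiplying avoids overflow for very large heap sizes,
    // at the cost of rounding down by less than 100 bytes.
    static long headroomBytes(long maxHeapBytes, int thresholdPercent) {
        return maxHeapBytes / 100 * (100 - thresholdPercent);
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        long headroom = headroomBytes(maxHeap, 80);
        System.out.println("max heap: " + maxHeap + " bytes, headroom after 80%: "
                + headroom + " bytes");
    }
}
```

The percentage stays the same, but the absolute headroom grows with the heap: starting the JVM with a larger maximum heap (for example via the standard -Xmx option) leaves more room for allocations made between the warning and the next process() call.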

If you have an idea for a better implementation of the automatic memory management, we would be glad to change the current implementation.

Best, Christoph

guenterh commented 8 years ago

Thanks for this background information, Christoph! First, I will try again in the way you described. Second, if that still doesn't solve our problem, I'm going to think about it.

dr0i commented 7 years ago

Hello @guenterh, what is the state of this issue? Could you solve it following the hints from @cboehme? Can we close this issue?

dr0i commented 4 years ago

Closing.