inarahd / jwktl

Automatically exported from code.google.com/p/jwktl

Extreme memory consumption while iterating over multiple entries. #5

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Trying to iterate over a whole Wiktionary database, or even a small portion of 
it, fails with an OutOfMemoryError (Java heap space). This occurs even when just 
writing some of the data to the console (so there is no memory consumption 
besides the library usage). It seems like there is a severe memory leak 
somewhere, either in JWKTL or in BerkeleyDB.

Of course, a quick fix would be to increase the memory available to the 
application (using the VM options -Xms and -Xmx). But first of all, the 
OutOfMemoryError occurs even after iterating over only a small portion of a 
Wiktionary, which suggests an unreasonably high memory demand for iterating over 
a whole Wiktionary. Secondly, it still looks like some kind of memory leak, so it 
should be possible to iterate over the whole Wiktionary without increasing the 
heap space at all. Or is there any possibility in JWKTL to clear cached data 
while iterating?

What steps will reproduce the problem?
1. Here is an example test method that tries to extract all German example 
sentences from the German and the English Wiktionary.

    @Test
    public void testGetAllExampleSentences() throws Exception {
        int counter = 0;
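        // german and english point to the parsed German and English Wiktionary
        // databases (their definitions are not shown in this snippet).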
        IWiktionary wkt = JWKTL.openCollection(german, english);
        IWiktionaryIterator<IWiktionaryEntry> allEntries = wkt.getAllEntries();
        for (IWiktionaryEntry entry : allEntries) {
            ILanguage language = entry.getWordLanguage();
            if (language != null && language.getName().equals("German")) {
                List<IWikiString> examples = entry.getExamples();
                for (IWikiString example : examples) {
                    String plainText = example.getPlainText();
                    System.out.println(plainText);
                    counter++;
                }
            }
        }
        wkt.close();
        System.out.println(counter);
    }

What is the expected output? What do you see instead?

After 3300 sentences, the output slows down extremely, and shortly afterwards 
the OutOfMemoryError mentioned above is thrown.

What version of the product are you using? On what operating system?

I use the official JWKTL version 1.0.0 from the Central Maven Repository, 
declared via this dependency:
        <dependency>
            <groupId>de.tudarmstadt.ukp.jwktl</groupId>
            <artifactId>jwktl</artifactId>
            <version>1.0.0</version>
        </dependency>

Thank you for your help. Besides that, thank you very much for providing this 
library. It helps tremendously! Please continue providing such great libraries 
and tools.
Best wishes,
Andreas

Original issue reported on code.google.com by andreas....@googlemail.com on 4 Dec 2013 at 11:54

GoogleCodeExporter commented 9 years ago
I don't know the exact cause. At least, I did similar things previously and 
didn't have this problem. Try the following: 

(1) Query the English and German Wiktionary editions separately using 

IWiktionaryEdition edition = JWKTL.openEdition..

and check whether the error is the same or whether there is a problem in the 
collection code (I haven't used that very often recently).

(2) Set an explicit cache size, e.g., 

wiktionary = JWKTL.openEdition(wiktionaryPath, 500 * 1024 * 1024L);

Sometimes, I have the impression that BerkeleyDB does strange things with the 
available memory...
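
For illustration, a minimal sketch that combines both suggestions; germanDump 
and englishDump are placeholder File handles to the parsed Wiktionary databases, 
and the entry processing is elided:

    // Sketch: open each edition separately (suggestion 1) with an explicit
    // BerkeleyDB cache size (suggestion 2) instead of using openCollection.
    for (File dump : new File[] { germanDump, englishDump }) {
        IWiktionaryEdition edition = JWKTL.openEdition(dump, 500 * 1024 * 1024L);
        try {
            for (IWiktionaryEntry entry : edition.getAllEntries()) {
                // ... process the entry as in the test method above ...
            }
        } finally {
            edition.close();
        }
    }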

Original comment by chmeyer.de on 4 Dec 2013 at 3:56

GoogleCodeExporter commented 9 years ago
I tried both of your approaches in my specific scenario:

(1) Doesn't work: when opening the two Wiktionaries separately via 
JWKTL.openEdition, instead of openCollection(...), but switching between them 
after each entry, the OutOfMemoryError can still be provoked. Reading the 
editions in sequential order works, but doesn't fit my problem.

(2) With 500 MB or even 200 MB, the OutOfMemoryError still occurs. I was able 
to iterate over the two Wiktionaries with an ideal value of 50 MB. I also tested 
100 MB, 1 MB and even less, and all of these cache sizes worked for me, but the 
50 MB configuration is actually the fastest one in my current setup. [Does 
anyone have other experiences or more values to add?]

In short: the problem still exists, but it can be avoided with a sufficiently 
small explicit cache size.
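
For reference, a minimal sketch of the workaround described above; wiktionaryPath 
is a placeholder File pointing to a parsed database:

    // Sketch: an explicit 50 MB cache avoided the OutOfMemoryError in this
    // setup, while 200 MB and 500 MB still ran out of heap space.
    IWiktionaryEdition edition = JWKTL.openEdition(wiktionaryPath, 50 * 1024 * 1024L);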

Original comment by andreas....@googlemail.com on 10 Dec 2013 at 4:44