renepickhardt / generalized-language-modeling-toolkit

Generalized Language Modeling toolkit
http://glm.rene-pickhardt.de

Aggregator runs out of memory #31

Closed renepickhardt closed 9 years ago

renepickhardt commented 10 years ago

This is almost certainly because the hash maps are running out of memory. Here are some possible fixes:

1.) Build a separate index for POS and use it instead of the word index.
2.) Write small sorted file chunks and combine them with an external n-way merge.
3.) Increase maxCountDivider (painful: it forces a recalculation of the sequencer).

Regarding 1.): another option would be to switch from strings to int tokens, since that would also make WordIndex.rank() much faster.
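A minimal sketch of what int tokens could look like, assuming whitespace-separated tokens. The class `TokenIds` and its methods are hypothetical names for illustration and are not part of the toolkit; how they would plug into WordIndex is left open.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not toolkit code): maps each distinct token to a dense int id. */
public class TokenIds {
    private final Map<String, Integer> idByToken = new HashMap<>();
    private final List<String> tokenById = new ArrayList<>();

    /** Returns the id of the token, assigning the next free id on first sight. */
    public int idOf(String token) {
        Integer id = idByToken.get(token);
        if (id == null) {
            id = tokenById.size();
            idByToken.put(token, id);
            tokenById.add(token);
        }
        return id;
    }

    /** Reverse lookup from id to the original token string. */
    public String tokenOf(int id) {
        return tokenById.get(id);
    }

    /** Encodes a whitespace-separated sequence as an array of token ids. */
    public int[] encode(String sequence) {
        String[] tokens = sequence.trim().split("\\s+");
        int[] ids = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            ids[i] = idOf(tokens[i]);
        }
        return ids;
    }
}
```

Each distinct string is stored once, and every sequence becomes an `int[]`, so comparisons work on ints instead of strings. Whether this helps enough depends on how many distinct tokens the corpus has.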

lschmelzeisen commented 10 years ago

From #37

For big data sets the aggregator does not run to completion. Currently the program does not terminate; it just appears to hang. I guess the JVM can't allocate memory for bigger hash maps, so Java keeps trying to fill up the existing hash map and spends its time resolving collisions.

There are several fixes. If memory in the aggregator is critical, decrease the number of parallel tasks. Alternatively, change the aggregation workflow so that it first writes several smaller sorted files and then produces the final version with an n-way merge over them.
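The n-way merge idea could be sketched roughly as follows. This is not the toolkit's implementation. It assumes each chunk file holds `sequence<TAB>count` lines sorted by `String.compareTo` order, and the class name `ChunkMerger` and the file format are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: merges sorted "key<TAB>count" chunks, summing counts per key. */
public class ChunkMerger {
    /** One open chunk plus the line it is currently positioned on. */
    private static final class Cursor {
        final BufferedReader reader;
        String key;
        long count;

        Cursor(BufferedReader reader) {
            this.reader = reader;
        }

        /** Reads the next line into key/count; returns false at end of file. */
        boolean advance() throws IOException {
            String line = reader.readLine();
            if (line == null) {
                return false;
            }
            int tab = line.lastIndexOf('\t');
            key = line.substring(0, tab);
            count = Long.parseLong(line.substring(tab + 1));
            return true;
        }
    }

    public static void merge(List<Path> chunks, Path out) throws IOException {
        // The heap holds one cursor per chunk, ordered by its current key.
        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> a.key.compareTo(b.key));
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                BufferedReader reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                readers.add(reader);
                Cursor cursor = new Cursor(reader);
                if (cursor.advance()) {
                    heap.add(cursor);
                }
            }
            String currentKey = null;
            long total = 0;
            while (!heap.isEmpty()) {
                Cursor cursor = heap.poll();
                if (!cursor.key.equals(currentKey)) {
                    // Smallest key changed, so the previous key's total is final.
                    if (currentKey != null) {
                        writeEntry(writer, currentKey, total);
                    }
                    currentKey = cursor.key;
                    total = 0;
                }
                total += cursor.count;
                if (cursor.advance()) {
                    heap.add(cursor);
                }
            }
            if (currentKey != null) {
                writeEntry(writer, currentKey, total);
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
    }

    private static void writeEntry(BufferedWriter writer, String key, long total) throws IOException {
        writer.write(key + "\t" + total);
        writer.newLine();
    }
}
```

Only one line per chunk is held in memory at a time, so memory use grows with the number of chunks rather than with the number of distinct sequences. The chunks themselves would have to be produced by aggregating a bounded number of sequences in memory, sorting them, and writing them out.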

lschmelzeisen commented 9 years ago

I believe this is fixed. If you run into OutOfMemoryErrors again, please open a new bug.