hfst / hfst-ospell

HFST spell checker library and command line tool
Apache License 2.0

hfst-ospell leaks memory #27

Open snomos opened 8 years ago

snomos commented 8 years ago

It seems that hfst-ospell is leaking memory. When run against a list of 12.5k typos, it starts out at about 130 MB of memory usage and ends up at almost 800 MB before it is done. Memory use increases slowly throughout the processing. See the attached screenshots of the memory consumption at different intervals. The zhfst file used is attached (rename .zip to .zhfst).

memory-leak-early

memory-leak-late

Tested against latest hfst-ospell code on macOS Sierra 10.12.1.

se.zip

snomos commented 8 years ago

The typos file used as input can be found here.

Traubert commented 7 years ago

For some reason I can't get Valgrind to report memory leaks usefully for ospell, but the leaks it does report are small overall. I think this isn't exactly a leak, but an effect of the cache slowly building up. The error model + lexicon are rich enough in this case that a cache entry is tens of megabytes on average, and as correction encounters new first symbols, the cache gets bigger and bigger. E.g. you should see different behaviour if you sort the typos first vs. if you only have one type per first character.

Perhaps the system should have some sort of cache priority order, bumping off older cache entries once the cache grows beyond some size?
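For illustration, here is a minimal sketch of that idea: a cache keyed by first symbol where, once the cache grows past a fixed number of entries, the oldest one is bumped off. The class, type, and function names here are hypothetical and are not hfst-ospell's actual data structures.

```cpp
#include <deque>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for whatever a cache entry holds
// (in practice: precomputed correction state, potentially tens of MB).
using CacheEntry = std::vector<std::string>;

class FirstSymbolCache {
public:
    explicit FirstSymbolCache(size_t max_entries) : max_entries_(max_entries) {}

    // Look up the entry for a first symbol, building it on a miss.
    // Once the cache exceeds max_entries_, the oldest entry is dropped,
    // which is the "bump off older entries" policy suggested above.
    CacheEntry& get(const std::string& first_symbol) {
        auto it = cache_.find(first_symbol);
        if (it == cache_.end()) {
            it = cache_.emplace(first_symbol, build_entry(first_symbol)).first;
            insertion_order_.push_back(first_symbol);
            if (cache_.size() > max_entries_) {
                cache_.erase(insertion_order_.front());
                insertion_order_.pop_front();
            }
        }
        return it->second;
    }

private:
    // Placeholder for the expensive traversal that would fill a cache entry.
    CacheEntry build_entry(const std::string&) { return CacheEntry(); }

    size_t max_entries_;
    std::map<std::string, CacheEntry> cache_;
    std::deque<std::string> insertion_order_;  // oldest first
};
```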

unhammer commented 7 years ago

Yeah, if I grep for only words starting with 'a', memory usage stays consistent at ~126000 at least up to 100k words.

On regular input, memory usage flattens out as input grows, which is also consistent with the only "leak" being the cache on first symbols.

I tried e.g. clearing some random entries (whether .empty or not) on every correction, which gives slower memory growth at the cost of some speed loss (50% slower on 10k words when removing 3 random entries). Adding a use counter and strictly capping the number of cached entries to 10 first symbols with plain LRU, the runtime is 118s vs 50s, with 486M max RES vs 712M max RES on a run of 10k words. With zero cache, we still use 414M on 10k words, so compared to that, capping it at 10 first symbols doesn't seem too bad. (I also tried LRU with 2 random entries, but there was basically no speed difference, probably because the list of cached entries is so short.)
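A minimal sketch of the plain-LRU variant described above (again with hypothetical names, not anything from the actual branch): entries are keyed by first symbol, a hit moves the entry to the front, and the least recently used entry is evicted once the cap (10 in the experiment above) is exceeded.

```cpp
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

using CacheEntry = std::vector<std::string>;  // hypothetical payload

class LruFirstSymbolCache {
public:
    explicit LruFirstSymbolCache(size_t capacity) : capacity_(capacity) {}

    CacheEntry& get(const std::string& first_symbol) {
        auto it = index_.find(first_symbol);
        if (it != index_.end()) {
            // Cache hit: move the entry to the front (most recently used).
            order_.splice(order_.begin(), order_, it->second);
            return it->second->second;
        }
        // Cache miss: build the entry and insert it at the front.
        order_.emplace_front(first_symbol, build_entry(first_symbol));
        index_[first_symbol] = order_.begin();
        if (order_.size() > capacity_) {
            // Evict the least recently used first symbol.
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        return order_.front().second;
    }

private:
    // Placeholder for the expensive traversal that would fill a cache entry.
    CacheEntry build_entry(const std::string&) { return CacheEntry(); }

    size_t capacity_;
    std::list<std::pair<std::string, CacheEntry>> order_;  // MRU at front
    std::unordered_map<std::string,
        std::list<std::pair<std::string, CacheEntry>>::iterator> index_;
};
```

The list-plus-index layout is only one possible choice; it keeps both the "touch on use" and the eviction at O(1), which matters little here since the number of cached first symbols is so small, but it keeps the sketch simple.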

However, I'm pretty sure .clear() can't be clearing everything, since sending 10k words starting with 'a' gives much lower memory usage – anyone have an idea why? https://github.com/hfst/hfst-ospell/compare/master...unhammer:LRU-cache?expand=1