optimaize / language-detector

Language Detection Library for Java
Apache License 2.0
567 stars 165 forks source link

Library leaks memory #66

Closed edudar closed 7 years ago

edudar commented 7 years ago

Here's what memory profile looks like before the library was introduced to the service:

screen shot 2016-09-27 at 12 15 41 pm

As the library is not thread-safe I use a pool of 8 instances. Here's what a period between server restarts looks like. There's clear initial memory grab when detectors are initialized. Then periods of constant memory with occasional slides down. I have 4G Xmx with 7.5G total server memory.

screen shot 2016-09-27 at 12 15 18 pm
fabiankessler commented 7 years ago

The library is meant to be thread safe. The readme says in the changes from fork section:

LanguageDetector is now safe to use multi-threaded. Clear code to safely load profiles and use them, no state in static fields.

Do you have multi-threading issues with this library version?

The language profiles are what grab the memory initially. The LanguageProfileImpl is immutable, and that means thread safe. If you have to create multiple language detectors with the LanguageDetectorBuilder then you should try re-using the same language profile instances. That would mean you'd only use the same amount of memory, not 8x the memory.

Everything in the lib is immutable and state is garbage collected after running one detection. I suspect that the memory leak has to do with your integration, or with other code. Of course I can be wrong. Can you show some code? Can you reproduce thread safety or memory leak issues in a test case?

fabiankessler commented 7 years ago

It's the NgramFrequencyData object that is used when running. It is immutable also. The language profiles (LanguageProfileImpl) are only an intermediary step when constructing the detector.

edudar commented 7 years ago

@fabiankessler I've worked with the lib later on and here are couple findings I did.

1) A multi-threading restriction was apparently passed by Tika's wrapper around language-detector. We use Tika in other places so it was a natural choice. After I switched to bare metal version it looks fine with one instance only for all threads.

2) After monitoring memory behavior for couple days I can say that it's not a memory leak per se but rather weird memory consumption pattern. Memory never went below the bottom line as it's shown in the screenshot. The path to that state varies from rapidly down to gradually and slowly. The latter one surely looks like a memory leak but it never passed that 22-23% line. Switching to proper multi-threaded version should help here I suppose. Sorry for the confusion caused.