pemistahl / lingua

The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike
Apache License 2.0

OutOfMemoryError #191

Closed hasandiwan closed 5 months ago

hasandiwan commented 9 months ago

When I issue the curl command:

curl -vvv https://units.d8u.us/language -d "bonjour tout le monde"

To an @PostMapping endpoint, defined as:

@PostMapping(value = "/language")
public JsonNode language(@RequestBody String text, HttpServletRequest request) {
    ObjectNode ret = mapper.createObjectNode();
    Long start = System.currentTimeMillis();
    com.github.pemistahl.lingua.api.LanguageDetector detector = LanguageDetectorBuilder.fromAllLanguages().build();
    HttpUrl url = HttpUrl.parse(text);
    if (null != url) {
        Map<String, String> htmlParams = Maps.newHashMap();
        htmlParams.put("url", text);
        text = this.html(htmlParams, null, request).get("text").asText();
    }
    Language detected = detector.detectLanguageOf(text);
    ret.put("language", UCharacter.toTitleCase(detected.toString(), BreakIterator.getTitleInstance()));
    ret.put("time", System.currentTimeMillis() - start);
    return ret;
}

I'm expecting the JSON to contain {"language": "French"}, but I get an OutOfMemoryError... what gives?

pemistahl commented 9 months ago

As mentioned in the docs, loading all language models requires between 3.5 and 4.0 GB of memory. They are loaded into HashMaps which consume a lot of memory.
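If you do not actually need all supported languages, one way to keep memory usage far below that figure is to build the detector from a restricted language set using lingua's public builder API. A minimal sketch (the particular choice of languages here is illustrative, not taken from the original code):

```java
import com.github.pemistahl.lingua.api.Language;
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;

public class RestrictedDetectorExample {
    public static void main(String[] args) {
        // Load only the models you actually need instead of all of them;
        // each language model adds to the heap footprint.
        LanguageDetector detector = LanguageDetectorBuilder
            .fromLanguages(Language.ENGLISH, Language.FRENCH, Language.GERMAN)
            .build();

        // Detection works the same as with the full detector.
        Language detected = detector.detectLanguageOf("bonjour tout le monde");
        System.out.println(detected); // FRENCH
    }
}
```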

In your method, you are recreating the LanguageDetector on every request. This is costly, completely unnecessary, and most likely what produces the OutOfMemoryError. Create the LanguageDetector instance only once, in a global place, and reuse it throughout your application. This should fix your problem.
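In a Spring application like the one in the question, the usual way to follow this advice is to expose the detector as a singleton bean, so the models are loaded once at startup and shared by every request. A sketch under that assumption (the class and bean names are illustrative, not part of the original code):

```java
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class LanguageDetectorConfig {

    // Built exactly once at application startup. Spring beans are
    // singletons by default, so every controller that injects a
    // LanguageDetector shares this single instance.
    @Bean
    public LanguageDetector languageDetector() {
        return LanguageDetectorBuilder.fromAllLanguages().build();
    }
}
```

The @PostMapping handler would then take the detector via constructor injection and call detectLanguageOf on the shared instance, instead of calling LanguageDetectorBuilder inside the request method.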

pemistahl commented 5 months ago

Since you have not replied, I'm assuming that your memory problems have been fixed. That's why I'm closing this issue now. Feel free to re-open it if my assumption is wrong. Thank you.