pemistahl / lingua-rs

The most accurate natural language detection library for Rust, suitable for short text and mixed-language text
Apache License 2.0

`detect_language_of` hangs forever? #342

Closed. BWStearns closed this issue 4 months ago.

BWStearns commented 4 months ago

When calling `detect_language_of` on my ECS service, it appears to get stuck and hang forever. Interestingly, this doesn't happen locally. I've gated the call with log statements:

        event!(Level::INFO, "OpenAI Worker: Determining Language for: {:?}", args.input_text);
        // Determine the language of the input
        let detector = LanguageDetectorBuilder::from_all_languages().build();
        event!(Level::INFO, "OpenAI Worker: Detector created");
        let lang = detector.detect_language_of(&args.input_text);
        event!(Level::INFO, "OpenAI Worker: Language detected");

In the logs I end up with the following, and then my worker just hangs forever. It doesn't panic as far as I can tell.

2024-04-28T16:16:19.123011Z  INFO flewent::workers::open_ai_worker: OpenAI Worker: Performing job 73
2024-04-28T16:16:19.123025Z  INFO flewent::workers::open_ai_worker: OpenAI Worker: Determining Language for: "Wie viel kostet die Buch?"
2024-04-28T16:16:19.123207Z  INFO flewent::workers::open_ai_worker: OpenAI Worker: Detector created

It never gets to "Language detected". Any ideas on why it might be getting stuck? Each container only has 512 MB of memory and 256 CPU units. Does it need a bigger container?

Edit to add: It seems to happen especially with short inputs, and with German and English in particular. Short French and Russian inputs came back fine, as did sufficiently long inputs. Some examples are below (there is a spelling error, but my use case is a writing assistant, so hopefully it doesn't hang if there are typos).

Warum konnen wit nicht? Wie viel kostet die kekse?

pemistahl commented 4 months ago

Please read the docs carefully, especially sections 10.4 and 10.5 in the readme. Your service tries to load all language models into memory on the first call to `detect_language_of()`. It is better to preload the language models before doing the first classification:

let detector = LanguageDetectorBuilder::from_all_languages().with_preloaded_language_models().build();

However, in your case this will fail as well because you don't have enough memory. You need at least 1 GB of RAM to load all language models. An alternative would be to restrict the set of languages to be loaded.
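
For example, a minimal sketch of a restricted detector (the language set here is only an illustration; pick whichever languages your inputs actually use):

    use lingua::{Language, LanguageDetector, LanguageDetectorBuilder};
    use lingua::Language::{English, French, German, Russian};

    // Restrict the detector to the languages the service actually expects,
    // so only those models are loaded instead of all of them.
    let detector: LanguageDetector = LanguageDetectorBuilder::from_languages(&[English, German, French, Russian])
        .with_preloaded_language_models()
        .build();

    let lang = detector.detect_language_of("Wie viel kostet die Buch?");
    // lang should be Some(Language::German) for this input.

With only a few models loaded, both the memory footprint and the preloading time drop considerably.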

BWStearns commented 4 months ago

Ahhh ok. Sorry about the confusion! I missed that section.