pemistahl / lingua-go

The most accurate natural language detection library for Go, suitable for short text and mixed-language text
Apache License 2.0

Library size optimization #26

Closed (jeffphp closed 1 year ago)

jeffphp commented 1 year ago

Thanks for this very efficient library (it's the best I've tested so far).

Unfortunately, I struggle with its size: whatever parameters I choose, it adds around 120 MiB to my app (which is 50 MiB, assets included). Since I am deploying on Kubernetes, the size of the Docker image matters.

I am only interested in checking a few languages, but it seems that whatever languages or options I choose, the whole package is still compiled in. Maybe I am missing something...

If not, it would be nice to be able to provide the languages as imports (lingua.English, ..., lingua.Languages) in order to keep the binary small.

pemistahl commented 1 year ago

Hi Jeff, thank you for the nice words. :) I'm happy that my library is as useful to you as I hoped it would be.

All language models are embedded into the compiled binary for convenience. Unfortunately, Go does not (yet) support conditional embedding which could solve your problem.

I suggest the following (untested but should work):

  1. Clone this repository to your local computer.
  2. Remove all language models you don't need from the directory language-models.
  3. Build the binary manually and add it to your Docker image.
  4. At runtime, build the language detector from exactly the set of languages you kept. This way, the detector won't try to access the language model files you have removed.

Please let me know whether this approach works for you.

jeffphp commented 1 year ago

Excellent idea, thanks for that. I will definitely try to do this and let you know how it goes!

jeffphp commented 1 year ago

Thanks, I am back to 60 MiB :)

In case it helps, I use the following library to connect through OAuth2: github.com/markbates/goth

The developer chose to provide the different connectors (providers) as separate packages that you can import (or not). If you only need Twitter and Google as providers, you just import these two: https://github.com/markbates/goth/blob/master/examples/main.go

Go only compiles the packages that are actually imported, so providers you don't use never end up in the binary. But it's true that this is a bit less convenient for people who need the whole thing.
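Applied to language models, the opt-in idea could look roughly like the sketch below. Everything in it is hypothetical and collapsed into one file so it runs on its own: `Model`, `Detector`, `NewDetector`, `English` and `German` are illustration names, not lingua-go's API. In a real layout each constructor would live in its own package next to its data, and only the imported ones would be compiled in.

```go
package main

import "fmt"

// Model is a hypothetical stand-in for the data of one language.
type Model struct {
	Code string
	Data []byte
}

// English and German stand in for constructors that would each live in their
// own package (hypothetical paths such as models/en and models/de), so that a
// program only pays for the packages it imports.
func English() Model { return Model{Code: "en", Data: []byte("english n-grams")} }
func German() Model  { return Model{Code: "de", Data: []byte("german n-grams")} }

// Detector only knows the models that were passed to NewDetector.
type Detector struct {
	models map[string]Model
}

// NewDetector builds a detector from an explicit list of models.
func NewDetector(models ...Model) *Detector {
	d := &Detector{models: make(map[string]Model, len(models))}
	for _, m := range models {
		d.models[m.Code] = m
	}
	return d
}

// Supports reports whether a model for the given language code was supplied.
func (d *Detector) Supports(code string) bool {
	_, ok := d.models[code]
	return ok
}

func main() {
	// Only English is requested, so German is never referenced here.
	d := NewDetector(English())
	fmt.Println(d.Supports("en"), d.Supports("de")) // prints "true false"
}
```

The trade-off is the one mentioned above: users who want every language would have to import and list each one, unless a convenience package that pulls in all of them is offered as well.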

pemistahl commented 1 year ago

Glad to know that my approach works for you. :)

Having each language as a separate module is possible, of course, but pretty cumbersome in Go. I think I won't follow this approach for this library. It would be great if Go supported conditional compilation in the same way as Rust does. Maybe, some day...