pemistahl / lingua-rs

The most accurate natural language detection library for Rust, suitable for short text and mixed-language text
Apache License 2.0

Reduce resources to load language models #121

Open pemistahl opened 1 year ago

pemistahl commented 1 year ago

Currently, the language models are parsed from JSON files and loaded into simple maps at runtime. Although lookups in the maps are fast, the maps consume a significant amount of memory. The goal is to investigate whether more suitable data structures exist that need less memory, similar to what NumPy offers in Python.

One promising candidate could be ndarray.
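To make the memory trade-off concrete, here is a minimal, std-only sketch of one alternative layout. It is not lingua-rs code and does not use ndarray: the `CompactModel` type and its methods are hypothetical names, and the input shape (`HashMap<String, f64>` from n-gram to probability) is an assumption about how the parsed models look. The idea is to store all n-grams in one contiguous buffer, sorted, with parallel offset and probability arrays, and to look entries up by binary search instead of hashing.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

/// Hypothetical read-only n-gram table (not part of lingua-rs).
/// All keys share one string buffer, so there is no per-key allocation
/// and no hash-table overhead; lookups cost O(log n) instead of O(1).
pub struct CompactModel {
    text: String,      // all n-grams concatenated in sorted order
    offsets: Vec<u32>, // offsets[i]..offsets[i + 1] delimits the i-th n-gram
    probs: Vec<f32>,   // probs[i] is the probability of the i-th n-gram
}

impl CompactModel {
    /// Builds the compact table from an ordinary map (assumed input shape).
    pub fn from_map(map: &HashMap<String, f64>) -> Self {
        let mut keys: Vec<&String> = map.keys().collect();
        keys.sort(); // byte-wise order, the same order `get` searches in

        let mut model = CompactModel {
            text: String::new(),
            offsets: vec![0],
            probs: Vec::with_capacity(keys.len()),
        };
        for key in keys {
            model.text.push_str(key);
            model.offsets.push(model.text.len() as u32);
            // Assumption: f32 precision is acceptable for the probabilities.
            model.probs.push(map[key] as f32);
        }
        model
    }

    /// Returns the i-th n-gram as a slice of the shared buffer.
    fn ngram_at(&self, i: usize) -> &str {
        &self.text[self.offsets[i] as usize..self.offsets[i + 1] as usize]
    }

    /// Binary search over the sorted n-grams.
    pub fn get(&self, ngram: &str) -> Option<f32> {
        let (mut lo, mut hi) = (0, self.probs.len());
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            match self.ngram_at(mid).cmp(ngram) {
                Ordering::Less => lo = mid + 1,
                Ordering::Greater => hi = mid,
                Ordering::Equal => return Some(self.probs[mid]),
            }
        }
        None
    }
}
```

Per entry this layout costs the n-gram's bytes plus 8 bytes (one `u32` offset and one `f32`), at the price of slower lookups than a hash map. Whether that trade-off pays off for the real models would need measuring.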

ghost commented 1 year ago

Which files? If you need the processing in Python or in JavaScript (Node), I can work on a Google Protocol Buffers format. I am quite sure the persisted model would be much lighter, and the processing might be fast as well, though I do not know. Anyway, I'm glad to help. I'm happy that you provide a JS binding as well; I'm looking for fast language detection that runs on Node. Thanks

ghost commented 1 year ago

I know this only goes halfway, since you were asking for a better data structure to gain processing time. But for large models in memory, here is a solution:

I changed the format a little, from the regular Map<string: string> to Map<number[]: string[]>. I assume you treat the data that way anyway, so hopefully this is not a problem.
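The shape change described here can be sketched as follows. This is a Rust illustration only, not the JavaScript/protobuf code from the linked repository. It assumes that each key of the original map is a fraction string such as "1/100" and each value is a whitespace-separated list of n-grams; `FrequencyGroup` and `to_typed_groups` are hypothetical names.

```rust
use std::collections::HashMap;

/// One group of n-grams that share the same relative frequency.
/// Corresponds to one `number[] -> string[]` entry of the new format.
#[derive(Debug, PartialEq)]
pub struct FrequencyGroup {
    pub fraction: [u32; 2],   // [numerator, denominator]
    pub ngrams: Vec<String>,  // n-grams with exactly this frequency
}

/// Converts the assumed string-to-string layout into typed groups.
/// Returns `None` if any key is not of the form "numerator/denominator".
pub fn to_typed_groups(raw: &HashMap<String, String>) -> Option<Vec<FrequencyGroup>> {
    let mut groups = Vec::with_capacity(raw.len());
    for (fraction, ngrams) in raw {
        let (num, den) = fraction.split_once('/')?;
        groups.push(FrequencyGroup {
            fraction: [num.trim().parse().ok()?, den.trim().parse().ok()?],
            ngrams: ngrams.split_whitespace().map(str::to_owned).collect(),
        });
    }
    // HashMap iteration order is arbitrary, so sort for a stable result.
    groups.sort_by_key(|g| g.fraction);
    Some(groups)
}
```

Storing the fraction as two integers instead of text is what makes a compact binary encoding such as Protocol Buffers worthwhile, since integers serialize more tightly than their decimal string form.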

Here is a working example in JavaScript/Node: https://github.com/bacloud23/lingua-rs-bigrams

So here is how it goes:

Drawback: a new dependency on protobufjs.

getreu commented 1 year ago

@ghost: By how much does your solution reduce the binary size?