pemistahl / lingua-go

The most accurate natural language detection library for Go, suitable for short text and mixed-language text
Apache License 2.0

Find more memory-efficient data structure for language models #18

Closed. pemistahl closed this issue 2 weeks ago.

pemistahl commented 1 year ago

Currently, the language models are loaded into simple maps at runtime. Even though accessing the maps is pretty fast, they consume a significant amount of memory. The goal is to investigate whether there are more suitable data structures that take up less memory, similar to what NumPy offers for Python.

One promising candidate could be Gonum.

goldsam commented 1 year ago

I think you will benefit from using a trie (not tree) data structure. Here is a Go implementation you may be able to use as a drop-in replacement for the map.

TomDeneire commented 3 months ago

I wonder whether using a SQLite database to store your frequencies would help. That way, you would only need to open a database connection instead of loading full frequency maps into memory. With indexing and query optimization (SQL could handle part of the logic of determining the highest frequency), it seems to me it would be pretty fast and have a very low memory footprint.

pemistahl commented 2 weeks ago

Closed in favor of #68.