komodojp / tinyld

Simple and Performant Language detection library for NodeJS
https://komodojp.github.io/tinyld/
MIT License
415 stars 12 forks source link

Differentiate Similar language #1

Closed kefniark closed 2 years ago

kefniark commented 3 years ago

Description

Some pair of language are always at the top of the detection errors:

And all of them make sense, dutch and english are really close, same for portuguese and spanish.

The idea is to find a way to reduce the error rate by putting some extra weight on grams in only one language of the pair.

kefniark commented 2 years ago

Started to investigate the idea of pre-building small n-gram dictionaries to identify gram unique to a language in a family.

Example

kefniark commented 2 years ago

Tried and it was slightly working for some pair of languages, but not for other and even cause some accuracy drop for some. And overall the result was far from useful, only +0.25% accuracy for lot of dedicated code and data. I decided to give up on that and focus on other area for the moment