Differentiate Similar language

komodojp / tinyld

Simple and Performant Language detection library for NodeJS

https://komodojp.github.io/tinyld/

MIT License


Differentiate Similar language #1

Closed (kefniark closed 2 years ago)

kefniark commented 3 years ago

Description

Some pairs of languages are always at the top of the detection errors:

pt -> es : 13.9375% (error: 1355)
en -> nl : 15.8565% (error: 884)
pt -> it : 5.431% (error: 528)
ru -> uk : 2.4686% (error: 240)

And all of them make sense: Dutch and English are really close, and the same goes for Portuguese and Spanish.

The idea is to find a way to reduce the error rate by putting some extra weight on n-grams that appear in only one language of the pair.

kefniark commented 2 years ago

Started to investigate the idea of pre-building small n-gram dictionaries to identify grams unique to one language in a family.

Only make dictionaries for language groups with a high error rate.
In the algorithm, it would be a 4th step, at the end of the process.
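
To make the dictionary idea concrete, here is a minimal sketch of how such a "unique gram" dictionary could be pre-built for one language group. It is not tinyld's actual code: the names `extractGrams` and `buildUniqueGrams`, the trigram size, and the toy corpus are all assumptions made for illustration.

```typescript
// Hypothetical helper: lowercase the text and collect its character n-grams.
function extractGrams(text: string, size: number = 3): Set<string> {
  const grams = new Set<string>();
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  for (let i = 0; i + size <= normalized.length; i++) {
    grams.add(normalized.slice(i, i + size));
  }
  return grams;
}

// For each language of a group, keep only the grams that no other
// language of the same group contains.
function buildUniqueGrams(
  corpus: Record<string, string[]>,
  size: number = 3
): Map<string, Set<string>> {
  // First pass: collect every gram seen per language.
  const gramsByLang = new Map<string, Set<string>>();
  for (const lang of Object.keys(corpus)) {
    const grams = new Set<string>();
    for (const sentence of corpus[lang]) {
      extractGrams(sentence, size).forEach((gram) => grams.add(gram));
    }
    gramsByLang.set(lang, grams);
  }

  // Second pass: drop any gram that also shows up in another language.
  const unique = new Map<string, Set<string>>();
  gramsByLang.forEach((grams, lang) => {
    const onlyHere = new Set<string>();
    grams.forEach((gram) => {
      let shared = false;
      gramsByLang.forEach((otherGrams, otherLang) => {
        if (otherLang !== lang && otherGrams.has(gram)) shared = true;
      });
      if (!shared) onlyHere.add(gram);
    });
    unique.set(lang, onlyHere);
  });
  return unique;
}

// Toy corpus (made up for this example, far too small for real use).
const spanishPortuguese = buildUniqueGrams({
  es: ["el niño come"],
  pt: ["o menino não come"],
});
```

With this toy corpus, `iño` ends up only in the Spanish set and `não` only in the Portuguese set, while shared grams such as `com` are dropped from both. A real dictionary would need a much larger corpus, otherwise many grams look "unique" only because the other language was under-sampled.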

Example

Make dictionaries for groups like Spanish-Portuguese or English-Dutch-German, and identify the grams unique to each language in those families.
Then at query time, if a chunk has both Spanish and Portuguese among the possible results, check whether it contains any of those "unique" grams and weight the final percentage accordingly.
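
The query-time step described above could look roughly like the sketch below. It is only an illustration under stated assumptions, not tinyld's implementation: the `Candidate` shape, the `reweight` function, the `bonus` factor, and the tiny hand-written dictionary are all hypothetical.

```typescript
interface Candidate {
  lang: string;
  score: number;
}

// Illustrative, hand-written dictionary (an assumption, not tinyld data):
// a few grams that occur in only one language of the Spanish-Portuguese pair.
const uniqueGramsByLang = new Map<string, string[]>([
  ["es", ["ñ", "ción"]],
  ["pt", ["ã", "ção"]],
]);

// Count non-overlapping occurrences of each gram in the text.
// Different grams may overlap each other; each one is counted separately.
function countHits(text: string, grams: string[]): number {
  const lower = text.toLowerCase();
  let hits = 0;
  for (const gram of grams) {
    let index = lower.indexOf(gram);
    while (index !== -1) {
      hits++;
      index = lower.indexOf(gram, index + gram.length);
    }
  }
  return hits;
}

// Extra final step: boost candidates whose unique grams appear in the text.
function reweight(
  text: string,
  candidates: Candidate[],
  dictionary: Map<string, string[]>,
  bonus: number = 0.5
): Candidate[] {
  // Only act when at least two languages of the group compete.
  const competing = candidates.filter((c) => dictionary.has(c.lang));
  if (competing.length < 2) return candidates;

  const boosted = candidates.map((c) => {
    const grams = dictionary.get(c.lang);
    const hits = grams ? countHits(text, grams) : 0;
    return { lang: c.lang, score: c.score * (1 + bonus * hits) };
  });

  // Rescale so the scores keep their original total.
  const originalTotal = candidates.reduce((sum, c) => sum + c.score, 0);
  const boostedTotal = boosted.reduce((sum, c) => sum + c.score, 0);
  if (boostedTotal === 0) return candidates;

  return boosted
    .map((c) => ({ lang: c.lang, score: (c.score * originalTotal) / boostedTotal }))
    .sort((a, b) => b.score - a.score);
}

// Example: a Portuguese sentence that a detector initially ranks as Spanish.
const reranked = reweight(
  "A informação não está disponível",
  [{ lang: "es", score: 0.55 }, { lang: "pt", score: 0.45 }],
  uniqueGramsByLang
);
```

In this example the Portuguese-only grams are found in the text and no Spanish-only gram is, so `pt` moves ahead of `es` after re-weighting. The multiplicative `bonus` is one arbitrary choice among many; an additive bonus or a hard override would be equally plausible variants.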

kefniark commented 2 years ago

I tried it, and it worked slightly for some pairs of languages but not for others, and it even caused an accuracy drop for some. Overall, the result was far from useful: only +0.25% accuracy for a lot of dedicated code and data. I decided to give up on this and focus on other areas for the moment.