komodojp / tinyld

Simple and Performant Language detection library for NodeJS
https://komodojp.github.io/tinyld/
MIT License

English sentence seeing different results in all 3 versions #27

Open thewilkybarkid opened 6 months ago

thewilkybarkid commented 6 months ago

We're using the heavy version (v1.3.4), and I've just spotted that "A population perspective on international students in Australian universities" is detected as fr rather than en:

FR 4.02%
EN 3.65%
LA 2.37%
FI 2.13%
LV 2.06%

Looking at the Playground, it would be recognised as lv using the normal version:

LV 2.06%
FR 1.66%
FI 1.48%
ET 1.45%
EN 0.89%

And only correct using the light version:

EN 3.33%
FR 2.29%
NL 1.72%
FI 1.45%
IT 1.45%

I don't know much about Tatoeba. When we see an incorrect detection, would it make sense to add the sentence there and hope that it triggers a tweak in this library? (A few other open issues are like this; could there be some guidance about what to do?)

thewilkybarkid commented 6 months ago

Found a couple more:

"DRAFT: Developing and implementing the semantic interoperability recommendations of the EOSC Interoperability Framework" is confidently detected as la rather than en in the heavy and normal versions; this looks to be triggered by 'EOSC'. I might be able to strip out acronyms/initialisms on our side, which makes it en in all 3 versions.

"Sardegna grassland mapping for livestock management: a practical Intra-Annual NDVI contrasts approach" is confidently detected as lt in the heavy version, fr in the normal and en only in the light. Removing the initialism ('NDVI') makes it fr in the heavy, fr in the normal and en in the light.
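
For anyone wanting to try the same workaround, here is a rough sketch of the kind of pre-processing I mean. `stripInitialisms` is a hypothetical helper on our side, not part of tinyld, and it assumes an initialism is a standalone run of two or more ASCII capitals (optionally with a plural "s"). That also removes shouted words such as "DRAFT", and as the NDVI example shows, stripping alone does not always lead to a correct result.

```typescript
// Hypothetical helper (not part of tinyld): remove tokens that look like
// acronyms/initialisms before passing the text to a language detector.
// Assumption: an initialism is a standalone run of 2+ ASCII uppercase
// letters, optionally followed by a lowercase plural "s" (e.g. "APIs").
function stripInitialisms(text: string): string {
  return text
    .replace(/\b[A-Z]{2,}s?\b/g, ' ') // drop the all-caps token
    .replace(/\s+/g, ' ')             // collapse the whitespace left behind
    .trim();
}

const title = 'a practical Intra-Annual NDVI contrasts approach';
console.log(stripInitialisms(title));
// prints "a practical Intra-Annual contrasts approach"
```

The cleaned string can then be handed to the detector (for example tinyld's `detect`) in place of the original title. Single capitals such as the leading "A" and mixed-case words such as "Intra-Annual" are left alone.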