komodojp / tinyld

Simple and Performant Language detection library for NodeJS
https://komodojp.github.io/tinyld/
MIT License

Language detection error #14

Closed erikvullings closed 2 years ago

erikvullings commented 2 years ago

I like your library/tool as it is simple to use, compact, and generally produces good results, but I do have an issue/question. Why does it classify the following block as lt?

tinyld "Russia has been stripped of a multitude of tournaments following an IOC recommendation\n\nRussian Sports Minister Oleg Matytsin has warned the world of sport that the lack of competition from banned Russian athletes is harmful for all concerned, while stating the precise number of events which his country has been stripped of due to the conflict in Ukraine.\n\nRussia has lost major sporting showpieces in recent months following a recommendation from the International Olympic Committee (IOC) at the end of February that federations should neither invite Russian athletes to competitions nor host tournaments in the country.\n\nThat has led to Russia being deprived of events such as the UEFA Champions League final, which was scheduled for St. Petersburg in May, and the world championships in volleyball and ice hockey, planned for 2022 and 2023 respectively.\n\nSports Minister Matytsin has now put an exact figure on the number of events removed from Russia.\n\n“As of May 25, international sports organizations canceled/postponed 186 international sporting events planned in Russia in 2022-2023, including 36 major international sporting events,” said Matytsin, who has been heading a Russian delegation on a visit to India.\n\nThe minister added that Russian sports officials had been tasked with seeking compensation for canceled events – something the likes of the Russian Football Union (RFU) has already said it will do with UEFA and FIFA.\n\nBut as Russian and Belarusian athletes face widespread bans, Matytsin warned that it was not only athletes from the two countries who would suffer.\n\n“This theory [of a damaging absence of competition] applies not only to us, but to all world sports – the lack of competition with Russian athletes is harmful,” Matytsin said, as quoted by RIA Sport.\n\nMatytsin has previously cautioned that world sport cannot hope to develop “normally” without the participation of Russian athletes, arguing that various federations had already come 
to realize their errors in attempting to alienate Russian sport.\n\nOn the flip side, Matytsin said on Thursday that Russia was also stepping up its efforts to hold tournaments for its athletes and those from other countries.\n\n“From February to May 2022, more than 30 international competitions were held in Russia,” said the minister.\n\nAfter Russian athletes were banned on the eve of the Beijing Winter Paralympics in March, Russia promptly arranged an alternative event at the Siberian resort of Khanty-Mansiysk – something it has vowed to continue to do.\n\nMatytsin has taken the opportunity of his visit to India to discuss the strengthening of sporting ties between the two countries, suggesting that Russia would be more than willing to help India with organizing a future edition of the Olympic Games, should it be granted hosted rights."
[
  { lang: 'lt', accuracy: 0.7009174311926606 },
  { lang: 'en', accuracy: 0.29908256880733947 }
]

kefniark commented 2 years ago

Sorry for the late answer, summer vacation got in the way 😄

After investigation, it looks like the detection on this specific text is thrown off by the family name Matytsin, which contains a sequence of letters that is unusual for English and is repeated 8 times. In a more recent build I slightly increased the number of chunks analyzed for long texts, which reduces the risk of this happening.

For this text, I get the following result with version 1.3.0:

[
  { lang: 'en', accuracy: 0.6393910561370124 },
  { lang: 'lt', accuracy: 0.36060894386298764 }
]
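
Both results above are fairly close calls, so code that consumes this output may want to treat a narrow lead as "uncertain" instead of trusting the first entry. Here is a minimal sketch of that idea. It assumes only the `{ lang, accuracy }` shape printed in this thread; `pickConfident` and `minMargin` are hypothetical names for illustration and are not part of tinyld.

```typescript
// Shape of each entry printed in this thread: a language code plus a score.
interface Guess {
  lang: string;
  accuracy: number;
}

// Hypothetical helper (not part of tinyld): return the top language only when
// it beats the runner-up by at least `minMargin`; otherwise return null.
function pickConfident(guesses: Guess[], minMargin: number): string | null {
  if (guesses.length === 0) return null;
  // Sort a copy by descending score so the caller's array is left untouched.
  const ranked = [...guesses].sort((a, b) => b.accuracy - a.accuracy);
  const runnerUp = ranked.length > 1 ? ranked[1].accuracy : 0;
  return ranked[0].accuracy - runnerUp >= minMargin ? ranked[0].lang : null;
}

// The two results quoted in this thread.
const before: Guess[] = [
  { lang: 'lt', accuracy: 0.7009174311926606 },
  { lang: 'en', accuracy: 0.29908256880733947 },
];
const after: Guess[] = [
  { lang: 'en', accuracy: 0.6393910561370124 },
  { lang: 'lt', accuracy: 0.36060894386298764 },
];

console.log(pickConfident(before, 0.5)); // null (lead is about 0.40)
console.log(pickConfident(after, 0.5));  // null (lead is about 0.28)
console.log(pickConfident(after, 0.2));  // en
```

Note that a margin check only flags uncertainty. It cannot correct a wrong guess: the original `lt` result actually has a wider lead than the corrected `en` one, so the threshold is an arbitrary choice that would need tuning on real data.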