Language detection accuracy score changes with "only" restrict

blackmad commented 1 year ago

Hey all -

We noticed that tinyld was chewing through a lot of CPU, so we wanted to use the "only" parameter of detectAll to restrict the search space, but we were surprised to discover that this changes the accuracy scores.

Is this expected?

"""

tinyld.detectAll("the quick brown fox")
[
  { lang: 'la', accuracy: 0.125 },
  { lang: 'en', accuracy: 0.08101249999999999 },
  { lang: 'es', accuracy: 0.03479375 },
  { lang: 'fr', accuracy: 0.030831249999999998 },
  { lang: 'ca', accuracy: 0.027387500000000002 },
  { lang: 'it', accuracy: 0.013324999999999997 },
  { lang: 'pt', accuracy: 0.004968750000000001 },
  { lang: 'ga', accuracy: 0.003924999999999998 }
]

tinyld.detectAll("the quick brown fox", {only: ["en", "de"]})
[ { lang: 'en', accuracy: 0.1875 } ]

tinyld.detectAll("the quick brown fox", {only: ["en", "de", "la"]})
[
  { lang: 'la', accuracy: 0.125 },
  { lang: 'en', accuracy: 0.08101249999999999 }
]

"""

kefniark commented 1 year ago

Hi there,

Yes, this is kind of expected. Internally the score is just a big unitless number. To make it more usable and understandable, it is normalized to the 0-1 range to become an accuracy percentage (https://github.com/komodojp/tinyld/blob/develop/src/tokenizer.ts#L104). This normalization is a simple linear transformation based on all results, so when you filter out some languages, it changes the normalization. Basically, the accuracy is only there to give a hint and explain the result; it is not an accurate, isolated measurement.
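
To illustrate why restricting the candidate set can shift the reported values, here is a minimal sketch. It is not tinyld's actual code: the `normalize` function, the raw scores, and the "divide by the sum of the candidates" rule are all assumptions chosen only to show the general effect of normalizing over a filtered set.

```typescript
// Hypothetical raw scores: unitless numbers, NOT taken from tinyld internals.
type RawScores = Record<string, number>;

interface Result {
  lang: string;
  accuracy: number;
}

// Assumed stand-in for the real normalization: divide each raw score by the
// sum of the raw scores of the candidates that survive the `only` filter.
function normalize(raw: RawScores, only?: string[]): Result[] {
  const candidates = Object.entries(raw).filter(
    ([lang]) => !only || only.includes(lang)
  );
  const total = candidates.reduce((sum, [, score]) => sum + score, 0);
  if (total === 0) return [];
  return candidates
    .map(([lang, score]) => ({ lang, accuracy: score / total }))
    .sort((a, b) => b.accuracy - a.accuracy);
}

const raw: RawScores = { la: 40, en: 30, es: 20, fr: 10 };

const all = normalize(raw); // 'en' is scored against a total of 100
const restricted = normalize(raw, ["en", "fr"]); // 'en' is scored against 40

console.log(all);
console.log(restricted);
```

In this sketch the raw score for 'en' never changes, yet its normalized value does, because the denominator depends on which languages are kept. Under that assumption, the relative order of the remaining languages is the more reliable signal; a fixed accuracy threshold would need to be re-tuned whenever the `only` list changes.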

The main question here is why it's chewing up your CPU. How are you using it? Maybe there is an unnoticed problem somewhere else, and even if not, there are probably a few ways to fine-tune the library and improve that. But I have never seen such a case; usually the API is sub-millisecond and can run a few thousand times per second, even on a VM. (example)
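
One way to check the per-call cost in your own environment is a small timing loop. The helper below is a hypothetical sketch, not part of tinyld: `averageMsPerCall` and `fakeDetect` are names made up for illustration, and the placeholder detector only exists so the snippet runs on its own.

```typescript
// Generic timing helper: average wall-clock milliseconds per call.
// `detectFn` is whatever detection function you want to measure.
function averageMsPerCall(
  detectFn: (text: string) => unknown,
  samples: string[],
  rounds = 1000
): number {
  if (samples.length === 0 || rounds <= 0) return 0;
  const start = Date.now();
  for (let i = 0; i < rounds; i++) {
    for (const text of samples) detectFn(text);
  }
  // Total elapsed time divided by the total number of calls made.
  return (Date.now() - start) / (rounds * samples.length);
}

// Placeholder detector (an assumption) so the sketch is self-contained.
const fakeDetect = (text: string): string => (text.length > 0 ? "en" : "");

const msPerCall = averageMsPerCall(fakeDetect, ["the quick brown fox"]);
console.log(`${msPerCall.toFixed(4)} ms per call`);
```

To measure the library itself, you would pass something like `(t) => tinyld.detectAll(t)` in place of `fakeDetect`, ideally with text samples and call volumes that match your real workload.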

kefniark commented 1 year ago

Sadly, I never got any feedback on that, so I'm closing the issue for inactivity.

komodojp / tinyld

Language detection accuracy score changes with "only" restrict #18