surprising scores on short strings

blackmad commented 1 year ago

Hi! We've been playing with tinyld to identify the language of short search queries, and we've been a little surprised that strings that seem pretty clearly English to us get such low accuracy scores. Is it a known limitation that tinyld struggles with short text?

"search sprint 1" gives us:

Merge Results [
  { lang: 'ga', accuracy: 0.08333333333333333 },
  { lang: 'et', accuracy: 0.044066666666666664 },
  { lang: 'ro', accuracy: 0.03285 },
  { lang: 'es', accuracy: 0.030449999999999994 },
  { lang: 'en', accuracy: 0.014425000000000002 }
]

With only=en, we get an accuracy of 0.117 for English on that string.

new hire onboarding, only=en -> 0.058

codebase modularization, only=en -> 0
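For short search queries like these, one way to avoid acting on a weak guess is to treat low scores as "unknown". The following is a minimal sketch under stated assumptions, not part of tinyld: `pickLanguage` and the `Detection` type are hypothetical names, the result shape is copied from the output quoted above, and the threshold values are arbitrary choices for illustration.

```typescript
// Shape of each entry in the results quoted above (assumed from that output).
interface Detection {
  lang: string;
  accuracy: number;
}

// Hypothetical helper, not a tinyld API: return the top-scoring language only
// when its score clears minAccuracy, otherwise report "unknown".
function pickLanguage(results: Detection[], minAccuracy: number): string {
  if (results.length === 0) return "unknown";
  // Copy before sorting so the caller's array is left untouched.
  const best = results.slice().sort((a, b) => b.accuracy - a.accuracy)[0];
  return best.accuracy >= minAccuracy ? best.lang : "unknown";
}

// Scores reported above for "search sprint 1".
const searchSprint: Detection[] = [
  { lang: "ga", accuracy: 0.08333333333333333 },
  { lang: "et", accuracy: 0.044066666666666664 },
  { lang: "ro", accuracy: 0.03285 },
  { lang: "es", accuracy: 0.030449999999999994 },
  { lang: "en", accuracy: 0.014425000000000002 },
];

console.log(pickLanguage(searchSprint, 0.5)); // "unknown"
console.log(pickLanguage(searchSprint, 0.05)); // "ga"
```

With a strict threshold the query is reported as "unknown" rather than Irish, which may be a safer default for search queries. The right threshold would have to be tuned on real data.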

kefniark commented 1 year ago

Yes, it's a normal problem. To avoid repeating myself, I created a FAQ and answered this here

I have a few ideas that could improve accuracy on shorter strings, but nothing magical.

blackmad commented 1 year ago

Thanks for the link, and apologies that we hadn't found it already.

On Thu, Nov 10, 2022 at 12:15 AM Kevin Destrem @.***> wrote:

Closed #19 https://github.com/komodojp/tinyld/issues/19 as completed.


-- David Blackman creative technologist & wandering help me find my purpose http://purpose.blackmad.com
