google / cld3

Apache License 2.0

Bad identification for short input #31

Open WilliamTambellini opened 4 years ago

WilliamTambellini commented 4 years ago

Hi. Could anyone confirm that short inputs are usually not identified correctly by CLD3? Some examples:

text: Hello language: sr probability: 0.830728 reliable: 1 proportion: 1

text: Hello world language: ky probability: 0.719188 reliable: 1 proportion: 1

text: Hello my world language: ky probability: 0.521224 reliable: 0 proportion: 1

text: Hello my great world language: ja probability: 0.278577 reliable: 0 proportion: 1

text: Hello the great world of Artificial Intelligence language: en probability: 0.980107 reliable: 1 proportion: 1

WilliamTambellini commented 4 years ago

Some more examples:

text: Bonjour language: no probability: 0.933857 reliable: 1 proportion: 1

text: Bonjour le monde language: fr probability: 0.999432 reliable: 1 proportion: 1

rstacruz commented 4 years ago

I'm finding similar results with the Ruby cld3 gem.

CLD3::NNetLanguageIdentifier.new(0, 1000).find_language('hello world')
=> #<struct Struct::Result language=:ky, probability=0.7191877961158752, :reliable?=true, proportion=1.0>
CLD3::NNetLanguageIdentifier.new(0, 1000).find_language('1 hour ago')
=> #<struct Struct::Result language=:pt, probability=0.9975216388702393, :reliable?=true, proportion=1.0>

thomasrosen commented 2 years ago

I found that it works well for English and German when the text is longer than 30 characters, but a documented minimum length would be helpful.
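
Until a minimum is documented, one workaround is to refuse to trust results for short inputs on the caller's side. The sketch below is a minimal illustration, not part of the cld3 gem: `detect_language_or_nil`, `MIN_CHARS`, and `MIN_PROBABILITY` are hypothetical names, the 30-character cut-off is simply the observation above, and the 0.9 probability threshold is an arbitrary assumption. It accepts any object that responds to `find_language` and returns a result with `language` and `probability`, as in the outputs quoted earlier in this thread.

```ruby
# Hypothetical helper, not part of the cld3 gem: only trust a detection when
# the input is long enough and the reported probability clears a threshold.
MIN_CHARS = 30          # assumption: from the "longer than 30 chars" observation
MIN_PROBABILITY = 0.9   # assumption: arbitrary cut-off for illustration

# `identifier` is any object responding to `find_language(text)` and returning
# nil or an object with `language` and `probability`.
def detect_language_or_nil(identifier, text,
                           min_chars: MIN_CHARS,
                           min_probability: MIN_PROBABILITY)
  stripped = text.to_s.strip
  # Inputs at or below the minimum length are rejected without calling CLD3.
  return nil if stripped.length <= min_chars

  result = identifier.find_language(stripped)
  return nil if result.nil? || result.probability < min_probability

  result.language
end

# Usage with the gem (requires `gem install cld3`):
#   require 'cld3'
#   identifier = CLD3::NNetLanguageIdentifier.new(0, 1000)
#   detect_language_or_nil(identifier, 'Hello world')  # nil: too short to trust
```

Note that the `reliable` flag alone would not have caught the examples above ("Hello" and "Bonjour" are both reported as reliable), which is why this sketch checks length first.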

mariozaizar commented 7 months ago

Another example from v3.6.0:

CLD3::NNetLanguageIdentifier.new(0, 1000).find_language("User ID FA1324102A6E2C72 How to add my name on leader board?")
=> #<struct CLD3::NNetLanguageIdentifier::Result language=:ja, probability=0.7837570905685425, reliable?=true, proportion=1.0, byte_ranges=[]>

CLD3::NNetLanguageIdentifier.new(0, 1000).find_language("AAAA AA A0000000A0A0AAA AAA AA AAA AA AAAA AA AAAAAA")
=> #<struct CLD3::NNetLanguageIdentifier::Result language=:ja, probability=0.7837570905685425, reliable?=true, proportion=1.0, byte_ranges=[]>

How in the world is that text detected as JA? 🤯
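
One thing that may be worth trying for inputs like this is removing ID-like tokens before detection, since the hex string and digits carry no language signal. The snippet below is only a sketch of that pre-processing step: `strip_id_like_tokens` is a hypothetical helper, not part of the cld3 gem, and whether it changes CLD3's answer for this input has not been tested here.

```ruby
# Hypothetical pre-processing step, not part of the cld3 gem: drop every
# whitespace-separated token that contains a digit (IDs, hex strings, counters).
def strip_id_like_tokens(text)
  text.split.reject { |token| token.match?(/\d/) }.join(' ')
end

cleaned = strip_id_like_tokens(
  'User ID FA1324102A6E2C72 How to add my name on leader board?'
)
puts cleaned
# prints "User ID How to add my name on leader board?"

# The cleaned string could then be passed to the identifier as before:
#   CLD3::NNetLanguageIdentifier.new(0, 1000).find_language(cleaned)
```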