Open WilliamTambellini opened 4 years ago
Some more examples:
text: Bonjour language: no probability: 0.933857 reliable: 1 proportion: 1
text: Bonjour le monde language: fr probability: 0.999432 reliable: 1 proportion: 1
I'm finding similar results with the Ruby cld3 gem.
CLD3::NNetLanguageIdentifier.new(0, 1000).find_language('hello world')
=> #<struct Struct::Result language=:ky, probability=0.7191877961158752, :reliable?=true, proportion=1.0>
CLD3::NNetLanguageIdentifier.new(0, 1000).find_language('1 hour ago')
=> #<struct Struct::Result language=:pt, probability=0.9975216388702393, :reliable?=true, proportion=1.0>
I found that it works well for English and German when the text is longer than 30 characters, but a documented minimum input length would be helpful.
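Until a minimum is documented, one possible workaround is to guard the call yourself. Below is a minimal sketch, not part of the cld3 gem: `guarded_language`, `MIN_CHARS` and `MIN_PROBABILITY` are hypothetical names, the 30-letter threshold is just the observation above, and the 0.9 cut-off is an arbitrary assumption. It only relies on `find_language` returning an object with `language`, `probability` and `reliable?`, as in the outputs in this thread.

```ruby
# Hypothetical helper, NOT part of the cld3 gem.
# Assumption: inputs with fewer than MIN_CHARS letters are too short to trust.
MIN_CHARS = 30
# Assumption: arbitrary confidence cut-off, tune it for your own data.
MIN_PROBABILITY = 0.9

# identifier: any object responding to find_language(text), for example
#             CLD3::NNetLanguageIdentifier.new(0, 1000)
# Returns the detected language symbol, or nil when the input is too short
# or the result is not confident enough.
def guarded_language(identifier, text, min_chars: MIN_CHARS, min_probability: MIN_PROBABILITY)
  # Count letters only, so digits, IDs and punctuation do not inflate the length.
  letters = text.scan(/\p{L}/).length
  return nil if letters < min_chars

  result = identifier.find_language(text)
  return nil if result.nil?
  return nil unless result.reliable? && result.probability >= min_probability

  result.language
end
```

With these assumed thresholds, 'hello world' and '1 hour ago' are rejected on length alone, before the identifier is even called, and a result with probability 0.78 is rejected by the cut-off even though it is flagged reliable.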
Another example from v3.6.0:
CLD3::NNetLanguageIdentifier.new(0, 1000).find_language("User ID FA1324102A6E2C72 How to add my name on leader board?")
=> #<struct CLD3::NNetLanguageIdentifier::Result language=:ja, probability=0.7837570905685425, reliable?=true, proportion=1.0, byte_ranges=[]>
CLD3::NNetLanguageIdentifier.new(0, 1000).find_language("AAAA AA A0000000A0A0AAA AAA AA AAA AA AAAA AA AAAAAA")
=> #<struct CLD3::NNetLanguageIdentifier::Result language=:ja, probability=0.7837570905685425, reliable?=true, proportion=1.0, byte_ranges=[]>
How in the world is that text detected as JA? 🤯
Hi, could anyone confirm that short inputs are usually not correctly identified by CLD3? Some examples:
text: Hello language: sr probability: 0.830728 reliable: 1 proportion: 1
text: Hello world language: ky probability: 0.719188 reliable: 1 proportion: 1
text: Hello my world language: ky probability: 0.521224 reliable: 0 proportion: 1
text: Hello my great world language: ja probability: 0.278577 reliable: 0 proportion: 1
text: Hello the great world of Artificial Intelligence language: en probability: 0.980107 reliable: 1 proportion: 1
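To repeat this kind of experiment from Ruby, something like the sketch below could be used. `prefix_report` is a hypothetical helper, not part of the cld3 gem. It assumes only that the identifier responds to `find_language` and returns a non-nil object with `language`, `probability` and `reliable?`. It runs the detector on growing word prefixes of a sentence so you can see at which length the result stabilizes.

```ruby
# Hypothetical helper, NOT part of the cld3 gem.
# identifier: any object responding to find_language(text), for example
#             CLD3::NNetLanguageIdentifier.new(0, 1000)
# Returns one row per word prefix of the sentence, shortest first.
def prefix_report(identifier, sentence)
  words = sentence.split
  (1..words.length).map do |n|
    prefix = words.first(n).join(' ')
    result = identifier.find_language(prefix)
    {
      text: prefix,
      language: result.language,
      probability: result.probability,
      reliable: result.reliable?
    }
  end
end

# Usage (requires the cld3 gem, shown as a comment so the sketch stays standalone):
#   require 'cld3'
#   identifier = CLD3::NNetLanguageIdentifier.new(0, 1000)
#   prefix_report(identifier, 'Hello the great world of Artificial Intelligence')
#     .each { |row| puts row }
```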