samehuman / cld2

Automatically exported from code.google.com/p/cld2

No languages output despite isReliable=True #1

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I'm using Mike McCandless' Python binding to cld2. I originally reported this 
issue to him, and he suggested I report it here (see 
https://code.google.com/p/chromium-compact-language-detector/issues/detail?id=15
).

The issue is that for a particular input string, cld2 reports that the 
prediction is reliable, but the set of languages detected is empty.

What steps will reproduce the problem?
1. import cld2
2. cld2.detect('interaktive infografik \xc3\xbcber videospielkonsolen')

What is the expected output? What do you see instead?
The output is 

(True, 49, ())

What version of the product are you using? On what operating system?
Python 2.7.3 (default, Aug  1 2012, 05:14:39) 
[GCC 4.6.3] on linux2

cld2 was built using SVN rev 63, 
cld python module was built using hg changeset b1cad3f04ef4

Original issue reported on code.google.com by saf...@gmail.com on 6 Aug 2013 at 3:08

GoogleCodeExporter commented 9 years ago
Working as intended.
The inner call of CLD2 returns the total number of bytes of text found, a list 
of three languages, their three percentages of the total text bytes, and a 
reliability Boolean.

There are several degenerate cases possible:
- UNKNOWN_LANGUAGE is a valid language and may show up in the list (some 
webcruft strings such as "http" or "jpg" may deliberately match 
UNKNOWN_LANGUAGE to prevent them from falsely indicating Somali or somesuch).
- The three percentages in general will total less than 100%, implying that the 
remainder of the text is UNKNOWN_LANGUAGE.
- The percentage of the top language might be small but non-zero, meaning that 
any other detected languages are a smaller percentage and the rest is unknown.
- The percentage of the top language might be 0%, meaning 100% unknown.
- Several languages are detected, but their scores differ too slightly, or they 
score much too low or much too high compared to real text in each language, so 
the reliability Boolean is set to false.
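
As a rough illustration of the cases listed above, a caller could label a result along these lines. This is only a sketch: the describe() helper, its argument layout, and the returned labels are hypothetical and are not part of CLD2 or the Python binding; the language names and percentages in the sample calls are made up.

```python
UNKNOWN = "UNKNOWN_LANGUAGE"  # name taken from the comment above


def describe(languages, percents, is_reliable):
    """Label one detection result (hypothetical helper, not a CLD2 API)."""
    # Keep only real languages that received a non-zero share of the text.
    known = [(lang, pct) for lang, pct in zip(languages, percents)
             if lang != UNKNOWN and pct > 0]
    if not known:
        # Top language is 0% or UNKNOWN: the text is 100% unknown.
        return "all unknown"
    if not is_reliable:
        return "unreliable"
    # Whatever the known languages do not cover is implicitly unknown.
    remainder = 100 - sum(pct for _, pct in known)
    if remainder > 0:
        return "partly unknown (%d%%)" % remainder
    return "fully identified"


# Made-up sample inputs, for illustration only.
print(describe([UNKNOWN, UNKNOWN, UNKNOWN], [0, 0, 0], True))
print(describe(["GERMAN", "LATVIAN", UNKNOWN], [60, 30, 0], True))
```

The first call corresponds to the situation in this report: the reliability flag is true, yet no known language has any share of the text.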

In your particular example, only four letter groups score:
  fogr fik_ _über_ spie

Other letter groups such as 
  _inte tive_ _info _vide deos 
occur in so many different languages that they are ignored. The letter sequence 
"_über_" is strongly German, but not much else is, so the German language 
score is too low for a normal 49 bytes of German. (And 49 bytes is too low for 
CLD2 to do well -- two sentences is a more reasonable amount of input; CLD2's 
design center is real text from web pages, not 1-4 word fragments from searches 
or Twitter or suchlike.) 

The letter sequence "spie" occurs about equally in German and Latvian, so the 
overall score separation between those two ends up too low. In the end, both 
languages are dropped entirely with too few useful table hits, leaving 100% 
"other". The reliability bit is essentially over the null set of returned 
languages in this case.

I haven't looked carefully at the Python wrapper, but Mike may want to expose 
the percentages or set the reliability bool to false in more of the degenerate 
cases above. /dick

Original comment by dsi...@google.com on 9 Aug 2013 at 6:17

GoogleCodeExporter commented 9 years ago

Original comment by dsi...@google.com on 9 Aug 2013 at 6:39

GoogleCodeExporter commented 9 years ago
OK, I fixed the Python bindings to always return 3 languages even when some of 
them are UNKNOWN (previously I would skip UNKNOWN), and added a test case.
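
The difference between skipping and keeping UNKNOWN entries can be sketched as follows. This is not the binding's actual code: both function names and the (name, percent) entry layout are assumptions made for the illustration.

```python
UNKNOWN = "UNKNOWN_LANGUAGE"


def details_skipping_unknown(top_three):
    """Earlier behaviour as described above: UNKNOWN entries are dropped."""
    return tuple(entry for entry in top_three if entry[0] != UNKNOWN)


def details_keeping_unknown(top_three):
    """Behaviour after the fix as described above: all three entries stay."""
    return tuple(top_three)


# Hypothetical (name, percent) entries for an all-unknown result.
all_unknown = [(UNKNOWN, 0), (UNKNOWN, 0), (UNKNOWN, 0)]

print(details_skipping_unknown(all_unknown))      # empty tuple, as in the report
print(len(details_keeping_unknown(all_unknown)))  # three entries
```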

Original comment by luc...@mikemccandless.com on 9 Aug 2013 at 8:09

GoogleCodeExporter commented 9 years ago
I'm off on vacation in upstate Wisconsin for a week, back on the 20th. At
that time, I plan to tweak CLD2 to return unreliable if the top language is
less than 2% of the total text -- this will also cover the all-unknown case.
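
The planned rule can be sketched like this. The function name and signature are hypothetical, and this is not CLD2's implementation; only the 2% threshold comes from the comment above.

```python
def reliable_after_threshold(is_reliable, top_percent, min_percent=2):
    """Hypothetical sketch: downgrade reliability when the top language
    covers less than min_percent of the total text."""
    return is_reliable and top_percent >= min_percent


# An all-unknown result has a top percentage of 0, so it becomes unreliable.
print(reliable_after_threshold(True, 0))
# A top language covering most of the text keeps its original flag.
print(reliable_after_threshold(True, 95))
```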


Original comment by dsi...@google.com on 10 Aug 2013 at 8:04

GoogleCodeExporter commented 9 years ago
Updated to return is_reliable=false if top language is UNKNOWN_LANGUAGE.

Original comment by dsi...@google.com on 20 Aug 2013 at 9:22