malcolmgreaves / language-detection

Automatically exported from code.google.com/p/language-detection. Some after-the-fact modifications to get this working within sbt.
Apache License 2.0

regression: "no features in text" #30

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run java -jar lib/langdetect.jar --detectlang -d profiles cdebconf-km with 
the 09-13-2011 version

What is the expected output? What do you see instead?
Expected: cdebconf-km:[km:0.9999998969777439] (from 11-18-2010 version)
Actual: com.cybozu.labs.langdetect.LangDetectException: no features in text

What version of the product are you using? On what operating system?
09-13-2011 on Ubuntu lucid

Please provide any additional information below.

There has been a regression between 11-18-2010 and 09-13-2011 versions.  A 
large number of files that detect correctly with the earlier version now show 
"no features in text" in the later version. I have attached an example of such 
a file.

Original issue reported on code.google.com by saf...@gmail.com on 6 Dec 2011 at 2:33

GoogleCodeExporter commented 9 years ago
This is Khmer, isn't it?
The 11-18-2010 version bundled an experimental Khmer profile by mistake, which 
is why it could detect the text.
The current version doesn't bundle it (because there is no test data), and 
langdetect can't extract features that are not contained in any loaded profile, 
so the exception is raised.
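
To illustrate the explanation above, here is a minimal, self-contained sketch of how a detector that only keeps n-grams found in its loaded profiles can end up with nothing to score. It is not langdetect's actual code: the class name, the NoFeaturesException type, and the n-gram lengths (1 to 3) are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NoFeaturesSketch {

    /** Hypothetical stand-in for the library's LangDetectException. */
    static class NoFeaturesException extends RuntimeException {
        NoFeaturesException(String message) {
            super(message);
        }
    }

    /** Collects every character n-gram of length 1 to 3 (assumed range). */
    static List<String> extractNgrams(String text) {
        List<String> ngrams = new ArrayList<>();
        for (int n = 1; n <= 3; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                ngrams.add(text.substring(i, i + n));
            }
        }
        return ngrams;
    }

    /**
     * Keeps only the n-grams that appear in at least one loaded profile.
     * If nothing survives, there is nothing to score, so we fail.
     */
    static List<String> knownFeatures(String text, Set<String> profileNgrams) {
        List<String> features = new ArrayList<>();
        for (String ngram : extractNgrams(text)) {
            if (profileNgrams.contains(ngram)) {
                features.add(ngram);
            }
        }
        if (features.isEmpty()) {
            throw new NoFeaturesException("no features in text");
        }
        return features;
    }

    public static void main(String[] args) {
        // Toy "profile" containing only Latin-script n-grams (illustrative data).
        Set<String> latinOnly = new HashSet<>(Arrays.asList("t", "th", "the"));

        System.out.println(knownFeatures("the", latinOnly));

        try {
            // Khmer text shares no n-grams with the Latin-only profile.
            knownFeatures("ភាសាខ្មែរ", latinOnly);
        } catch (NoFeaturesException e) {
            System.out.println("Khmer input: " + e.getMessage());
        }
    }
}
```

With a Khmer profile loaded, the second call would find matching n-grams and the failure would not occur, which matches the behaviour of the 11-18-2010 version described here.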

If you want to detect Khmer, copy the profiles/km file from the 11-18-2010 
version into a new profile directory alongside the current profiles.
The Khmer language profile is not tested, but I expect it to work well because 
Khmer has its own alphabet! :D
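
The workaround above could be scripted roughly as follows. This is only a sketch using the standard java.nio.file API: the class and method names are hypothetical, and the three directory paths are whatever you pass on the command line. The only detail taken from this thread is that the Khmer profile is a file named km inside the 11-18-2010 profiles directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class AddKhmerProfile {

    /**
     * Copies every regular file from currentProfiles into targetDir, then adds
     * the single "km" profile file taken from the older release's directory.
     */
    static void buildProfileDir(Path currentProfiles, Path oldProfiles, Path targetDir) {
        try {
            Files.createDirectories(targetDir);

            List<Path> profileFiles;
            try (Stream<Path> listing = Files.list(currentProfiles)) {
                profileFiles = listing.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            for (Path file : profileFiles) {
                Files.copy(file, targetDir.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }

            // The untested Khmer profile from the older release.
            Files.copy(oldProfiles.resolve("km"), targetDir.resolve("km"),
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("usage: AddKhmerProfile <current-profiles> <old-profiles> <target-dir>");
            return;
        }
        buildProfileDir(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
    }
}
```

The command from the original report can then be pointed at the new directory through its -d option instead of the default profiles directory.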

Original comment by nakatani.shuyo on 6 Dec 2011 at 10:31

GoogleCodeExporter commented 9 years ago
Aha, I see. I have no specific interest in the Khmer language, but I am 
carrying out a large-scale comparison of off-the-shelf language identification 
systems.

So far, I have been comparing:

1) langid.py (my system!) 
http://www.csse.unimelb.edu.au/research/lt/resources/langid/
2) your system
3) TextCat (http://www.let.rug.nl/vannoord/TextCat/)
4) Chromium CLD (http://code.google.com/p/chromium-compact-language-detector/)
5) Google's langid API
6) Microsoft's langid API

I noticed that I was using a year-old version of your system, so I upgraded to 
the latest version and was surprised to find that performance dropped on many 
datasets. If you removed some languages from consideration, this would explain 
the drop in performance. For your reference, on the datasets I have tested that 
include KM, your system attained 97-99% accuracy for it.

Original comment by saf...@gmail.com on 6 Dec 2011 at 10:56

GoogleCodeExporter commented 9 years ago
So what is the current accuracy of each of the libraries?

Original comment by dennis97...@gmail.com on 11 Aug 2014 at 4:50

GoogleCodeExporter commented 9 years ago
That really depends on the target data, but here[1] is my most recent paper 
comparing the accuracy of eight off-the-shelf systems on Twitter messages, 
including my own langid.py and Shuyo's language-detection (the repository 
where this message is posted).

[1] http://aclweb.org/anthology/W/W14/W14-1303.pdf

Original comment by saf...@gmail.com on 11 Aug 2014 at 11:23

GoogleCodeExporter commented 9 years ago
Thank you very much! Will read it when I'm free.

Original comment by dennis97...@gmail.com on 15 Aug 2014 at 10:18