shuyo / language-detection

This is a language detection library implemented in plain Java. (aliases: language identification, language guessing)
https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md

Detect multi languages in the same doc #7

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Thanks for all the good work you have shared.

Advice is needed on the best way to implement an enhancement that detects two, three, or more languages in the same document. Any guidelines are welcome, and I will try to implement this and share any results. Thanks.

Original issue reported on code.google.com by mawando@googlemail.com on 5 Feb 2011 at 6:17

GoogleCodeExporter commented 9 years ago
Thanks for your comment.
Do you mean that a single document can contain multiple languages?
The current langdetect can only estimate probabilities over the text as a whole, but I would like to be able to detect multilingual text too.

For now, how about splitting the text into paragraphs and running detection on each paragraph?
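A rough sketch of that workaround, assuming the language profiles live in a "profiles" directory (a placeholder path) and using blank lines as a simplistic paragraph rule:

```java
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;

public class ParagraphDetect {
    public static void main(String[] args) throws LangDetectException {
        // Load the language profiles once per process.
        DetectorFactory.loadProfile("profiles");  // placeholder path

        String document = "First paragraph in English...\n\n"
                        + "Deuxième paragraphe en français...";

        // Blank lines delimit paragraphs; each paragraph gets its own detector.
        for (String paragraph : document.split("\n\\s*\n")) {
            Detector detector = DetectorFactory.create();
            detector.append(paragraph);
            System.out.println(detector.detect() + ": " + paragraph);
        }
    }
}
```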

Original comment by nakatani.shuyo on 7 Feb 2011 at 3:17

GoogleCodeExporter commented 9 years ago
That's a good idea. However, if each paragraph itself contains different languages, then we're back to square one. I was thinking more of reporting the top probability as the main language and listing all the other languages above a threshold as additional languages.

For instance: if you pass in a text containing French (50%), Spanish (30%), English (15%), and other languages (5%), then the output could be:

"Main Language: French. Text may also contain Spanish (rough %) and English (rough %)."

The only problem here would be consistency. The probability factor may give the favourite language accurately, but the others may vary depending on the random numbers generated. That is, in the example above, the list of all languages above the threshold could be (fr, es, en), or (fr, es), or (fr, en)... it all depends on the random numbers that have been drawn.
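To make the thresholding concrete, a rough sketch against the library's getProbabilities() API (the 10% cut-off is an arbitrary choice, and as noted the tail of the list is not stable across runs):

```java
import java.util.ArrayList;
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.Language;
import com.cybozu.labs.langdetect.LangDetectException;

public class MixedLanguageReport {
    static final double THRESHOLD = 0.10;  // arbitrary cut-off for "also contains"

    static String report(String text) throws LangDetectException {
        Detector detector = DetectorFactory.create();
        detector.append(text);
        // Languages come back sorted by descending probability.
        ArrayList<Language> candidates = detector.getProbabilities();

        StringBuilder sb = new StringBuilder("Main Language: " + candidates.get(0).lang + ".");
        for (int i = 1; i < candidates.size(); i++) {
            Language lang = candidates.get(i);
            if (lang.prob >= THRESHOLD) {
                sb.append(" Text may also contain ").append(lang.lang)
                  .append(String.format(" (%.0f%%).", lang.prob * 100));
            }
        }
        // Detection is probabilistic, so the languages below the top one
        // can differ between runs: exactly the consistency problem above.
        return sb.toString();
    }
}
```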

I am stuck at this point and still trying to find a way around it. Any 
suggestions would be helpful though.

The ultimate aim is to find a way to highlight the Spanish part of a given text, the French part, the English part, and any others, etc., but I am not there yet. I will take this step by step.

Original comment by mawa...@live.com on 7 Feb 2011 at 8:24

GoogleCodeExporter commented 9 years ago
Your idea is excellent! But I haven't come up with a way to do it yet...
It may be possible to run detection on each line and assemble the results, but precise detection on short text is difficult...
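A crude sketch of that line-by-line idea, merging consecutive lines that receive the same label into one run (short lines will be noisy, as noted, and the profiles are assumed to be loaded already):

```java
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;

public class LineRuns {
    // Prints runs of consecutive lines that were detected as the same language.
    public static void printRuns(String document) throws LangDetectException {
        String currentLang = null;
        StringBuilder run = new StringBuilder();
        for (String line : document.split("\n")) {
            if (line.trim().isEmpty()) continue;  // skip blank lines
            Detector detector = DetectorFactory.create();
            detector.append(line);
            String lang = detector.detect();
            if (currentLang != null && !lang.equals(currentLang)) {
                System.out.println("[" + currentLang + "]\n" + run);
                run.setLength(0);
            }
            currentLang = lang;
            run.append(line).append('\n');
        }
        if (currentLang != null) System.out.println("[" + currentLang + "]\n" + run);
    }
}
```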

Original comment by nakatani.shuyo on 10 Feb 2011 at 7:00

GoogleCodeExporter commented 9 years ago
Just wanted to share a paper I recently came across that talks a bit about detecting multiple languages in short texts: Hammarström, H. "A Fine-Grained Model for Language Identification." In: Proceedings of Improving Non-English Web Searching (iNEWS’07); 2007, pp. 14-20.
(http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.139.6877&rep=rep1&type=pdf#page=14)

Original comment by saf...@gmail.com on 17 Feb 2011 at 7:26

GoogleCodeExporter commented 9 years ago
Very interesting read. I will spend more hours over the next few weekends looking into the practicality of this method (especially its performance). I will also try to contact the author to see whether any previous work can be shared. I will be sure to share my findings here.
Thanks for sharing.

Original comment by mawa...@live.com on 17 Feb 2011 at 9:54

GoogleCodeExporter commented 9 years ago
Thanks, it is an interesting paper.
I had imagined that detection on very short texts needs large dictionaries of words (not n-grams), but this method uses only around 1,000 words per language.
I wonder how accurate it is on Twitter-like short texts, not single words but around 20 characters.

Original comment by nakatani.shuyo on 18 Feb 2011 at 2:51

GoogleCodeExporter commented 9 years ago
It would be helpful to be able to identify the most likely languages of an entire document. Yes, that would be very helpful.

But such an approach is not sufficient in some text applications.

For example, consider automatic text classification of a web page written in a 
mix of Danish and English. That is, the topic of the text is identified 
automatically. A text classifier for Danish must be applied to the Danish text. 
A text classifier for English must be applied to the English text.

So it is necessary to have a map of the document, with probabilities assigned to each segment. For example, from byte 0 to 1305 the probable language might be Danish (95%) with Swedish (5%), while from byte 1306 to 4000 the probable language might be English (99.9%).

Given this probabilistic language map of the document, an automatic text 
classification application might then determine

   - section 1 [0,1305] of document is in Danish with topic /science/physics
   - section 2 [1306,4000] of document is in English with topic /science/physics

But to do this, a probabilistic language map is necessary. Otherwise the right 
classifiers cannot be targeted at the right segments of the document.
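One way such a map could be represented, as a sketch: per-paragraph detection with the existing Detector API, a hypothetical Segment type, and character offsets standing in for the byte offsets above:

```java
import java.util.ArrayList;
import java.util.List;
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.Language;
import com.cybozu.labs.langdetect.LangDetectException;

public class LanguageMap {
    // Hypothetical segment type: a character span plus the full
    // probability distribution detected for it.
    public static class Segment {
        public final int start, end;                // [start, end) in chars
        public final List<Language> probabilities;  // most likely first
        Segment(int start, int end, List<Language> probabilities) {
            this.start = start;
            this.end = end;
            this.probabilities = probabilities;
        }
    }

    // Builds the map paragraph by paragraph; assumes the profiles were
    // already loaded via DetectorFactory.loadProfile(...).
    public static List<Segment> build(String document) throws LangDetectException {
        List<Segment> map = new ArrayList<>();
        int offset = 0;
        for (String paragraph : document.split("\n\n", -1)) {
            if (!paragraph.trim().isEmpty()) {
                Detector detector = DetectorFactory.create();
                detector.append(paragraph);
                map.add(new Segment(offset, offset + paragraph.length(),
                                    detector.getProbabilities()));
            }
            offset += paragraph.length() + 2;  // skip past the "\n\n" delimiter
        }
        return map;
    }
}
```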

Original comment by sfgo...@gmail.com on 12 Jun 2014 at 12:27