Thanks for your comment.
Do you mean that one document contains multiple languages?
The current langdetect can only estimate probabilities over the whole text, but I
want to be able to detect text containing multiple languages too.
How about splitting the text into paragraphs and running detection on each
paragraph, for the present?
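The paragraph-splitting idea above could be sketched as follows. Note that `detect_langs` here is only a crude stand-in for a real detector (the real langdetect would be used instead); the marker-word heuristic exists purely so the example is self-contained.

```python
import re

def detect_langs(text):
    """Placeholder for a real language detector.
    Returns (language, probability) pairs, most probable first.
    This crude marker-word heuristic is for illustration only."""
    if re.search(r"\b(le|la|et|est)\b", text, re.IGNORECASE):
        return [("fr", 0.9), ("en", 0.1)]
    return [("en", 0.9), ("fr", 0.1)]

def detect_per_paragraph(text):
    """Split on blank lines and run detection on each paragraph separately."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(p, detect_langs(p)) for p in paragraphs]

sample = "The quick brown fox jumps.\n\nLe chat est sur la table."
for para, langs in detect_per_paragraph(sample):
    print(langs[0][0], "-", para)
```

With a real detector plugged in, each paragraph would get its own probability list, at the cost of less reliable detection on short paragraphs.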
Original comment by nakatani.shuyo
on 7 Feb 2011 at 3:17
That's a good idea. However, if you have different languages within each paragraph,
then we're back to square one. I was thinking more of using the top probability
as the main language and listing all the other languages (above the threshold) as
additional languages.
For instance: if you pass in a text containing French (50%), Spanish (30%),
English (15%), and other languages (5%), then the output could be:
"Main language: French. Text may also contain Spanish (rough %) and
English (rough %)."
The only problem here would be consistency. The probability factor may give an
accurate favourite language, but the others may vary depending on the random
number generated. (I.e. in the example above, the list of all languages above the
threshold could be (fr, es, en), (fr, es), or (fr, en)... it all depends on the
random number that has been generated.)
I am stuck at this point and still trying to find a way around it. Any
suggestions would be helpful though.
The ultimate aim is to find a way to highlight the Spanish part of a given
text, the French part, the English part, and any others... etc., but I am not
there yet. I will take this step by step.
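One possible way around the random-number inconsistency described above would be to average the probabilities over several runs before applying the threshold. This is only a sketch: `noisy_detect` is a hypothetical stand-in that mimics a detector whose output fluctuates per run, not langdetect's actual behaviour.

```python
import random

def noisy_detect(text, rng):
    """Hypothetical detector whose probabilities vary slightly per run,
    mimicking the randomness discussed above. Base values match the
    French/Spanish/English example."""
    base = {"fr": 0.50, "es": 0.30, "en": 0.15, "de": 0.05}
    noisy = {lang: max(0.0, p + rng.uniform(-0.05, 0.05))
             for lang, p in base.items()}
    total = sum(noisy.values())
    return {lang: p / total for lang, p in noisy.items()}

def stable_detect(text, runs=20, threshold=0.10, seed=0):
    """Average probabilities over several seeded runs so the list of
    languages above the threshold no longer depends on one random draw."""
    rng = random.Random(seed)
    sums = {}
    for _ in range(runs):
        for lang, p in noisy_detect(text, rng).items():
            sums[lang] = sums.get(lang, 0.0) + p
    avg = {lang: s / runs for lang, s in sums.items()}
    main = max(avg, key=avg.get)
    others = sorted(l for l in avg if l != main and avg[l] >= threshold)
    return main, others

main, others = stable_detect("...")
print("Main language:", main, "| may also contain:", others)
```

Fixing the seed makes the output reproducible, and averaging makes the above-threshold list stable, which addresses the (fr, es, en) vs (fr, es) vs (fr, en) inconsistency.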
Original comment by mawa...@live.com
on 7 Feb 2011 at 8:24
Your idea is excellent! But I can't come up with a way to do it yet...
It may be possible to detect each line and assemble the results, but precise
detection for short text is difficult...
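The detect-each-line-and-assemble idea might look roughly like this. The per-line `guess_lang` is a hypothetical word-list stand-in, and, as noted above, a real detector would be far less reliable on such short inputs.

```python
from itertools import groupby

def guess_lang(line):
    """Hypothetical per-line detector based on a few French marker words;
    a real detector would struggle on lines this short."""
    words = line.lower().split()
    return "fr" if any(w in words for w in ("le", "la", "est")) else "en"

def assemble(lines):
    """Detect each line, then merge consecutive lines that share the
    same guessed language into one chunk."""
    tagged = [(guess_lang(l), l) for l in lines]
    return [(lang, [l for _, l in group])
            for lang, group in groupby(tagged, key=lambda t: t[0])]

lines = ["Hello there.", "How are you?", "Le chat dort.", "La porte est ouverte."]
for lang, chunk in assemble(lines):
    print(lang, chunk)
```

Merging adjacent lines back into chunks partly compensates for per-line noise, since a longer merged chunk could then be re-detected with higher confidence.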
Original comment by nakatani.shuyo
on 10 Feb 2011 at 7:00
Just wanted to share with you a paper I recently came across that talks a bit
about detecting multiple languages in short texts: Hammarstrom H. "A
Fine-Grained Model for Language Identification." In: Proceedings of Improving
Non-English Web Searching (iNEWS'07), 2007, pp. 14-20.
(http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.139.6877&rep=rep1&type=pdf#page=14)
Original comment by saf...@gmail.com
on 17 Feb 2011 at 7:26
Very interesting read. I will spend more hours over the next few weekends to
assess the practicality of this method (especially its performance). I will also
try to contact the author to see whether any previous work can be shared. I will
be sure to share my findings here.
Thanks for sharing.
Original comment by mawa...@live.com
on 17 Feb 2011 at 9:54
Thanks, it is an interesting paper.
I had imagined that very-short-text detection would need large dictionaries of
words (not n-grams), but this method uses only around 1000 words per language.
I wonder how accurate it is for Twitter-like short text of around 20 characters,
rather than a single word.
Original comment by nakatani.shuyo
on 18 Feb 2011 at 2:51
It would be very helpful to be able to identify the most likely languages of an
entire document. But such an approach is not sufficient in some text
applications.
For example, consider automatic text classification of a web page written in a
mix of Danish and English, where the topic of the text is identified
automatically. A text classifier for Danish must be applied to the Danish text,
and a text classifier for English must be applied to the English text.
So it is necessary to have a map of the document, with probabilities assigned
to each segment. For example, from byte 0 to 1305 the probable language might
be Danish (95%) with Swedish (5%), while from byte 1306 to 4000 the probable
language might be English (99.9%).
Given this probabilistic language map of the document, an automatic text
classification application might then determine:
- section 1 [0,1305] of document is in Danish with topic /science/physics
- section 2 [1306,4000] of document is in English with topic /science/physics
But to do this, a probabilistic language map is necessary. Otherwise the right
classifiers cannot be targeted at the right segments of the document.
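A minimal data structure for the probabilistic language map described above might look like this. The segment offsets, probabilities, and classifier names are the hypothetical Danish/English example from the comment, not output from any real system.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One contiguous span of the document with its language probabilities."""
    start: int   # byte offset, inclusive
    end: int     # byte offset, inclusive
    langs: list  # (language, probability) pairs, most probable first

def language_map_example():
    """The hypothetical map for the mixed Danish/English page above."""
    return [
        Segment(0, 1305, [("da", 0.95), ("sv", 0.05)]),
        Segment(1306, 4000, [("en", 0.999)]),
    ]

def classifier_for(segment):
    """Route each segment to the topic classifier matching its top
    language (classifier names are placeholders)."""
    table = {"da": "danish_classifier", "en": "english_classifier"}
    return table.get(segment.langs[0][0], "fallback_classifier")

for seg in language_map_example():
    print(seg.start, seg.end, seg.langs[0][0], "->", classifier_for(seg))
```

With such a map, the right topic classifier can be targeted at the right byte range, exactly as in the /science/physics example above.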
Original comment by sfgo...@gmail.com
on 12 Jun 2014 at 12:27
Original issue reported on code.google.com by
mawando@googlemail.com
on 5 Feb 2011 at 6:17