Open GoogleCodeExporter opened 9 years ago
Another thread already covered short text detection. I would guess that this
language ID is better suited to text longer than 100 to 150 characters. With
anything shorter it may struggle and confuse the text with other languages. In the
summer (when I have a break from uni) I will focus on finding a good solution
for short text (10 to 150 characters). For the time being, how often do you
expect your text to be as short as two words?
Original comment by mawa...@live.com
on 31 Mar 2011 at 9:52
Well, actually, I use this library to detect the language of queries entered
in our search engine. I was using Google's language detection API, but I
found out that its terms of service do not allow use in an enterprise
solution, and your IP is blocked after 1000 uses per day (our search
engine gets more hits than that). I would say the average search phrase is
four words, maybe 15-25 characters.
Original comment by greg.geo...@gmail.com
on 1 Apr 2011 at 1:17
Short text language detection may work better with word-level features (such as
stop words, character set, special words, and so on). I am planning to work on
this in the summer during my break. For the time being, have you considered
combining several queries and then feeding them into the language ID? Or do you not
anticipate this use case? (You would expect several terms to be used in any one
session with a search engine.)
Original comment by mawa...@live.com
on 1 Apr 2011 at 6:56
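
The "combine several queries" idea above could be sketched roughly as follows. This is only an illustration, not part of langdetect: `SessionLanguageBuffer`, `min_chars`, and `max_queries` are hypothetical names, and `detect` stands in for whatever detector function you already call (it is assumed to take a string and return a language code).

```python
from collections import deque
from typing import Callable, Deque, Optional


class SessionLanguageBuffer:
    """Hypothetical helper: collects the queries of one search session and
    runs detection on the joined text once it is long enough."""

    def __init__(self, detect: Callable[[str], str],
                 min_chars: int = 100, max_queries: int = 20) -> None:
        self._detect = detect          # assumed detector: text -> language code
        self._min_chars = min_chars    # assumed threshold, tune for your data
        # Only the most recent queries are kept, so an old session
        # does not dominate the result forever.
        self._queries: Deque[str] = deque(maxlen=max_queries)

    def add(self, query: str) -> Optional[str]:
        """Store a query; return a language code, or None if the combined
        text is still shorter than min_chars."""
        query = query.strip()
        if query:
            self._queries.append(query)
        combined = " ".join(self._queries)
        if len(combined) < self._min_chars:
            return None
        return self._detect(combined)
```

The caller would keep one buffer per session and fall back to a default language while `add` still returns `None`. This only helps if users tend to stay in one language within a session, which is an assumption.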
Detection for short texts was already discussed in Issue 8 (
http://code.google.com/p/language-detection/issues/detail?id=8 ).
The current model of langdetect is not good at short text detection... I'll
announce it in langdetect's wiki.
Original comment by nakatani.shuyo
on 4 Apr 2011 at 7:53
Is there anything new on short text analysis? I would like to replace my
very heuristic solution
(https://github.com/karussell/Jetwick/blob/master/src/main/java/de/jetwick/tw/TweetDetector.java)
with your more mature approach. I'm doing a very simple
analysis based on common noise words of each language (hand-collected DE and
EN, plus Google-translated lists for 5 other languages). One major problem is that the
tokenization is based on whitespace ... but I like the simplicity, as one can
add languages very easily ;)
Original comment by tableYou...@gmail.com
on 25 Nov 2012 at 12:02
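
For readers who have not seen the noise-word approach described above, here is a minimal sketch of the general idea. It is not the code from TweetDetector.java; the word lists are tiny illustrative samples, and `STOP_WORDS`, `guess_language`, and `min_hits` are made-up names.

```python
from typing import Dict, Optional, Set

# Tiny illustrative samples; real lists would be much longer.
STOP_WORDS: Dict[str, Set[str]] = {
    "en": {"the", "and", "is", "of", "to", "in", "for", "with"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "für"},
}


def guess_language(text: str, min_hits: int = 1) -> Optional[str]:
    """Return the language whose stop words match the most tokens,
    or None when there are too few hits or the top two languages tie."""
    # Whitespace tokenization, as in the comment above: simple, but
    # punctuation stays attached to words and unspaced scripts will not work.
    tokens = text.lower().split()
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOP_WORDS.items()
    }
    best = max(scores, key=lambda lang: scores[lang])
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] < min_hits:
        return None
    if len(ranked) > 1 and ranked[0] == ranked[1]:
        return None
    return best
```

Adding a language is just another entry in `STOP_WORDS`, which is the simplicity mentioned above. The cost is that a query containing no stop words at all (for example, two proper nouns) cannot be classified.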
Original issue reported on code.google.com by
greg.geo...@gmail.com
on 31 Mar 2011 at 8:50