saffsd / langid.py

Stand-alone language identification system

wrong detection #40

Closed canhduong28 closed 8 years ago

canhduong28 commented 9 years ago

Hello,

With the English text "Ángel Di María: Louis van Gaal dynamic was why I left Manchester United", the classifier returns ('la', 0.9665266986710674), because "Ángel Di María" is a Latin name.

Is there any way to overcome this situation?

Thanks in advance, Canh

tripleee commented 9 years ago

This is going to be a problem with any short input. If you need high precision for short inputs, maybe you could preprocess it so that proper names are normalized or neutralized one way or another? Then your sample becomes e.g. "A: B dynamic was why I left C". This is a hard topic unto itself, though. (Named Entity Recognition could help but state of the art is nothing like 100% accuracy.)
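A rough sketch of that preprocessing idea, assuming spaCy and its en_core_web_sm model are available (neither is part of langid.py) and that blanking out person, organisation and place names is enough for this use case:

import langid
import spacy

nlp = spacy.load("en_core_web_sm")  # NER model chosen for illustration only

def classify_without_names(text):
    # Replace detected proper names with a neutral placeholder, then classify.
    doc = nlp(text)
    for ent in reversed(doc.ents):  # reversed so character offsets stay valid
        if ent.label_ in ("PERSON", "ORG", "GPE"):
            text = text[:ent.start_char] + "X" + text[ent.end_char:]
    return langid.classify(text)

classify_without_names(
    "Ángel Di María: Louis van Gaal dynamic was why I left Manchester United")

Whether this actually flips the label depends entirely on how well the NER model covers the names in question.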

canhduong28 commented 9 years ago

@tripleee thanks for your comment.

canhduong28 commented 9 years ago

In [3]: text = """Kojima truly has a variety of friends. The game designer took to Twitter today to show off images of him hanging out with Star Wars: The Force Awakens director J.J. Abrams and the astromech droid BB-8. \"JJ, who has been supporting the project for a long time, has also been told that MGS V is complete,\" he said, pictured in a photo with J.J. and a copy of Metal Gear Solid V: The Phantom Pain. ずっと応援してくれていたJJにも「MGSV TPP」完成を報告。 pic.twitter.com/qajeAizxYx — 小島秀夫 (@Kojima_Hideo) August 28, 2015. Known best for his work on the Metal Gear series, Kojima is a veteran game designer who has spent much of his career working on games for Konami. Recent evidence suggests, however, that he may no longer be working for Konami. Read IGN's Metal Gear Solid V review to learn why it earned a score of 10. Cassidee is a freelance writer and the co-host of a podcast about freelancing. You can chat with her about that and all other things geeky on Twitter."""

In [4]: langid.classify(text)
Out[4]: ('la', 0.9999999125266582)

Is it true that Unicode characters carry greater weight than ASCII characters? The text above should be detected as English. Is there any solution for this?

Thanks in advance, Canh
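One mitigation langid.py does offer, if the set of expected languages is known in advance, is to constrain the candidate set so that 'la' is never an option. A minimal sketch (the choice of ['en', 'ja'] here is just an assumption about this feed):

import langid

# Constrain the model to the languages we actually expect to see;
# Latin ('la') then cannot be returned at all.
langid.set_languages(['en', 'ja'])

langid.classify("Kojima truly has a variety of friends. ずっと応援してくれていたJJにも完成を報告。")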

bittlingmayer commented 8 years ago

I've done a quick evaluation and would suggest a few basic improvements:

1) addition of 'und' (undetermined) for when the word is really not in any language
2) give weight to known stop words
3) give weight to punctuation
4) consider that certain characters only occur in certain languages
5) train on translit too, to identify languages like Arabic, Russian, Hindi and Chinese when they are written in the Latin script

langid.classify("haha")
('en', 0.16946150595865334)
langid.classify("!!!")
('en', 0.16946150595865334)
langid.classify("no")
('en', 0.16946150595865334)
langid.classify("no!")
('en', 0.16946150595865334)
langid.classify("¡No!")
('zh', 0.2249412262395412)

The last should of course be Spanish, not Chinese.

langid.classify("yeah haha")
('id', 0.4730470342933074)
langid.classify("ты я меня так что да нет не же")
('bg', 0.6202337036529055)

Note that this is unambiguously Russian, not Bulgarian. Most of the words are unknown in Bulgarian, and one of the characters, 'ы', does not even exist in Bulgarian.

langid.classify("jajaj")
('en', 0.16946150595865334)
langid.classify("jaja")
('en', 0.16946150595865334)

I think Spanish or German would be more reasonable guesses here, under either a word-based or a character-n-gram-based approach.

langid.classify("ty kuda edish' seychas?")
('es', 0.48318636730355763)

This is actually Russian translit.

langid.classify("asdfk94jlskdle")
('en', 0.16946150595865334)

In my opinion, this should return ('und', 0.99).

saffsd commented 8 years ago

@bittlingmayer thanks for the suggestions and the excellent examples. The suggestions are all very sensible, but not very easy to implement in practice. In designing and training langid.py we tried to avoid introducing any hand-crafted features. We did this by using collections of documents in known languages and essentially detecting byte patterns that were characteristic of specific languages. This naturally selects some patterns and stopwords (see list of features). Your suggestions identify several weaknesses in the approach, but unfortunately I don't know of any way to integrate them into the existing method in a way that I am confident would improve performance across the board:

1) There is no easy way to represent or train for the "und" class. It may be possible to learn a per-language threshold, but our early experiments in this led nowhere.
2) It's not easy to determine stopwords across 97 languages.
3) Punctuation is not always easy to detect, and many languages share it. Any punctuation that is particularly characteristic of a language should have been detected in the generation of our feature set.
4) Again, if there is evidence in our training data for certain characters being language-specific, they should have been detected. It may well be the case that our training data contains mislabeled documents. However, in the Russian/Bulgarian example you give, I suspect the character you mention isn't a feature at all, so no information from it is being used in the classification.
5) That's an idea we discussed but never found the time to implement. I also think it is possible, but I don't have a good source of training and test data for it.

Please don't take this as a rejection of your suggestions; I think they are all sensible, and I hope to have provided some context as to why we didn't do some of them. Unfortunately I'm not able to dedicate the time and effort needed to develop any of them with the thoroughness required to make them work. There is clearly some demand for better performance on short input, and that has probably been the biggest shortcoming of langid.py, so perhaps someone else in the research community will take note and develop tools in this area further.

bittlingmayer commented 8 years ago

I understand the healthy impulse to avoid too much hand-crafting, but given that language identification is a relatively static problem, I see value in some sort of definitive dataset against which a system can do a lookup for the top 10,000 (i.e. top 90%) of queries.

1) I would perhaps just make some rules. There are good libs and gists out there to match URLs, email addresses, numbers, emoji, mixed alphanumerics and other types of non-words. Even if it doesn't catch all non-language input, it will be better than now. Another approach is to just return 'und' if no language is sufficiently probable, or sufficiently more probable than the others (see the sketch after this list). (I mean, what does it mean if a decently long sentence has no probability higher than 10%, split roughly evenly between 5 languages that are not related to each other?)

2) This can be done in a way that gives a performance boost even with just a list of stopwords for the top few languages. For such lists, many have used NLTK (e.g. https://gist.github.com/bittlingmayer/ba17969070c2749b478f). It has its strengths but is far from perfect, so I would avoid relying on this approach as more than a confirmation, or penalising languages for which there are no stopword lists. (E.g. we can use it as a tiebreaker between two languages for which we have a list, say en/de/es.)

3) I think punctuation is one of those things worth hand-coding, although I agree the model should handle punctuation and special characters like any alphabetic character. Regarding "¡No!" returning 'zh', I think we need to dig deeper; it's likely a fundamental bug.

4) The issue with summative models (not unique to this lib, of course) is that points are awarded to a language for containing a character, but there is no concept of taking points away. So let's say we have some long string "Xxxx Xxxxxxx xxxxx xx xxxxxx xx Xxxxx." (where X/x are real characters), and we must decide between Swedish and Danish, for a set of words and characters that are very similar in both. From the presence of 'ä' we could know it's not Danish, but under the current regime, in a long string, there is hardly any boost from that; it can easily get lost in the noise.

5) This, I understand, is a bit more of an undertaking. The safest and easiest way I have seen it done is to programmatically produce translit from proper text, as sketched below. (The same goes for producing Latin-alphabet languages like Spanish or German written without accent marks.)
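A minimal sketch of that translit generation (point 5), assuming the unidecode package (not part of langid.py) is an acceptable stand-in for a proper transliteration scheme:

from unidecode import unidecode

# Produce a rough Latin-script version of existing Russian training text,
# which could then be added to the training set under a separate label.
print(unidecode("ты куда едешь сейчас?"))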
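And going back to point 1), a minimal sketch of the 'und' fallback, using langid.py's normalised-probability mode and an arbitrary, untuned cut-off of 0.5:

from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

def classify_or_und(text, min_prob=0.5):
    lang, prob = identifier.classify(text)
    # If even the best guess is not sufficiently probable, punt to 'und'.
    return (lang, prob) if prob >= min_prob else ('und', prob)

classify_or_und("asdfk94jlskdle")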

Re the features, I guess the implicit goal is to compress them a bit to keep the library size small? Because I would expect a string like "Jedenfalls empfehlenswert!" to be classified correctly every time (two long words, both unique to one major language), and to back off to character n-grams only when there is no word match.
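A toy version of that word-lookup-then-back-off idea, with a hand-made table standing in for whatever the training data would actually yield:

import langid

# Hypothetical table of words that each occur in only one (major) language.
DISTINCTIVE_WORDS = {'jedenfalls': 'de', 'empfehlenswert': 'de', 'entonces': 'es'}

def classify_with_lookup(text):
    for token in text.lower().split():
        lang = DISTINCTIVE_WORDS.get(token.strip('¡!¿?.,'))
        if lang:
            return (lang, 1.0)
    return langid.classify(text)  # back off to the character-based model

classify_with_lookup("Jedenfalls empfehlenswert!")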

I have certainly found that training data are frequently contaminated. This often occurs with false equations of country with language (eg .ba -> bs, although I must say langid.py performs well on that normally problematic set of languages).

Overall, I hope you see that I am not suggesting we solve any of these issues 100%, but that we incorporate good-enough solutions in a way that gets some upside with no downside, cheaply.

bittlingmayer commented 8 years ago

As a note, we essentially already have 'und', as far as I can tell.

langid.classify("haha")
('en', 0.16946150595865334)
langid.classify("!!!")
('en', 0.16946150595865334)
langid.classify("no")
('en', 0.16946150595865334)
langid.classify("no!")
('en', 0.16946150595865334)
langid.classify("asdfk94jlskdle")
('en', 0.16946150595865334)

So

def classify_or_und(x):
    result = langid.classify(x)
    return ('und', 0.5) if result == ('en', 0.16946150595865334) else result

:-)