url often generates lang:en on small text

jprante / elasticsearch-langdetect

A plugin for language detection in Elasticsearch using Nakatani Shuyo's language detector

Apache License 2.0


url often generates lang:en on small text #17

Closed juliendangers closed 10 years ago

juliendangers commented 10 years ago

On short text that contains a URL, English is almost always detected.

Example:

An Arabic tweet with a URL:

POST _langdetect?pretty
{
  "query_string": "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\" https://www.facebook.com/dralqarnee/posts/675689432512881"
}

Produces:

{
   "languages": [
      {
         "language": "en",
         "probability": 0.857138346512083
      },
      {
         "language": "ar",
         "probability": 0.14285639031760403
      }
   ]
}

English is detected with a greater probability...

Without the URL:

POST _langdetect?pretty
{
  "query_string": "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\""
}

Produces:

{
   "languages": [
      {
         "language": "ar",
         "probability": 0.5714272046098048
      },
      {
         "language": "so",
         "probability": 0.42857034099037317
      }
   ]
}

English is not even detected!

I can submit a pull request; I've already made the changes on my side.

jprante commented 10 years ago

Yes, the input cannot be processed reliably if the text is either short (single words) or short and mixed. To me the result makes sense: the first text contains the words "facebook" and "posts", while the second contains no English word.

This restriction comes from the underlying language detection module; this plugin cannot change it.

juliendangers commented 10 years ago

Yes, I agree it makes sense that English is detected when the URL is included. But I do not see the point of using URLs in language detection.

I've done the following:

Added a pattern for URLs:

private final static Pattern urlPattern = Pattern.compile("^(https?|ftp)://[^\\s/$.?#].[^\\s]*$",Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);

(not sure Pattern.UNICODE_CHARACTER_CLASS is necessary here)

Replaced

text.replaceAll(word.pattern(), " ")

with

text.replaceAll(word.pattern(), " ").replaceAll(urlPattern.pattern(), " ")

in Detector.detect and Detector.detectAll.
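A note on the snippet as posted, with a minimal standalone sketch. The pattern is anchored with ^ and $, and without Pattern.MULTILINE those anchors only match at the start and end of the whole input, so a URL in the middle of a tweet would probably not be matched. In addition, String.replaceAll(urlPattern.pattern(), " ") recompiles the pattern string, so the flags passed to Pattern.compile are not applied. The sketch below assumes an unanchored variant of the same expression and uses the compiled Pattern directly. UrlStripper and stripUrls are hypothetical names for illustration; they are not part of the plugin or of the underlying detector.

```java
import java.util.regex.Pattern;

public class UrlStripper {
    // Hypothetical helper for illustration only.
    // No ^/$ anchors, so a URL can be matched anywhere in the text.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(?:https?|ftp)://[^\\s/$.?#]\\S*",
            Pattern.CASE_INSENSITIVE);

    // Replaces every URL-like substring with a single space.
    public static String stripUrls(String text) {
        // Going through the compiled Pattern keeps CASE_INSENSITIVE in effect,
        // unlike String.replaceAll(pattern.pattern(), ...).
        return URL_PATTERN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String tweet = "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\" "
                + "https://www.facebook.com/dralqarnee/posts/675689432512881";
        // The printed text no longer contains the facebook.com URL.
        System.out.println(stripUrls(tweet));
    }
}
```

The cleaned string could then be passed to the detector in place of the raw text; whether that changes the detected language for a given input would still need to be checked against the plugin.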

But you're right, this should be done in the underlying language detection module. I'm going to submit a PR to it.

This issue can be closed, don't you think?

jprante commented 10 years ago

I see the point that a URL is not text. But there is a lot of data that is not text, so I think URL/URI is only one example.

For this plugin, I think the most viable approach is to run language detection only on input that has been preprocessed so that it consists of recognizable language.

The most general approach would be part-of-speech (POS) tagging, as in natural language processing / text mining. It would be a good idea to combine a POS tagger with language detection such as this plugin performs.
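One way such client-side preprocessing might look is sketched below: a token filter that drops pieces of a tweet that are unlikely to be natural language before the text is sent to _langdetect. The class name LanguageInputFilter, the method keepLanguageTokens, and the specific rules (URL-like tokens, @mentions, #hashtags, the "RT" marker, tokens without letters) are all assumptions chosen for illustration, not something the plugin provides.

```java
import java.util.ArrayList;
import java.util.List;

public class LanguageInputFilter {
    // Hypothetical pre-filter; the rules below are illustrative assumptions.
    public static String keepLanguageTokens(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;                       // empty input
            if (token.contains("://")) continue;                 // URL-like token
            if (token.startsWith("@") || token.startsWith("#")) continue; // mention or hashtag
            if (token.equals("RT")) continue;                    // retweet marker
            // Keep the token only if it contains at least one letter in any script.
            if (token.codePoints().anyMatch(Character::isLetter)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String tweet = "RT @Dr_alqarnee: \"رمضان شهر الرحمة بالمسلمين\" "
                + "https://www.facebook.com/dralqarnee/posts/675689432512881";
        // Only the quoted Arabic sentence should remain.
        System.out.println(keepLanguageTokens(tweet));
    }
}
```

A whitespace tokenizer like this is crude (it would not separate a URL glued to a word, for example), so it is best read as a starting point rather than a complete cleaner.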