karussell / Jetwick

[not maintained] Custom Twitter Search via ElasticSearch&Wicket

Better Language Detection based on word list #5

Closed karussell closed 13 years ago

karussell commented 13 years ago

At the moment, language detection is based only on noise words (~200 words).
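For illustration, noise-word detection of this kind could look roughly like the sketch below. This is not Jetwick's actual code: the class name `NoiseWordDetector`, the `detect` method, and the tiny word lists are all hypothetical stand-ins for the real ~200-word list. It counts how many tokens of a text appear in each language's noise-word set and picks the language with the most hits.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Jetwick.
final class NoiseWordDetector {
    // Tiny illustrative lists; the approach described above uses ~200 words.
    private static final Map<String, Set<String>> NOISE_WORDS = new LinkedHashMap<>();
    static {
        NOISE_WORDS.put("en", Set.of("the", "and", "is", "of", "to", "in", "that", "it"));
        NOISE_WORDS.put("de", Set.of("der", "die", "das", "und", "ist", "nicht", "ein", "zu"));
        NOISE_WORDS.put("fr", Set.of("le", "la", "les", "et", "est", "un", "une", "pas"));
    }

    /** Returns the language code with the most noise-word hits, or "unknown" if nothing matches. */
    static String detect(String text) {
        // Lowercase and split on anything that is not a letter.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : NOISE_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            // Strictly greater: on a tie, the language listed first wins.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

With so few words per language, short tweets often contain no noise word at all and come back as "unknown", which is the weakness this issue is about.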

pannous commented 13 years ago

Use Google Translate (to English) when indexing. Does it give you the recognized language?

karussell commented 13 years ago

Ah, yes. While indexing I could detect the language and even translate it. But I am really sure that Google will kick me out if I hit their service more than 1 million times per day :-)

pannous commented 13 years ago

True. You could translate in batches though (100 tweets = 1 'paragraph')? "I am really sure" or "I am not really sure"? ;} Or let's just use more than 200 words ;]

karussell commented 13 years ago

"True. You could translate in batches though."

But then I'd have to be sure that all the batched tweets are in the same language ...
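A minimal sketch of the batching idea, assuming a hypothetical `TweetBatcher` helper (not part of Jetwick): it joins up to `batchSize` tweets into one newline-separated 'paragraph'. It makes no call to any translation service and does no language check, so a batch can still mix languages.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jetwick.
final class TweetBatcher {
    /**
     * Splits tweets into consecutive batches of at most batchSize and
     * joins each batch into one newline-separated paragraph.
     */
    static List<String> toParagraphs(List<String> tweets, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> paragraphs = new ArrayList<>();
        for (int start = 0; start < tweets.size(); start += batchSize) {
            int end = Math.min(start + batchSize, tweets.size());
            paragraphs.add(String.join("\n", tweets.subList(start, end)));
        }
        return paragraphs;
    }
}
```

Calling `TweetBatcher.toParagraphs(tweets, 100)` would cut the number of requests by roughly a factor of 100, but a detected language would then apply to the whole paragraph rather than to each tweet.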

What do you mean by: "I am really sure or 'I am not really sure'? ;} Or let's just use more than 200 words ;]"

karussell commented 13 years ago

This is fixed now with an English word list of 2.6k words, which was then translated into various languages. By the way, I also added French and Portuguese.

I'll need to push this to GitHub though ...