nodejitsu / kohai

I am kohai. I am a pluggable irc bot for managing real-time data events.

Better tweet language detection #50

Closed 3rd-Eden closed 12 years ago

3rd-Eden commented 13 years ago

The only language check is on the language of the user who sent the message. But it's still possible for an English account to tweet in foreign languages.

There are 3 solutions for this:

1. Use the Google APIs to check which language the tweet is in (based on the confidence of the detection).
2. Count the number of multi-byte characters in a tweet. Most non-Latin scripts, such as Japanese and Chinese, use characters that take more than one byte in UTF-8, so they can be detected from the byte length of a single char (for example, using the Buffer.byteLength function).
3. Ignore it.

Or am I missing something else here :)
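Option 2 above could be sketched roughly as follows. This is a minimal illustration, not code from kohai: `countMultiByteChars` is a hypothetical helper name, and it assumes the tweet text is already a JavaScript string. Only `Buffer.byteLength` comes from the comment itself.

```javascript
// Hypothetical helper: count the characters in a string that need
// more than one byte when encoded as UTF-8.
function countMultiByteChars(text) {
  let count = 0;
  // for...of iterates by code point, so a surrogate pair counts once.
  for (const ch of text) {
    if (Buffer.byteLength(ch, 'utf8') > 1) {
      count += 1;
    }
  }
  return count;
}

console.log(countMultiByteChars('hello'));
console.log(countMultiByteChars('日本語'));
```

Note that accented Latin letters such as é also take two bytes in UTF-8, so a non-zero count alone does not prove that a tweet is non-English.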

jamesonjlee commented 13 years ago

if it's not an issue with timing, you could check the char code of each character and see if it goes outside the extended ascii (Latin-1) range, i.e. above 255

clients are supposed to handle utf8 text failures (not having the char set) gracefully :3

are utf8 tweets destroying chat rooms?

lrewega commented 13 years ago

@drjackal not destroying chat rooms, no, it just may be nice to filter out non-$language tweets in certain contexts (channels) of the bot, similar to how the rate limiter works. Setting a threshold on the number of non-ASCII characters allowed in a tweet would probably be good at filtering out some non-English languages, if kohai wants to be English-oriented. At least, until a better system is created (e.g. one that actually determines the language).
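A threshold of this kind might look like the sketch below. The function names, the cutoff of 255 (taken from the char-code suggestion earlier in the thread) and the default ratio of 0.2 are all assumptions chosen for illustration; none of them come from kohai.

```javascript
// Hypothetical helper: fraction of code points above a cutoff.
// The cutoff of 0xff (255) is an assumed value, not a kohai setting.
function nonLatinRatio(text, cutoff = 0xff) {
  const chars = Array.from(text); // splits by code point
  if (chars.length === 0) {
    return 0;
  }
  const outside = chars.filter((ch) => ch.codePointAt(0) > cutoff).length;
  return outside / chars.length;
}

// Hypothetical filter: accept a tweet only if the share of characters
// above the cutoff stays at or below maxRatio (0.2 is an arbitrary default).
function passesThreshold(text, maxRatio = 0.2) {
  return nonLatinRatio(text) <= maxRatio;
}

console.log(passesThreshold('hello world'));
console.log(passesThreshold('こんにちは世界'));
```

Using a ratio rather than an absolute count keeps a short tweet containing a single emoji or symbol from being treated the same as one written entirely in another script.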

jamesonjlee commented 13 years ago

@irewega you can add a language filter by adding char-code regions to accept. For CJK you could look at something like this: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
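A region-based check along these lines could be sketched as below. The three ranges are the standard Unicode blocks for Hiragana, Katakana and CJK Unified Ideographs; the names `CJK_RANGES`, `inRanges` and `containsCJK` are hypothetical, and the list is deliberately incomplete (it leaves out, for example, Hangul and the ideograph extension blocks).

```javascript
// Assumed, incomplete list of Unicode blocks to treat as CJK.
const CJK_RANGES = [
  [0x3040, 0x309f], // Hiragana
  [0x30a0, 0x30ff], // Katakana
  [0x4e00, 0x9fff], // CJK Unified Ideographs
];

// True if codePoint falls inside any [low, high] pair (inclusive).
function inRanges(codePoint, ranges) {
  return ranges.some(([low, high]) => codePoint >= low && codePoint <= high);
}

// Hypothetical check: does the text contain at least one character
// from the listed regions?
function containsCJK(text) {
  for (const ch of text) {
    if (inRanges(ch.codePointAt(0), CJK_RANGES)) {
      return true;
    }
  }
  return false;
}

console.log(containsCJK('hello'));
console.log(containsCJK('東京'));
```

The same table can serve either as an accept list or as a reject list, depending on which languages a given channel wants to see.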

AvianFlu commented 12 years ago

Hook.io-twitter now uses https://github.com/FGRibreau/node-language-detect - it's not perfect, but it has reduced the number of non-English tweets quite a bit. Sadly, Google's translate API is no longer running.