Closed 3rd-Eden closed 12 years ago
If timing isn't an issue, you could check the char code of each letter and see whether it falls outside the standard ASCII range (0-127, or up to 255 if you also want to allow Latin-1).
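A rough sketch of that check, assuming 127 as the cutoff (swap in 255 to also let Latin-1 through). `hasNonAscii` is a made-up name, not something from kohai:

```javascript
// Hypothetical helper: true if any UTF-16 code unit is outside 7-bit ASCII.
// ASCII_MAX is an assumption; use 255 to also accept Latin-1 characters.
const ASCII_MAX = 127;

function hasNonAscii(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > ASCII_MAX) return true;
  }
  return false;
}

console.log(hasNonAscii("hello world"));
console.log(hasNonAscii("こんにちは"));
```

This flags a tweet on the first non-ASCII character, so a single emoji or accented letter is enough to trip it.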
Clients are supposed to handle UTF-8 text failures (not having the character set) gracefully :3
Are UTF-8 tweets destroying chat rooms?
@drjackal not destroying chat rooms, no. It just may be nice to filter out non-$language tweets in certain contexts (channels) of the bot, similar to how the rate limiter works. Setting a threshold on the number of non-ASCII characters allowed in a tweet would probably be good at filtering out some non-English languages, if kohai wants to be English-oriented. At least until a better system is created (e.g. one that actually determines the language).
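Something like this could serve as the threshold, as a sketch only. The function name and the 0.3 default are guesses for illustration, not anything kohai ships:

```javascript
// Hypothetical filter: accept a tweet only if the share of non-ASCII
// characters stays at or below maxNonAsciiRatio (0.3 is an arbitrary default).
function passesAsciiThreshold(text, maxNonAsciiRatio = 0.3) {
  const chars = Array.from(text); // split by code point, not UTF-16 unit
  if (chars.length === 0) return true;
  const nonAscii = chars.filter((ch) => ch.codePointAt(0) > 127).length;
  return nonAscii / chars.length <= maxNonAsciiRatio;
}

console.log(passesAsciiThreshold("naïve café"));
console.log(passesAsciiThreshold("こんにちは world"));
```

A ratio rather than an absolute count keeps the occasional accent or emoji from rejecting an otherwise English tweet. The limit could be set per channel.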
@irewega you can add a language filter by adding char-code regions to accept. For CJK you could look at something like this: http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
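A minimal sketch of the region idea, using three well-known Unicode blocks. The list is illustrative and far from complete (no Hangul or CJK extension blocks, for example), and the names are invented here:

```javascript
// Code-point ranges to treat as CJK. An illustrative, non-exhaustive pick.
const CJK_RANGES = [
  [0x3040, 0x309f], // Hiragana
  [0x30a0, 0x30ff], // Katakana
  [0x4e00, 0x9fff], // CJK Unified Ideographs
];

// Hypothetical helper: count the characters of `text` that fall in any range.
function countInRanges(text, ranges) {
  let count = 0;
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (ranges.some(([lo, hi]) => cp >= lo && cp <= hi)) count++;
  }
  return count;
}

console.log(countInRanges("日本語 text", CJK_RANGES));
```

The same helper works either way round: reject tweets with too many characters in the ranges, or accept only tweets whose characters fall inside a permitted set.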
Hook.io-twitter now uses https://github.com/FGRibreau/node-language-detect. It's not perfect, but it has reduced the number of non-English tweets quite a bit. Sadly, Google's Translate API is no longer running.
The only language check is on the user that sends the message, but it's still possible for an English account to tweet in foreign languages. There are 3 solutions for this:
1. Use the Google APIs to check which language the tweet is in (based on the confidence of the detection).
2. Count the number of non-ASCII characters in a tweet, since languages like Japanese and Chinese are written mostly in characters that take several bytes in UTF-8. These characters are easy to detect from the byte length of a single char (for example, using the Buffer.byteLength function).
3. Ignore it.
Or am I missing something else here :)
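For the byte-length option, a sketch along these lines might do. It relies on Node's Buffer.byteLength, and `countMultiByteChars` is just an illustrative name:

```javascript
// Hypothetical helper: count characters that need more than one byte in UTF-8.
// ASCII characters are exactly one byte; everything else is two to four.
function countMultiByteChars(text) {
  let count = 0;
  for (const ch of text) { // for...of iterates by code point
    if (Buffer.byteLength(ch, "utf8") > 1) count++;
  }
  return count;
}

console.log(countMultiByteChars("hello"));
console.log(countMultiByteChars("日本語"));
```

Note that accented Latin letters such as é are also multi-byte in UTF-8, so on its own this separates ASCII from everything else rather than English from Japanese or Chinese.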