miguelfreitas / twister-core

twister core / daemon
MIT License

Split hashtags using utf8 characters #379

Open dryabov opened 8 years ago

dryabov commented 8 years ago

I've just noticed in the top hashtags that Chinese hashtags are not split at the Chinese equivalents of the comma (U+FF0C, \xEF\xBC\x8C in UTF-8) and the full stop (U+3002, \xE3\x80\x82 in UTF-8). Most likely, hashtags should be extracted by treating any code point in Unicode's Punctuation and Separator categories as a break character. Does anybody know how Twitter and other social networks handle this?
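To make the proposal concrete, here is a minimal Python sketch of the idea. It is not twister-core's actual (C++) extraction code. The names `is_break` and `extract_hashtags` are hypothetical, and keeping the underscore as part of a hashtag is an assumption of this sketch (it is in the Connector Punctuation category and would otherwise break the tag).

```python
import unicodedata

# Assumption: characters allowed inside a hashtag even though they are punctuation.
KEEP = {"_"}


def is_break(ch):
    """Return True if ch should terminate a hashtag.

    A character breaks a hashtag if it is whitespace or if its Unicode
    general category is Punctuation (P*) or Separator (Z*).
    """
    if ch in KEEP:
        return False
    return ch.isspace() or unicodedata.category(ch)[0] in ("P", "Z")


def extract_hashtags(text):
    """Return the hashtags found in text, without the leading '#'."""
    tags = []
    i, n = 0, len(text)
    while i < n:
        if text[i] == "#":
            j = i + 1
            # Consume characters until the next break character.
            while j < n and not is_break(text[j]):
                j += 1
            if j > i + 1:  # skip a bare '#'
                tags.append(text[i + 1:j])
            i = j  # '#' is itself punctuation, so "#a#b" yields two tags
        else:
            i += 1
    return tags


# The two characters from the report, decoded from their UTF-8 bytes.
comma = b"\xef\xbc\x8c".decode("utf-8")      # U+FF0C FULLWIDTH COMMA
full_stop = b"\xe3\x80\x82".decode("utf-8")  # U+3002 IDEOGRAPHIC FULL STOP
print(extract_hashtags("#你好" + comma + "世界" + full_stop + " #twister_core rocks"))
```

Both reported characters fall in the Other Punctuation category (`Po`), so the first hashtag ends at the fullwidth comma instead of swallowing the rest of the sentence. Symbols such as `$` or `+` are in the Symbol categories and are not treated as breaks here; whether they should be is a separate decision.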

miguelfreitas commented 8 years ago

With US$ 3.50 billion in cash, I doubt they wouldn't have noticed such a thing ;-)