I've just noticed in the top hashtags that Chinese hashtags are not split at the Chinese equivalents of the comma (U+FF0C, \xEF\xBC\x8C in UTF-8) and the full stop (U+3002, \xE3\x80\x82 in UTF-8). Most likely, hashtags should be extracted by treating any code point in Unicode's Punctuation and Separator general categories as a break character. Does anybody know how Twitter and other social networks handle this?
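To make the idea concrete, here is a rough sketch in Python of what I mean by breaking on Unicode categories. The function names (`is_break`, `extract_hashtags`) are made up for illustration, and exempting the underscore is my own assumption. This is not how Twitter or any other network actually does it, just one possible reading of the rule above.

```python
import unicodedata
from typing import List


def is_break(ch: str) -> bool:
    """Return True if ch should terminate a hashtag.

    Assumption: whitespace and any code point whose Unicode general
    category starts with P (Punctuation) or Z (Separator) is a break.
    The underscore is exempted because it is commonly used inside tags.
    """
    if ch == "_":
        return False
    if ch.isspace():
        return True
    return unicodedata.category(ch)[0] in ("P", "Z")


def extract_hashtags(text: str) -> List[str]:
    """Collect hashtag bodies (without the leading '#') from text."""
    tags = []
    i, n = 0, len(text)
    while i < n:
        if text[i] == "#":
            # Scan forward until the first break character.
            j = i + 1
            while j < n and not is_break(text[j]):
                j += 1
            if j > i + 1:  # skip a bare '#'
                tags.append(text[i + 1:j])
            i = j
        else:
            i += 1
    return tags


if __name__ == "__main__":
    # U+FF0C (fullwidth comma) and U+3002 (ideographic full stop)
    # are both in category Po, so they end the tag.
    print(extract_hashtags("#北京，今天天气好。#上海 ok"))
```

With this rule the sample string yields `['北京', '上海']`, because both U+FF0C and U+3002 fall in the Po (other punctuation) category. Symbol categories (S*) are not treated as breaks here, which may or may not be what you want.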