rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

Leave Thai segments alone in the default regex #32

Closed by rspeer 8 years ago

rspeer commented 8 years ago

Our regex already has a special case that leaves Chinese and Japanese text alone when no appropriate tokenizer for the language is in use, because Unicode's default segmentation would turn every character into its own token.

The same thing happens in Thai, and we don't even have an appropriate tokenizer for Thai, so I've added a similar fallback.
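To illustrate the idea of the fallback, here is a minimal sketch using only the standard-library `re` module. It is not the regex from this PR: the pattern names, the `simple_tokenize` helper, and the hard-coded Thai block range (U+0E00 to U+0E7F) are all assumptions made for the example. It keeps each contiguous run of Thai characters as one token instead of splitting it further.

```python
import re

# Hypothetical, simplified patterns for illustration only;
# wordfreq's actual tokenizer regex is more involved.
THAI_RUN = r"[\u0E00-\u0E7F]+"           # a whole run of Thai-block characters
OTHER_WORD = r"[^\W\u0E00-\u0E7F]+"      # word characters that are not Thai

# Try the Thai alternative first so a Thai segment is kept intact.
TOKEN_RE = re.compile(f"{THAI_RUN}|{OTHER_WORD}")


def simple_tokenize(text):
    """Split text into tokens, leaving Thai segments unsegmented."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(simple_tokenize("hello สวัสดีครับ world"))
```

With this sketch, a Thai segment passes through as a single token, so a downstream Thai-aware tokenizer (if one is added later) could still process it.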

alin-luminoso commented 8 years ago

LGTM aside from one nitpick.