rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

wordfreq 1.4: some bigger wordlists, better use of language detection #34

Closed rspeer closed 8 years ago

rspeer commented 8 years ago

Here are the major changes I've made in this branch:

alin-luminoso commented 8 years ago

So we do expect a science output change from this, right? Do we need to benchmark it?

rspeer commented 8 years ago

You said that there were some small changes in Russian. Can you give me an example of something that changed? The top 1000 words appear in the same order as in 1.3, and I checked a couple and found that their frequencies are equal as well.
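
A check like the one described above (same order in the top 1000, same frequencies) could be sketched roughly as follows. This is only an illustration: `compare_wordlists` is a hypothetical helper, not part of wordfreq, and it assumes each version's wordlist has already been exported as a ranked list of `(word, frequency)` pairs.

```python
import math


def compare_wordlists(old, new, top_n=1000, rel_tol=1e-9):
    """Compare two ranked lists of (word, frequency) pairs.

    Hypothetical helper for illustration; not a wordfreq function.
    Returns (same_order, changed), where `changed` lists the words in the
    old top-N whose frequency is missing or different in the new list.
    """
    old_top = old[:top_n]
    new_top = new[:top_n]

    # Are the top-N words identical and in the same rank order?
    same_order = [w for w, _ in old_top] == [w for w, _ in new_top]

    # Look up new frequencies by word, then report any that moved.
    new_freqs = dict(new)
    changed = [
        (word, freq, new_freqs.get(word))
        for word, freq in old_top
        if word not in new_freqs
        or not math.isclose(freq, new_freqs[word], rel_tol=rel_tol)
    ]
    return same_order, changed


# Toy data, invented for the example (not real wordfreq values).
old_list = [("и", 0.03), ("в", 0.02), ("не", 0.01)]
new_list = [("и", 0.03), ("в", 0.021), ("не", 0.01)]
same_order, changed = compare_wordlists(old_list, new_list)
```

With the toy data, the order is unchanged but one frequency differs, which is the kind of small change that a spot check of a couple of words can miss.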

rspeer commented 8 years ago

Okay, the changes come from disregarding Tweets that are shorter than 50 characters and detected as non-English.
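
Read literally, the rule above could look something like the sketch below. It is an assumption-laden illustration, not the code in this branch: `keep_tweet` is a hypothetical name, the language code is assumed to come from whatever language detector the build uses, and the 50-character threshold is taken from the comment.

```python
def keep_tweet(text, detected_lang, min_length=50):
    """Decide whether a tweet should be counted.

    Hypothetical sketch: a tweet is disregarded only when it is BOTH
    shorter than `min_length` characters AND detected as non-English.
    `detected_lang` is assumed to be a language code such as "en" or "ru"
    supplied by an external language detector.
    """
    too_short = len(text) < min_length
    non_english = detected_lang != "en"
    return not (too_short and non_english)


# Invented sample data: (text, detected language code).
tweets = [
    ("привет", "ru"),    # short and non-English: disregarded
    ("hello", "en"),     # short but English: kept
    ("я" * 60, "ru"),    # non-English but long enough: kept
]
kept = [text for text, lang in tweets if keep_tweet(text, lang)]
```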