rspeer closed 8 years ago
So we do expect a science output change from this, right? Do we need to benchmark it?
You said that there were some small changes in Russian. Can you give me an example of something that changed? The top 1000 words appear in the same order as in 1.3, and I spot-checked a couple and found that their frequencies are equal as well.
Okay, the changes come from disregarding Tweets that are shorter than 50 characters and detected as non-English.
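For anyone reading along, here is a minimal sketch of the kind of filter described above. It is not the code from this branch: `keep_tweet`, `filter_tweets`, and the `detect` callable are hypothetical names, and the sketch assumes the language detector returns a code such as `"en"` or `"ru"`. Only the 50-character threshold comes from the comment above.

```python
from typing import Callable, Iterable, Iterator

# Threshold mentioned in the thread; everything else here is an assumption.
MIN_LENGTH = 50


def keep_tweet(text: str, detected_lang: str, min_length: int = MIN_LENGTH) -> bool:
    """Return False only for tweets that are both short and non-English."""
    is_short = len(text) < min_length
    is_non_english = detected_lang != "en"
    return not (is_short and is_non_english)


def filter_tweets(
    tweets: Iterable[str],
    detect: Callable[[str], str],
) -> Iterator[str]:
    """Yield the tweets that survive the filter.

    `detect` is a hypothetical stand-in for whatever language detector
    the pipeline uses; it maps a tweet's text to a language code.
    """
    for text in tweets:
        if keep_tweet(text, detect(text)):
            yield text
```

With a filter like this, a short tweet detected as Russian would be dropped while a longer one is kept, which would be consistent with small frequency shifts that leave the top of the list untouched.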
Here are the major changes I've made in this branch: