This version is a minor version bump, because it updates the data while leaving the API the same. Significant changes in this data include:
Added ParaCrawl, a multilingual Web crawl, as a data source. This supplements the Leeds Web crawl with more modern data.
ParaCrawl seems to provide a more balanced sample of Web pages than Common Crawl. We once considered adding Common Crawl, but found that its data heavily overrepresented TripAdvisor and Urban Dictionary in a way that was very apparent in the word frequencies.
ParaCrawl has a fairly subtle impact on the top terms, mostly boosting the frequencies of numbers and months. As with other data sources, the code for processing it and obtaining its word frequencies doesn't appear here, but in exquisite-corpus.
Fixes to inconsistencies where words from different sources were going through different processing steps. As a result of these inconsistencies, some word lists contained words that couldn't actually be looked up because they would be normalized to something else.
All words should now go through the aggressive normalization of lossy_tokenize.
Fixes to inconsistencies regarding what counts as a word. Non-punctuation, non-emoji symbols such as = were slipping through in some cases but not others.
As a result of the new data, Latvian becomes a supported language and Czech (this surprised me too) gets promoted to a 'large' language.
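To make the normalization fix above concrete, here is a minimal sketch of the kind of check that exposes the problem: any entry in a word list that is not a fixed point of the normalizer can never be found by a lookup that normalizes its query first. The `normalize` function is a stand-in (NFKC plus case folding) and is not the actual behavior of `lossy_tokenize`; the names `unreachable_entries` and `toy_wordlist` are hypothetical and exist only for illustration.

```python
import unicodedata


def normalize(word: str) -> str:
    """Stand-in normalizer (an assumption, NOT wordfreq's lossy_tokenize):
    apply NFKC normalization, then case folding."""
    return unicodedata.normalize("NFKC", word).casefold()


def unreachable_entries(wordlist: dict) -> list:
    """Return the keys that a normalizing lookup could never hit, because
    normalizing the query would turn it into a different string."""
    return [word for word in wordlist if normalize(word) != word]


# A toy frequency table whose entries went through inconsistent processing.
# "\ufb01le" is "file" spelled with the single-character "fi" ligature.
toy_wordlist = {"paris": 1e-4, "Paris": 2e-5, "\ufb01le": 1e-6, "file": 3e-5}

# Lists the capitalized entry and the ligature entry.
print(unreachable_entries(toy_wordlist))
```

Running every source through the same normalizer before counting, as this release does, should leave such a check with nothing to report.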
As a non-data change, the tests are now ported to pytest.
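As an illustration of the "what counts as a word" fix, written in the plain-assert style that pytest collects, the sketch below shows why a symbol such as = can slip past a filter that only removes punctuation: its Unicode general category is Sm (math symbol), not one of the P* punctuation categories. The `is_wordlike` rule and the test names are hypothetical, are not taken from the project's test suite, and deliberately ignore emoji, which the real tokenizer treats separately.

```python
import unicodedata


def is_wordlike(token: str) -> bool:
    """Hypothetical rule (an assumption for illustration): a token counts as
    a word only if it contains at least one letter or digit, i.e. a character
    whose Unicode general category starts with L or N."""
    return any(unicodedata.category(ch)[0] in ("L", "N") for ch in token)


def test_symbols_are_not_words():
    # '=' is category Sm (math symbol), so it is neither punctuation (P*)
    # nor a letter; a punctuation-only filter would let it through.
    assert unicodedata.category("=") == "Sm"
    assert not unicodedata.category("=").startswith("P")
    assert not is_wordlike("=")


def test_ordinary_tokens_are_words():
    # Letters and digits both qualify under the hypothetical rule.
    assert is_wordlike("word")
    assert is_wordlike("2018")
```

Saved in a file whose name starts with `test_`, these functions would be discovered and run by the `pytest` command without any extra boilerplate.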
Wasn't sure where to leave this, but figured you'd see it here: there is a lingering code review note from the 2018-05-18 code review that CHANGELOG.md contains the typo "betwen".