piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

Suggestion: parallel corpora #7

Open piskvorky opened 6 years ago

piskvorky commented 6 years ago

Aligned corpora for many European language pairs (cs-en, ru-fr, …): http://opus.nlpl.eu/News-Commentary.php

About 200M tokens in total. Typically used for translation systems, but maybe useful to include here as well (for applications à la translation matrix embeddings)?

menshikh-iv commented 6 years ago

So, many of the links are "dead" (for News-Commentary9.1.tar.gz and the first matrix); I'll update this with results. UPD: Better to use the root link, http://opus.nlpl.eu/. It's useful, but we need to convert the data to a simpler format.
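
The thread does not say what the "simpler format" would be, so the following is only a rough sketch under assumptions: it assumes the corpus has already been downloaded as two line-aligned plain-text files (line N of one file is the translation of line N of the other) and writes one JSON object per sentence pair. The function names, the `src`/`tgt` keys, and the file names in the usage comment are all hypothetical, not part of gensim-data or OPUS.

```python
import json
from itertools import zip_longest


def pair_aligned_lines(src_lines, tgt_lines):
    """Yield {"src": ..., "tgt": ...} for each aligned, non-empty sentence pair.

    Assumes both iterables are line-aligned; raises if their lengths differ.
    """
    for i, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines)):
        if src is None or tgt is None:
            raise ValueError(f"inputs differ in length at line {i + 1}")
        src, tgt = src.strip(), tgt.strip()
        # Skip pairs where either side is blank after stripping whitespace.
        if src and tgt:
            yield {"src": src, "tgt": tgt}


def write_jsonl(pairs, out_path):
    """Write each pair as one JSON line (UTF-8); return the number written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for pair in pairs:
            out.write(json.dumps(pair, ensure_ascii=False) + "\n")
            count += 1
    return count


# Usage sketch (hypothetical file names, not actual OPUS paths):
#
# with open("corpus.cs", encoding="utf-8") as cs, \
#      open("corpus.en", encoding="utf-8") as en:
#     n = write_jsonl(pair_aligned_lines(cs, en), "corpus.cs-en.jsonl")
```

A single JSON Lines file keeps each sentence pair together on one line, so it can be streamed without keeping two files in lockstep; whether that is the right target format for this repository is an open question.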