Many of the links are dead (for News-Commentary9.1.tar.gz and the first matrix); I'll update this with results.
UPD: Better to use the root link, http://opus.nlpl.eu/. It's useful, but the data needs to be converted to a simpler format.
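For the conversion step, here is a minimal sketch, assuming the corpus has been downloaded as two line-aligned plain-text files (one sentence per line, same line count in both) and that the "simpler format" is one tab-separated sentence pair per line. Both of those are my assumptions, and the helper names `pair_lines` and `write_tsv` are hypothetical.

```python
import io
from typing import Iterable, Iterator, Tuple


def pair_lines(src_lines: Iterable[str],
               tgt_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned inputs.

    Pairs where either side is empty are skipped. Tabs inside a sentence
    are replaced with spaces so they cannot clash with the separator.
    """
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            yield src.replace("\t", " "), tgt.replace("\t", " ")


def write_tsv(pairs: Iterable[Tuple[str, str]], out: io.TextIOBase) -> int:
    """Write one 'source<TAB>target' line per pair; return the pair count."""
    count = 0
    for src, tgt in pairs:
        out.write(f"{src}\t{tgt}\n")
        count += 1
    return count
```

With real files this could be called as `write_tsv(pair_lines(open("corpus.en", encoding="utf-8"), open("corpus.fr", encoding="utf-8")), open("corpus.en-fr.tsv", "w", encoding="utf-8"))`, where the three file names are placeholders. Note that `zip` stops at the shorter input, so a line-count mismatch would be silently truncated; checking the counts first is probably worthwhile.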
Aligned corpora for many European language pairs (cs-en, ru-fr, …): http://opus.nlpl.eu/News-Commentary.php
About 200M tokens in total. Typically used for translation systems, but maybe useful to include as well (for applications à la translation-matrix embeddings)?