thammegowda / mtdata

A tool that locates, downloads, and extracts machine translation corpora
https://pypi.org/project/mtdata/
Apache License 2.0

Fixed a bug in KECL JParaCrawl v3 extraction used in WMT22 en-ja translation task #117

Closed by de9uch1 2 years ago

de9uch1 commented 2 years ago

Fixed a bug in the extraction of JParaCrawl v3, which is used in the WMT22 en-ja translation task. The minor version has been bumped because the index cache needs to be updated.
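As a rough illustration of why a minor version bump forces the index cache to refresh, here is a minimal sketch of a version-keyed cache. It is not mtdata's actual code: `load_index`, `minor_version`, the JSON layout, and the `build_index` callback are all hypothetical names, and the sketch assumes the cache is keyed on the `major.minor` part of the version.

```python
import json
from pathlib import Path
from typing import Callable, List


def minor_version(version: str) -> str:
    """Return the 'major.minor' prefix of a version string (hypothetical helper)."""
    return ".".join(version.split(".")[:2])


def load_index(cache_path: Path, version: str,
               build_index: Callable[[], List[str]]) -> List[str]:
    """Reuse the cached index only if it was built by the same minor version.

    Assumption: the cache file is JSON with 'version' and 'entries' keys.
    """
    key = minor_version(version)
    if cache_path.exists():
        cached = json.loads(cache_path.read_text(encoding="utf-8"))
        if cached.get("version") == key:
            return cached["entries"]  # cache is still valid
    # Cache is missing or stale: rebuild and overwrite it.
    entries = build_index()
    cache_path.write_text(
        json.dumps({"version": key, "entries": entries}), encoding="utf-8"
    )
    return entries
```

Under this assumption, a patch-level release would keep serving the stale entry for a fixed dataset, whereas a minor bump changes the key and triggers a rebuild.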

de9uch1 commented 2 years ago

Note that JParaCrawl uses a different format for v2 and v3, so the TSV columns that should be extracted have changed.
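To make the column change concrete, here is a small sketch of version-dependent TSV extraction. The column indices in `HYPOTHETICAL_COLS` are placeholders chosen for illustration only (this thread does not state the real ones), and `extract_pairs` is a hypothetical helper, not a function from mtdata.

```python
from typing import Iterable, Iterator, Tuple

# Placeholder layouts: which 0-based TSV columns hold the English and
# Japanese text. These indices are assumptions, not the real JParaCrawl spec.
HYPOTHETICAL_COLS = {
    "v2": (2, 3),
    "v3": (3, 4),
}


def extract_pairs(lines: Iterable[str],
                  cols: Tuple[int, int]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from tab-separated lines."""
    src_col, tgt_col = cols
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) <= max(src_col, tgt_col):
            continue  # skip rows that lack the expected columns
        yield row[src_col].strip(), row[tgt_col].strip()
```

With a layout table like this, reading v3 data with the v2 indices would silently yield the wrong fields (for example, a score column instead of a sentence), which is the kind of bug a per-version column setting avoids.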

thammegowda commented 2 years ago

Thanks @de9uch1 for this PR!

I hope you don't mind me taking these changes into the develop branch first and releasing a new version along with a few other changes.