thammegowda / mtdata

A tool that locates, downloads, and extracts machine translation corpora
https://pypi.org/project/mtdata/
Apache License 2.0

Parallel Corpora for 6 Indian Languages #107

Open kpu opened 2 years ago

kpu commented 2 years ago

http://catalog.elra.info/en-us/repository/browse/ELRA-W0320/#

CC-BY-SA-3.0

Not sure why there isn't a download link on the main page. I guess somebody needs to log in with an ELRA account, download it, and rehost it.

mjpost commented 2 years ago

I believe this is the data that we released in this paper? In that case, there is a more direct link. I'm not sure why ELRA has appropriated it with no mention or citation.

That said, the data was translated into English by English L2 speakers. The quality isn't great, though it might serve for translating out of English.

thammegowda commented 2 years ago

I wonder why ELRA didn't mention or cite the paper! The description looks very similar to the one in the paper. BTW, we have already added the joshua-decoder/indian-parallel-corpora corpus (see mtdata list -id -g JoshuaDec). https://github.com/thammegowda/mtdata/blob/b1c0b21d3b58c0053b3a6fa669158f71b0c7f0a7/mtdata/index/joshua_indian.py#L11