thammegowda / mtdata

A tool that locates, downloads, and extracts machine translation corpora
https://pypi.org/project/mtdata/
Apache License 2.0

Not all available `ELRC` datasets are downloaded from OPUS #132

Closed: ZenBel closed this issue 2 years ago

ZenBel commented 2 years ago

Hi all,

I have refreshed opus_index.tsv using the following commands as specified in opus_index.py:

$ curl "https://opus.nlpl.eu/opusapi/?preprocessing=moses" > opus_all.json
$ cat opus_all.json | jq -r '.corpora[] | [.corpus, .version, .source, .target] | @tsv' | sort > opus_all.tsv

and then ran `mv opus_all.tsv opus_index.tsv`. The new opus_index.tsv contains additional ELRC datasets, such as `ELRC-3083-wikipedia_health v1 ar en`.

However, when I run `mtdata list -l eng-ara`, this new dataset is not listed. Any hint as to why this is the case?

Thanks in advance,

Z

P.S. The dataset is actually pulled by elrc_share.py, but adding it to opus_index.tsv should still allow opus_index.py to pull it as well.
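For anyone without jq installed, the jq step in the commands above can be roughly sketched in Python. This is only an illustration, not code from mtdata: the function name `opus_json_to_tsv` is hypothetical, the second sample entry is made up, and Python's `sorted` orders by code point, which may differ from the locale-aware `sort` command.

```python
import json
from typing import List


def opus_json_to_tsv(payload: dict) -> List[str]:
    """Turn a parsed OPUS API response into sorted TSV rows.

    Assumes the layout implied by the jq filter: a top-level "corpora"
    list whose items have corpus, version, source and target keys.
    """
    rows = []
    for item in payload.get("corpora", []):
        fields = [item.get(k) for k in ("corpus", "version", "source", "target")]
        # jq's @tsv renders null as an empty field; mirror that here.
        rows.append("\t".join("" if f is None else str(f) for f in fields))
    return sorted(rows)


# Tiny in-memory sample; the second entry is invented for illustration.
sample = json.loads("""
{"corpora": [
  {"corpus": "Example-corpus", "version": "v2", "source": "de", "target": "en"},
  {"corpus": "ELRC-3083-wikipedia_health", "version": "v1", "source": "ar", "target": "en"}
]}
""")

for row in opus_json_to_tsv(sample):
    print(row)
```

To apply it to the real download, load `opus_all.json` with `json.load` and write the returned rows to `opus_index.tsv`, one per line.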

thammegowda commented 2 years ago

Please pass the `--reindex` flag: https://github.com/thammegowda/mtdata/blob/72df36d1eb3a8295db273892c29b11cd11ef9b4b/mtdata/main.py#L129

Usually, the reindex happens automatically when we update the version number. Since we didn't change the version number here, we have to force a reindex.
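The general idea of a version-keyed cache with a forced refresh can be sketched as follows. This is a generic illustration, not mtdata's actual implementation: `needs_reindex`, `write_cache`, the JSON cache layout, and the version strings are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def needs_reindex(cache_file: Path, current_version: str, force: bool = False) -> bool:
    """Return True when the cached index should be rebuilt."""
    if force or not cache_file.exists():
        return True
    cached = json.loads(cache_file.read_text())
    # A cache written by a different version is treated as stale.
    return cached.get("version") != current_version


def write_cache(cache_file: Path, version: str, entries: list) -> None:
    """Store the index entries together with the version that produced them."""
    cache_file.write_text(json.dumps({"version": version, "entries": entries}))


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "index.json"
    write_cache(cache, "1.0", ["old-entry"])
    # Same version and no force flag: the stale cache is reused, which is
    # why edits to the underlying TSV go unnoticed.
    print(needs_reindex(cache, "1.0"))
    # Forcing the refresh rebuilds the index regardless of the version.
    print(needs_reindex(cache, "1.0", force=True))
```

Under this scheme, editing the source data without bumping the version leaves the cache looking valid, so an explicit force option is the only way to pick up the change.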

ZenBel commented 2 years ago

Oh, I missed that!

Thanks for pointing it out. Running `mtdata -ri list -l eng-ara` fixes the problem.