thammegowda / mtdata

A tool that locates, downloads, and extracts machine translation corpora
https://pypi.org/project/mtdata/
Apache License 2.0

Add `allenai/nllb` dataset #133

Closed (ZenBel closed this issue 1 year ago)

ZenBel commented 1 year ago

Link: https://huggingface.co/datasets/allenai/nllb

AlexUmnov commented 1 year ago

@thammegowda

So the options are: 1) include the HF datasets library as a dependency (which is quite large), or 2) reverse engineer the direct download links to the dataset files.

Alternatively, they also provide an option to fetch it through git-lfs, and there's a library for that: https://pypi.org/project/git-lfs/. What do you think?

thammegowda commented 1 year ago

I think (2), reverse engineering the links and adding them to mtdata, would be preferred; that way we don't have to include all the dependencies of HF datasets. If (2) is not feasible or too complicated, we can consider (1).
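For illustration, option (2) could look roughly like the sketch below. It relies on the `resolve` URL pattern that the Hugging Face Hub uses for direct file downloads. The helper name `hf_dataset_file_url` is hypothetical and not part of mtdata, and `README.md` only stands in for a real data file, since the actual file layout of `allenai/nllb` is not given in this thread and would have to be looked up in the repository.

```python
from urllib.parse import quote

HF_BASE = "https://huggingface.co"


def hf_dataset_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for one file in a Hugging Face dataset repo.

    Hypothetical helper for illustration; not an mtdata or Hugging Face API.
    """
    # Slashes in the file path are kept; other unsafe characters are escaped.
    safe_path = quote(filename)
    # A branch name may contain slashes, so escape those in the revision too.
    safe_rev = quote(revision, safe="")
    return f"{HF_BASE}/datasets/{repo_id}/resolve/{safe_rev}/{safe_path}"


# Placeholder file name: the real data file paths must be taken from the repo.
print(hf_dataset_file_url("allenai/nllb", "README.md"))
```

URLs built this way are plain HTTPS links, so they could be listed like any other download link without pulling in the HF datasets package. Whether this is feasible depends on how the data files of `allenai/nllb` are actually organized.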