gordicaleksa / Open-NLLB

Effort to open-source NLLB checkpoints.
MIT License

download allenai nllb mined bitext #25

Open vienneraphael opened 1 year ago

vienneraphael commented 1 year ago

Partially closes https://github.com/gordicaleksa/Open-NLLB/issues/11 (the analysis still needs to be added). Allows downloading language pairs from the allenai NLLB dataset (Hugging Face dataset): https://huggingface.co/datasets/allenai/nllb/tree/main

I've stored NLLB_PAIRS (pairs released with the NLLB paper) and CCMATRIX_PAIRS (pairs for which the NLLB paper reused the earlier CCMatrix dataset) in a separate Python file and import them into the main script, because both lists are rather big and would hurt code readability.

Resulting tree structure with flag --minimal (debug mode) enabled:

downloads/nllb/
├── ace_Latn-ban_Latn
│   ├── allenai.nllb.ace_Latn
│   └── allenai.nllb.ban_Latn
└── amh_Ethi-nus_Latn
    ├── allenai.nllb.amh_Ethi
    └── allenai.nllb.nus_Latn
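
For reference, the layout above can be expressed as a small path helper. This is only a sketch of the naming convention visible in the tree; the function name `pair_paths` is hypothetical and not taken from the PR's script.

```python
from pathlib import Path


def pair_paths(root, src, tgt):
    """Return {lang: file path} for one language pair.

    Hypothetical helper: it only mirrors the tree shown above, i.e.
    <root>/<src>-<tgt>/allenai.nllb.<lang> for each side of the pair.
    """
    pair_dir = Path(root) / f"{src}-{tgt}"
    return {lang: pair_dir / f"allenai.nllb.{lang}" for lang in (src, tgt)}


if __name__ == "__main__":
    for lang, path in pair_paths("downloads/nllb", "ace_Latn", "ban_Latn").items():
        print(lang, path.as_posix())
```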

I plan to add the analysis as the next step for this PR.

vienneraphael commented 1 year ago

Looking great!

Can you please test on a couple of directions?

I had issues with Serbian for example, I think I tried eng_Latn-srp_Cyrl or Latn.

I tried eng_Latn-srp_Cyrl myself and it worked (though I had to stop the program because it was really too much data). Still, I can download it!
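
For testing large directions like this one without pulling everything, one option is to cap the number of sentence pairs written. The sketch below is not the code in this PR: `write_bitext` and `max_pairs` are hypothetical names, and it assumes each record looks like `{"translation": {src: ..., tgt: ...}}`. The records here are fake; with the `datasets` library they might come from something like `load_dataset("allenai/nllb", "eng_Latn-srp_Cyrl", split="train", streaming=True)`, which I have not verified for every `datasets` version.

```python
import itertools
import tempfile
from pathlib import Path


def write_bitext(records, src, tgt, out_dir, max_pairs=None):
    """Write up to max_pairs sentence pairs to two line-aligned files.

    Assumption: each record is a dict with a "translation" mapping that
    holds one sentence per language code. Returns the number of pairs written.
    """
    pair_dir = Path(out_dir) / f"{src}-{tgt}"
    pair_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(pair_dir / f"allenai.nllb.{src}", "w", encoding="utf-8") as f_src, \
         open(pair_dir / f"allenai.nllb.{tgt}", "w", encoding="utf-8") as f_tgt:
        # islice with stop=None iterates over everything.
        for record in itertools.islice(records, max_pairs):
            translation = record["translation"]
            # Keep one sentence per line so both files stay aligned.
            f_src.write(translation[src].replace("\n", " ") + "\n")
            f_tgt.write(translation[tgt].replace("\n", " ") + "\n")
            written += 1
    return written


if __name__ == "__main__":
    # Fake records standing in for a streamed dataset.
    fake = ({"translation": {"eng_Latn": f"en {i}", "srp_Cyrl": f"sr {i}"}}
            for i in range(1000))
    with tempfile.TemporaryDirectory() as tmp:
        print(write_bitext(fake, "eng_Latn", "srp_Cyrl", tmp, max_pairs=5))
```

Because the records are consumed lazily, stopping after `max_pairs` means the rest of the stream is never read.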

As for eng_Latn-srp_Latn, this language pair appears neither in the error message nor in NLLB_PAIRS, so I guess it isn't in the dataset.
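
A quick way to check a direction before downloading is a membership test that tries both orders. This is a minimal sketch that assumes the pairs are stored as `(src, tgt)` tuples; the actual format of NLLB_PAIRS in the PR may differ. `find_pair` and `SAMPLE_PAIRS` are hypothetical, and the sample holds only the two pairs from the tree above, not the real list.

```python
def find_pair(src, tgt, pairs):
    """Return the pair as stored in `pairs`, trying both orders, or None.

    Assumption: `pairs` is a collection of (lang_a, lang_b) tuples.
    """
    for candidate in ((src, tgt), (tgt, src)):
        if candidate in pairs:
            return candidate
    return None


# Illustrative sample only; the real NLLB_PAIRS list is much larger.
SAMPLE_PAIRS = {("ace_Latn", "ban_Latn"), ("amh_Ethi", "nus_Latn")}

if __name__ == "__main__":
    print(find_pair("ban_Latn", "ace_Latn", SAMPLE_PAIRS))
    print(find_pair("eng_Latn", "srp_Latn", SAMPLE_PAIRS))
```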

vienneraphael commented 1 year ago

I'm realizing there are some pairs for which the dataset viewer doesn't work on Hugging Face.

However, these pairs are not necessarily problematic, since I managed to download srp_Cyrl-tur_Latn.