gordicaleksa / Open-NLLB

Effort to open-source NLLB checkpoints.
MIT License

feat: Downloading mined bitext #24

Closed vienneraphael closed 1 year ago

vienneraphael commented 1 year ago

Partially closes #11 (the analysis still needs to be added). Allows downloading language pairs from the allenai/nllb dataset on Hugging Face: https://huggingface.co/datasets/allenai/nllb/tree/main

I've stored NLLB_PAIRS (the pairs released with the NLLB paper) and CCMATRIX_PAIRS (the pairs for which the NLLB paper reused the earlier CCMatrix dataset) in a separate Python file and import them into the main script, since both lists are rather big and would impair code readability if inlined.
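For readers skimming the PR, here is a minimal sketch of how such constants and a --minimal flag could fit together. It is not the code from this PR: the tuple format of the pairs, the `select_pairs` helper and its `limit` parameter are assumptions for illustration, and the two pairs shown are simply the ones visible in the debug tree (which list they actually belong to is not stated here). The constants are inlined only to keep the block self-contained.

```python
import argparse

# Hypothetical stand-ins for the constants kept in the separate Python file.
# The real lists are much longer; these entries are illustrative only.
NLLB_PAIRS = [("ace_Latn", "ban_Latn"), ("amh_Ethi", "nus_Latn")]
CCMATRIX_PAIRS = []  # placeholder: the real list lives in the PR


def parse_args(argv=None):
    """Parse the command line; --minimal switches on the debug mode."""
    parser = argparse.ArgumentParser(description="Download mined bitext (sketch).")
    parser.add_argument(
        "--minimal",
        action="store_true",
        help="debug mode: only process a handful of language pairs",
    )
    return parser.parse_args(argv)


def select_pairs(minimal, limit=2):
    """Return all known pairs, or only the first `limit` in debug mode."""
    pairs = NLLB_PAIRS + CCMATRIX_PAIRS
    return pairs[:limit] if minimal else pairs


if __name__ == "__main__":
    args = parse_args()
    for src, tgt in select_pairs(args.minimal):
        print(f"{src}-{tgt}")
```

Keeping the lists in their own module means the main script only needs an import line, and the debug path never has to touch the full lists.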

Resulting tree structure with the --minimal flag (debug mode) enabled:

downloads/nllb/
├── ace_Latn-ban_Latn
│   ├── allenai.nllb.ace_Latn
│   └── allenai.nllb.ban_Latn
└── amh_Ethi-nus_Latn
    ├── allenai.nllb.amh_Ethi
    └── allenai.nllb.nus_Latn
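The layout above can be reproduced with a few lines of pathlib. This is a rough sketch rather than the PR's implementation: `pair_file_paths` and `write_pair` are hypothetical helpers, and the one-sentence-per-line, line-aligned file format is an assumption, since the PR description only shows the directory and file names.

```python
from pathlib import Path


def pair_file_paths(root, src, tgt, prefix="allenai.nllb"):
    """Build <root>/<src>-<tgt>/<prefix>.<lang> for both sides of a pair."""
    pair_dir = Path(root) / f"{src}-{tgt}"
    return pair_dir / f"{prefix}.{src}", pair_dir / f"{prefix}.{tgt}"


def write_pair(root, src, tgt, rows):
    """Write (source, target) sentence tuples to two line-aligned files.

    Assumes one sentence per line; line i of each file forms one pair.
    """
    src_path, tgt_path = pair_file_paths(root, src, tgt)
    src_path.parent.mkdir(parents=True, exist_ok=True)
    with open(src_path, "w", encoding="utf-8") as f_src, \
            open(tgt_path, "w", encoding="utf-8") as f_tgt:
        for src_sentence, tgt_sentence in rows:
            f_src.write(src_sentence + "\n")
            f_tgt.write(tgt_sentence + "\n")
    return src_path, tgt_path


if __name__ == "__main__":
    # Placeholder sentences, only there to exercise the layout.
    rows = [("source sentence 1", "target sentence 1")]
    print(write_pair("downloads/nllb", "ace_Latn", "ban_Latn", rows))
```

One directory per pair keeps both sides of the bitext together, and the language code in the file suffix makes each file self-describing.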

I'm planning to add the analysis as the next step after this PR.