Partially closes #11 (analysis still needs to be added).

Allows downloading language pairs from the allenai NLLB dataset (Hugging Face dataset): https://huggingface.co/datasets/allenai/nllb/tree/main
I've stored `NLLB_PAIRS` (pairs released with the NLLB paper) and `CCMATRIX_PAIRS` (pairs for which the NLLB paper reused the previous CCMatrix dataset) in a separate Python file and import these variables into the main script, because these two are rather big and could impair code readability.

Resulting tree structure with the `--minimal` flag (debug mode) enabled:

    downloads/nllb/
    ├── ace_Latn-ban_Latn
    │   ├── allenai.nllb.ace_Latn
    │   └── allenai.nllb.ban_Latn
    └── amh_Ethi-nus_Latn
        ├── allenai.nllb.amh_Ethi
        └── allenai.nllb.nus_Latn
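For reviewers, here is a minimal sketch of how the directory layout above could be derived from a list of language pairs. It is not the code in this PR: `EXAMPLE_PAIRS`, `output_paths`, `select_pairs`, and `planned_files` are hypothetical names, the pairs are assumed to be `(source, target)` tuples, and `--minimal` is assumed to mean "only process the first few pairs".

```python
from pathlib import Path

# Hypothetical stand-in for the much larger NLLB_PAIRS list that lives in
# the separate module; only the two pairs shown in the tree are used here.
EXAMPLE_PAIRS = [("ace_Latn", "ban_Latn"), ("amh_Ethi", "nus_Latn")]


def output_paths(src, tgt, root=Path("downloads/nllb")):
    """Return the (source file, target file) paths for one language pair."""
    pair_dir = root / f"{src}-{tgt}"
    return pair_dir / f"allenai.nllb.{src}", pair_dir / f"allenai.nllb.{tgt}"


def select_pairs(pairs, minimal=False, limit=2):
    """Assumed behaviour of --minimal: keep only the first `limit` pairs."""
    return list(pairs[:limit]) if minimal else list(pairs)


def planned_files(pairs, minimal=False, limit=2):
    """List every file path that would be written, in pair order."""
    files = []
    for src, tgt in select_pairs(pairs, minimal=minimal, limit=limit):
        files.extend(output_paths(src, tgt))
    return files


if __name__ == "__main__":
    for path in planned_files(EXAMPLE_PAIRS, minimal=True):
        print(path.as_posix())
```

Keeping the path construction in one small function makes it easy to check the layout without touching the network or the real pair lists.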
I'm planning to add the analysis as the next step after this PR.