huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add Support for "No Language Left Behind" (NLLB) #18043

Closed stefan-it closed 2 years ago

stefan-it commented 2 years ago

Model description

Hi,

Meta recently released another cool project called "No Language Left Behind" (NLLB):

No Language Left Behind (NLLB) is a first-of-its-kind, AI breakthrough project that open-sources models capable of delivering high-quality translations directly between any pair of 200+ languages — including low-resource languages like Asturian, Luganda, Urdu and more. It aims to help people communicate with anyone, anywhere, regardless of their language preferences.

The project itself is integrated into the fairseq library and is available on the nllb branch:

https://github.com/facebookresearch/fairseq/tree/nllb

It includes a code release as well as released checkpoints.

A detailed 190-page paper is also available here.

We should really support this amazing project by adding NLLB to Transformers.

Open source status

Provide useful links for the implementation

Model checkpoints are available here:

| Model Name | Model Type | #params | checkpoint | metrics |
|---|---|---|---|---|
| NLLB-200 | MoE | 54.5B | model | metrics |
| NLLB-200 | Dense | 3.3B | model | metrics |
| NLLB-200 | Dense | 1.3B | model | metrics |
| NLLB-200-Distilled | Dense | 1.3B | model | metrics |
| NLLB-200-Distilled | Dense | 600M | model | metrics |

Maintainers are: @vedanuj, @shruti-bh, @annasun28, @elbayadm, @jeanm, @jhcross, @kauterry and @huihuifan.

Implementation is available in the fairseq repo: https://github.com/facebookresearch/fairseq/tree/nllb

stefan-it commented 2 years ago

For the tokenization part, an SPM model is provided and can be downloaded from here.

It is a "real" SPM model, that can e.g. be loaded like this:

import sentencepiece as spm

model_file = "flores200sacrebleuspm"
sp_model = spm.SentencePieceProcessor()
sp_model.Load(model_file)

Let's investigate this model a bit more:


In [5]: sp_model.vocab_size()
Out[5]: 256000

In [6]: for index in range(0,10):
   ...:     print(index, "->", sp_model.IdToPiece(index))
   ...: 
0 -> <unk>
1 -> <s>
2 -> </s>
3 -> an
4 -> ▁n
5 -> ▁m
6 -> ▁t
7 -> ▁k
8 -> ▁a
9 -> ▁s

The overall vocab size is 256,000 and the output shows the first 10 "pieces" in the SPM model.
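
For a quick sanity check, the loaded model can also segment text directly (the example sentence is arbitrary):

# Segment an arbitrary sentence into SPM pieces and their ids
pieces = sp_model.EncodeAsPieces("No Language Left Behind")
ids = sp_model.EncodeAsIds("No Language Left Behind")
print(pieces)
print(ids)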

jhcross commented 2 years ago

Hi, I'm one of the Meta engineers who worked on NLLB, and I'm happy to support this from our side. That's indeed the correct (real) SPM model for the vocabulary used for input/output, but internally the model's vocabulary (and embedding table) size is supplemented at the end by a token for each language, which happens here:

https://github.com/facebookresearch/fairseq/blob/26d62ae8fbf3deccf01a138d704be1e5c346ca9a/fairseq/data/multilingual/multilingual_utils.py#L64

This list of languages comes from an input arg which reads them from a string or a file. For these particular models that value is:

https://github.com/facebookresearch/fairseq/blob/26d62ae8fbf3deccf01a138d704be1e5c346ca9a/examples/nllb/modeling/scripts/flores200/langs.txt#L1
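
For illustration, the effect of that line is roughly the following (the language codes are a handful of entries from langs.txt, 256,000 is the SPM vocabulary size shown earlier in this thread, and the exact surface form of each language token is glossed over here):

# Rough sketch: one extra token id per language, appended after the regular vocabulary
langs = ["ace_Arab", "ace_Latn", "eng_Latn", "fra_Latn", "zho_Hans"]  # tiny subset of langs.txt

base_vocab_size = 256_000  # sp_model.vocab_size() from the comment above
lang_token_ids = {lang: base_vocab_size + i for i, lang in enumerate(langs)}
print(lang_token_ids)  # each language ends up with its own id at the tail of the embedding table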

Please let me know if you have any questions about this or if I can be of any further help.

ghost commented 2 years ago

I saw their demo; for Chinese translation the quality is very low. I think more work is needed to improve the model.

jhcross commented 2 years ago

I made a mistake above, as there is another way our internal vocabulary differs from the "standard" SPM model:

The 3 special tokens shown at the beginning of your output above are replaced by the following 4 tokens (at indices 0, 1, 2, and 3, respectively): "<s>", "<pad>", "</s>", "<unk>".

This can be seen where the internal Fairseq dictionary is constructed in the code from the plaintext vocabulary file (before the language tokens are added):

https://github.com/fairinternal/fairseq-py/blob/3506ddfb3585aa470f59902ea44625e39287e37c/fairseq/data/dictionary.py#L35-L38
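
If fairseq is installed, the default special-symbol layout can be checked directly (instantiating an empty Dictionary is enough for this):

# fairseq's Dictionary registers <s>, <pad>, </s>, <unk> at indices 0-3 by default
from fairseq.data import Dictionary

d = Dictionary()
print(d.bos_index, d.pad_index, d.eos_index, d.unk_index)  # 0 1 2 3
print(d[0], d[1], d[2], d[3])                              # <s> <pad> </s> <unk>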

stefan-it commented 2 years ago

@jhcross thanks for the explanation. I think we need to perform some fairseq mapping, as is done, e.g., in the XLM-R or mBART tokenizer:

https://github.com/huggingface/transformers/blob/3f936df66287f557c6528912a9a68d7850913b9b/src/transformers/models/mbart/tokenization_mbart.py#L129-L136
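
For reference, the mapping in that linked code boils down to something like this (a simplified sketch reusing the sp_model loaded above; piece_to_fairseq_id is a made-up helper name, not the eventual tokenizer API):

# Simplified sketch of the fairseq <-> SPM id alignment, modeled on the linked mBART code
fairseq_tokens_to_ids = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
fairseq_offset = 1  # fairseq has one extra special token (<pad>), so ordinary SPM ids shift by one

def piece_to_fairseq_id(piece):
    # Special tokens map to fixed fairseq indices; everything else is its SPM id plus the offset
    if piece in fairseq_tokens_to_ids:
        return fairseq_tokens_to_ids[piece]
    return sp_model.PieceToId(piece) + fairseq_offset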

jhcross commented 2 years ago

@stefan-it that makes sense, and I would assume that that code could be reused verbatim. The only additional thing would be to add the language tokens to the end of the vocabulary. Note that the language list can also be extracted from the checkpoint data as follows:

import torch

# The language list is stored in the model config saved inside the checkpoint
checkpoint = torch.load(path_to_file)
langs_list = checkpoint["cfg"]["model"].langs
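
A quick sanity check on the extracted list (purely illustrative):

# The list should contain one code per supported language
print(len(langs_list))   # expected to be around 200 for the NLLB-200 models
print(langs_list[:5])    # codes of the form "language_Script", e.g. "ace_Arab"
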
LysandreJik commented 2 years ago

Thanks for opening an issue! We've managed to convert the models to the M2M_100 architecture and the tokenizers to a new NLLB tokenizer very closely resembling the mBART tokenizer.

We're in the process of testing all models for generation and performance and I'll likely open a PR in a few hours.

daje0601 commented 2 years ago

> Hi, I'm one of the Meta engineers who worked on NLLB, and I'm happy to support this from our side. That's indeed the correct (real) SPM model for the vocabulary used for input/output, but internally the model's vocabulary (and embedding table) size is supplemented at the end by a token for each language, which happens here:
>
> https://github.com/facebookresearch/fairseq/blob/26d62ae8fbf3deccf01a138d704be1e5c346ca9a/fairseq/data/multilingual/multilingual_utils.py#L64
>
> This list of languages comes from an input arg which reads them from a string or a file. For these particular models that value is:
>
> https://github.com/facebookresearch/fairseq/blob/26d62ae8fbf3deccf01a138d704be1e5c346ca9a/examples/nllb/modeling/scripts/flores200/langs.txt#L1
>
> Please let me know if you have any questions about this or if I can be of any further help.

Hello, first of all, thank you so much. NLLB is a super powerful translation model! I think it's amazing to be able to translate across 200 of the world's languages. Also, thank you so much for updating Hugging Face five days ago so that it can be used easily.

I'm trying to use NLLB through Hugging Face, but the tokenizer is not working...

----> 1 tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

/usr/local/lib/python3.7/dist-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    575             if tokenizer_class is None:
    576                 raise ValueError(
--> 577                     f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    578                 )
    579             return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)

ValueError: Tokenizer class NllbTokenizer does not exist or is not currently imported.


daje0601 commented 2 years ago

Hello, #18126. Really, the tokenizer is not working.

AhmedIdr commented 2 years ago

Hey @daje0601, it was merged into the main branch just 15 minutes ago. I just tried it and it seems to be working; make sure you are installing from the main branch.

daje0601 commented 2 years ago

> Hey @daje0601, it was merged into the main branch just 15 minutes ago. I just tried it and it seems to be working; make sure you are installing from the main branch.

Hey @AhmedIdr, I'm not a liar. I also tried running it in Colab a minute ago, but it didn't work, which is why I asked. I've been thinking about this all day; I knew it was a really simple question, so I searched and searched some more before asking. Here is a link to the Colab I tested: link

AhmedIdr commented 2 years ago

Hey @daje0601, you are installing transformers from PyPI, not the latest main branch from GitHub. Try installing transformers like this: !pip install git+https://github.com/huggingface/transformers.git and see if it works afterwards.
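
Once the source install is in place, a minimal translation example along these lines should work (the target language fra_Latn and the sample sentence are just illustrations):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the distilled 600M checkpoint; src_lang sets the source language token
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tokenizer("No language left behind.", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),  # force French as the target
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])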

daje0601 commented 2 years ago

> Hey @daje0601, you are installing transformers from PyPI, not the latest main branch from GitHub. Try installing transformers like this: !pip install git+https://github.com/huggingface/transformers.git and see if it works afterwards.

Oh..!!!!!!!!!! It's working..!!!! So So So Thank you ♥︎

ArturPrzybysz commented 2 years ago

@AhmedIdr Hi, besides the NLLB models themselves, the authors have also published their language identification model. Is there a chance of having it incorporated into HF as well?

AhmedIdr commented 2 years ago

@ArturPrzybysz Hi, I am not part of the HF team; I am just a community member who wanted to help with the issue :)

xia0nan commented 2 years ago

@ArturPrzybysz You can use the LID (Language IDentification) model with fastText; see https://github.com/huggingface/transformers/issues/18294#issuecomment-1207374838
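
A minimal sketch with the fasttext package, assuming the LID model file has already been downloaded (the file name here is an assumption; see the linked comment for the actual download):

import fasttext

# Load the NLLB language-identification model (path/file name assumed)
lid_model = fasttext.load_model("lid218e.bin")

labels, scores = lid_model.predict("Meta recently released No Language Left Behind.", k=1)
print(labels[0], scores[0])  # e.g. a label like "__label__eng_Latn" with its confidence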

sotwi commented 2 years ago

First and foremost, thank you to everyone who has been working on this, both the original team at Meta and those who ported it to Hugging Face.

I was checking the model's page on the Hugging Face website. Unlike previous translation models (like mBART), there are no details on how to train a model with the NLLB architecture on new languages. I am especially interested in the details of the load-balancing loss function: how to compute it, combine it with the standard cross-entropy loss, and back-propagate it properly.

I would be very thankful if anyone can point me in the right direction concerning this topic.
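
Not an official answer, but the load-balancing term in MoE models of this family typically follows the Switch/GShard-style formulation; a rough sketch (top-1 routing and the weight alpha are simplifications here, not the exact NLLB recipe):

import torch

def load_balancing_loss(router_logits, alpha=0.01):
    # router_logits: (num_tokens, num_experts) raw gating scores for one MoE layer
    num_experts = router_logits.shape[-1]
    probs = torch.softmax(router_logits, dim=-1)   # router probabilities per token
    assignments = probs.argmax(dim=-1)             # expert chosen for each token (top-1 here)
    # f: fraction of tokens routed to each expert; P: mean router probability per expert
    f = torch.bincount(assignments, minlength=num_experts).float() / router_logits.shape[0]
    P = probs.mean(dim=0)
    return alpha * num_experts * torch.sum(f * P)

# The auxiliary term is simply added to the usual token-level cross entropy before backward():
# total_loss = cross_entropy_loss + sum(load_balancing_loss(l) for l in all_router_logits)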