For the tokenization part, an SPM model is provided and can be downloaded from here.
It is a "real" SPM model that can, for example, be loaded like this:
import sentencepiece as spm
model_file = "flores200sacrebleuspm"
sp_model = spm.SentencePieceProcessor()
sp_model.Load(model_file)
Let's investigate this model a bit more:
In [5]: sp_model.vocab_size()
Out[5]: 256000
In [6]: for index in range(0,10):
...: print(index, "->", sp_model.IdToPiece(index))
...:
0 -> <unk>
1 -> <s>
2 -> </s>
3 -> an
4 -> ▁n
5 -> ▁m
6 -> ▁t
7 -> ▁k
8 -> ▁a
9 -> ▁s
The overall vocab size is 256,000 and the output shows the first 10 "pieces" in the SPM model.
Hi, I'm one of the Meta engineers who worked on NLLB, and I'm happy to support this from our side. That's indeed the correct (real) SPM model for the vocabulary used for input/output, but internally the model's vocabulary (and embedding table) size is supplemented at the end by a token for each language, which happens here:
This list of languages comes from an input arg which reads them from a string or file. For these particular models that value is:
Please let me know if you have any questions about this or if I can be of any further help.
I saw their demo; for Chinese translation the quality is very low. I think they need to put more work into improving the model.
I made a mistake above, as there is another way our internal vocabulary differs from the "standard" SPM model:
The 3 special tokens shown at the beginning of your output above are replaced by the following 4 tokens (at indices 0, 1, 2, and 3, respectively): "&lt;s&gt;", "&lt;pad&gt;", "&lt;/s&gt;", and "&lt;unk&gt;".
This can be seen where the internal Fairseq dictionary is constructed in the code from the plaintext vocabulary file (before the language tokens are added):
@jhcross thanks for the explanation. I think we need to perform some fairseq-mapping, as e.g. done in the XLM-R or BART tokenizer:
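Something along these lines, just as a rough sketch assuming the layout described above (4 fairseq specials at indices 0-3, then the SPM pieces shifted by an offset of 1, then the language tokens appended at the end); the exact offsets would still need to be verified against the fairseq dictionary:

import sentencepiece as spm

sp_model = spm.SentencePieceProcessor()
sp_model.Load("flores200sacrebleuspm")

# fairseq reserves indices 0-3 for its own special tokens
fairseq_tokens_to_ids = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}

# SPM uses 0/1/2 for <unk>/<s>/</s>, fairseq uses 0..3, so regular
# pieces are shifted by an offset of 1
fairseq_offset = 1

def piece_to_fairseq_id(piece):
    if piece in fairseq_tokens_to_ids:
        return fairseq_tokens_to_ids[piece]
    sp_id = sp_model.PieceToId(piece)
    # pieces unknown to SPM come back as id 0 (<unk>) and should map to fairseq's <unk>
    return sp_id + fairseq_offset if sp_id != 0 else fairseq_tokens_to_ids["<unk>"]

# the language tokens would then occupy the ids starting at
# sp_model.vocab_size() + fairseq_offset, in the order given by the langs list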
@stefan-it that makes sense, and I would assume that that code could be reused verbatim. The only additional thing would be to add the language tokens to the end of the vocabulary. Note that the language list can also be extracted from the checkpoint data as follows:
import torch

# path_to_file points to one of the released NLLB checkpoints
checkpoint = torch.load(path_to_file)
langs_list = checkpoint["cfg"]["model"].langs
Thanks for opening an issue! We've managed to convert the models to the M2M_100 architecture and the tokenizers to a new NLLB tokenizer very closely resembling that of the mBART tokenizer.
We're in the process of testing all models for generation and performance and I'll likely open a PR in a few hours.
Hello! First of all, thank you so much. NLLB is a super powerful translation model! There are 208 in the world, and I think it's amazing to be able to translate 200 of them. Also, thank you so much for updating Hugging Face 5 days ago so that it can be used easily.
I'm trying to use NLLB through Hugging Face, but the tokenizer is not working...
----> 1 tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
[/usr/local/lib/python3.7/dist-packages/transformers/models/auto/tokenization_auto.py](https://localhost:8080/#) in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
575 if tokenizer_class is None:
576 raise ValueError(
--> 577 f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
578 )
579 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
ValueError: Tokenizer class NllbTokenizer does not exist or is not currently imported.
Question 1: As an NLLB beginner, I don't know how to fix this.
Question 2: Once the tokenizer works, can I use it in the same way as the M2M100 model?
Hello, regarding #18126: the tokenizer is really, really not working.
Hey @daje0601, it was just merged to the main branch 15 minutes ago. I just tried it and it seems to be working; make sure you are installing from the main branch.
Hey @AhmedIdr, I'm not a liar. I also tried running it in Colab a minute ago, but it didn't work, so I asked this question. I've been thinking about this all day. I knew it was a really simple question, so I searched and searched some more before asking. Here is a link to a Colab I tested: link
Hey @daje0601, you are installing transformers from pip and not the latest branch from GitHub. Try installing transformers like this:
!pip install git+https://github.com/huggingface/transformers.git
and see if it works afterwards.
Oh..!!!!!!!!!! It's working..!!!! So So So Thank you ♥︎
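To come back to Question 2: yes, usage is essentially the same as with M2M-100, with the source language set on the tokenizer and the target language forced as the first generated token. A minimal sketch, using the facebook/nllb-200-distilled-600M checkpoint from above and FLORES-200 language codes (e.g. eng_Latn / fra_Latn):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# the source language is set on the tokenizer; the target language is
# forced as the first generated token, just as with M2M-100
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tokenizer("Hello, world!", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))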
@AhmedIdr Hi, besides the NLLB models themselves, the authors have also published their language identification model. Is there any chance of having it incorporated into HF as well?
@ArturPrzybysz Hi, I am not a part of the hf team, I am just a community member and wanted to help with the issue :)
@ArturPrzybysz You can use the LID (Language IDentification) model with fastText https://github.com/huggingface/transformers/issues/18294#issuecomment-1207374838
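Roughly like this, for example (a minimal sketch; the file name of the downloaded LID checkpoint is just a placeholder here):

import fasttext

# path to the downloaded NLLB LID checkpoint (placeholder file name)
lid_model = fasttext.load_model("nllb_lid.bin")

# returns FLORES-200-style labels, e.g. ('__label__fra_Latn',), with scores
labels, scores = lid_model.predict("Ceci est une phrase en français.", k=1)
print(labels, scores)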
First and foremost, thank you to everyone that has been working on this, both from the original team at Meta and then on porting it to huggingface.
I was checking the model's page on the Hugging Face website. Unlike for previous translation models (like mBART), there are no details regarding how to train a model with the NLLB architecture on new languages. I am especially interested in the details regarding the Load Balancing Loss function: how to compute it, combine it with the standard Cross Entropy Loss, and back-propagate it properly.
I would be very thankful if anyone can point me in the right direction concerning this topic.
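For what it's worth, the load balancing term in the MoE literature that NLLB builds on (GShard / Switch Transformer) is usually computed roughly as below; this is only a sketch of that standard formulation, not the exact fairseq implementation, which remains the authoritative reference for the exact weighting.

import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_indices, num_experts):
    # router_probs: (num_tokens, num_experts) softmax output of the gating network
    # expert_indices: (num_tokens,) expert chosen for each token
    # f_i: fraction of tokens dispatched to expert i (non-differentiable)
    dispatch = F.one_hot(expert_indices, num_experts).float()
    tokens_per_expert = dispatch.mean(dim=0)
    # P_i: mean router probability assigned to expert i (gradients flow here)
    prob_per_expert = router_probs.mean(dim=0)
    # encourages a uniform load: num_experts * sum_i f_i * P_i
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)

# combined objective: the auxiliary term is added to the cross entropy with a
# small weight (e.g. alpha = 0.01) and back-propagated together with it:
#   total_loss = cross_entropy_loss + alpha * load_balancing_loss(...)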
Model description
Hi,
Meta recently released another cool project called "No Language Left Behind" (NLLB):
The project itself is integrated into the fairseq library and available on the nllb branch: https://github.com/facebookresearch/fairseq/tree/nllb
It includes code release as well as released checkpoints.
A detailed 190-page paper is also available here.
We should really support this amazing project by adding NLLB.
Open source status
Provide useful links for the implementation
Model checkpoints are available here:
Maintainers are: @vedanuj, @shruti-bh, @annasun28, @elbayadm, @jeanm, @jhcross, @kauterry and @huihuifan.
The implementation is available in the fairseq repo: https://github.com/facebookresearch/fairseq/tree/nllb