huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

no entry found for key error in tokenizer #997

Closed AbuUbaida closed 2 years ago

AbuUbaida commented 2 years ago

The tokenizer I am using: `tokenizer = BertTokenizerFast.from_pretrained("sagorsarker/bangla-bert-base")`, with datasets v1.0.2 and transformers v4.2.1. Whenever I try to map the train data:

```python
train_data = train_data.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["text", "summary"],
)
```

it just throws the error `no entry found for key`, while there is no problem with other tokenizers. What entry is missing, and where should it be added?
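
(The body of `process_data_to_model_inputs` isn't shown in this issue. For reference, a minimal sketch of what such a seq2seq preprocessing function typically looks like, using the column names from the `map` call above; the max lengths are assumptions:)

```python
# Hypothetical sketch -- the real function isn't shown in the issue.
def process_data_to_model_inputs(batch):
    # Tokenize source texts and target summaries (max lengths assumed).
    inputs = tokenizer(
        batch["text"], padding="max_length", truncation=True, max_length=512
    )
    outputs = tokenizer(
        batch["summary"], padding="max_length", truncation=True, max_length=128
    )
    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    # Target token ids serve as the decoder labels.
    batch["labels"] = outputs.input_ids
    return batch
```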

Narsil commented 2 years ago

Hi @AbuUbaida

What version of tokenizers are you using? This error has been reduced to a warning, I think (it's a hole within your vocabulary, which should be concerning, but you should still be able to run your model).

Can you try upgrading to 0.12.1 and a more recent transformers version to check?
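
For example, assuming pip manages the environment:

```bash
pip install tokenizers==0.12.1
pip install -U transformers
```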

AbuUbaida commented 2 years ago

> What version of tokenizers are you using?

Where can I actually find the version of tokenizers I am using? (newbie here) And if I want to stick with the current transformers version (4.2.1), what could be the way to solve the problem?

Narsil commented 2 years ago

`pip freeze` (look for the `tokenizers` line).
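
Alternatively, a quick check from inside Python (both libraries expose their version as a module attribute):

```python
import tokenizers
import transformers

# Each library reports its installed version via __version__.
print("tokenizers:", tokenizers.__version__)
print("transformers:", transformers.__version__)
```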

You don't have to upgrade transformers, but we try to maintain pretty good backward compatibility, so it shouldn't be much of an issue (again, you don't have to).

AbuUbaida commented 2 years ago

> `pip freeze` (look for the `tokenizers` line).

First, thanks for this lesson; a powerful command indeed. My tokenizers version has been upgraded from 0.9.4 to 0.12.1 as you said. However, this error is shown when importing transformers:

```
VersionConflict: tokenizers==0.9.4 is required for a normal functioning of this module, but found tokenizers==0.12.1. Try: pip install transformers -U or pip install -e '.[dev]' if you're working with git master
```

If you don't mind my asking: will I have to upgrade the transformers version?

Thanks for your support!

Narsil commented 2 years ago

To be safe, yes, but you could also remove those version checks and see what happens.

The other route is to fix your vocabulary, or to use the "slow" version: `AutoTokenizer.from_pretrained("....", use_fast=False)`.
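
Spelled out with the checkpoint from this issue (a sketch; `use_fast=False` loads the pure-Python tokenizer instead of the Rust-backed fast one):

```python
from transformers import AutoTokenizer

# Fall back to the pure-Python ("slow") tokenizer, which avoids the
# fast tokenizer's "no entry found for key" lookup error.
tokenizer = AutoTokenizer.from_pretrained(
    "sagorsarker/bangla-bert-base", use_fast=False
)
```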

AbuUbaida commented 2 years ago

It just works after explicitly pinning `datasets==2.1.0`. Thanks for your support @Narsil. Closing the issue!
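
That is, something along the lines of:

```bash
pip install datasets==2.1.0
```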