Hi @AbuUbaida
What version of tokenizers are you using? This error has been reduced to a warning, I think (it's a hole in your vocabulary, which should be concerning, but you should still be able to run your model).
Can you try upgrading to 0.12.1 and a more recent transformers version to check?
What version of tokenizers are you using?
Where can I actually find the version of the tokenizer I am using? (newbie) And if I want to stick with the current transformers version (4.2.1), what would be the way to solve the problem?
pip freeze (look for the tokenizers line).
You don't have to upgrade transformers, but we're trying to have pretty nice backward compatibility, so it shouldn't be much of an issue (again, you don't have to).
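For reference, a minimal sketch of an alternative way to check the installed versions from Python itself (assuming both packages are importable in the current environment):

```python
# Print the installed versions of tokenizers and transformers;
# both packages expose a __version__ attribute.
import tokenizers
import transformers

print(tokenizers.__version__)
print(transformers.__version__)
```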
pip freeze (look for the tokenizers line).
First, thanks for this lesson, a powerful command indeed. My tokenizers version 0.9.4 has been upgraded to 0.12.1 as you said. However, this error is shown when importing transformers:
VersionConflict: tokenizers==0.9.4 is required for a normal functioning of this module, but found tokenizers==0.12.1. Try: pip install transformers -U or pip install -e '.[dev]' if you're working with git master
If it's not too much trouble, will I have to upgrade the transformers version?
Thanks for your support!
To be safe, yes, but you could also remove those triggers and see what happens.
The other route is to fix your vocabulary, or use the "slow" version AutoTokenizer.from_pretrained("....", use_fast=False)
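A minimal sketch of that slow-tokenizer fallback, using the model name mentioned in this issue (the checkpoint path is the only assumption beyond use_fast=False):

```python
# Load the pure-Python ("slow") tokenizer instead of the Rust-backed fast one;
# this may sidestep the fast tokenizer's vocabulary lookup that raises the error.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "sagorsarker/bangla-bert-base", use_fast=False
)
```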
It's just working now after explicitly pinning datasets==2.1.0. Thanks for your contribution @Narsil.
Closing the issue!
The tokenizer I am using:
tokenizer = BertTokenizerFast.from_pretrained("sagorsarker/bangla-bert-base")
with datasets v1.0.2 & transformers v4.2.1. Whenever I try to map the train data:
train_data = train_data.map(process_data_to_model_inputs, batched=True, batch_size=batch_size, remove_columns=['text', 'summary'])
it just throws the error no entry found for key, while there is no problem with other tokenizers. What entry is missing and where should it be added?
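For context, a hypothetical sketch of what a process_data_to_model_inputs function along these lines might look like; the 'text'/'summary' column names come from the remove_columns argument above, while the max-length values and padding strategy are assumptions:

```python
# Hypothetical preprocessing function passed to datasets.map (batched=True)
# in a summarization setup; "tokenizer" is the BertTokenizerFast loaded above.
def process_data_to_model_inputs(batch):
    # Tokenize the source texts and the target summaries.
    inputs = tokenizer(
        batch["text"], padding="max_length", truncation=True, max_length=512
    )
    outputs = tokenizer(
        batch["summary"], padding="max_length", truncation=True, max_length=128
    )

    # Attach the model inputs and labels to the batch.
    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    batch["labels"] = outputs.input_ids
    return batch
```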