Closed: ryparmar closed this issue 3 years ago
There seem to be some missing or non-consecutive tokens in the vocabulary of that tokenizer, causing the serialization to fail.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
The issue is still there. Is anyone able to save the tokenizer with tokenizer.save_pretrained()?
I am facing the exact same problem while working with "sagorsarker/bangla-bert-base", following the same reproduction instructions provided by @ryparmar. I still could not solve this issue, or even find the root cause of the error.
I found that the problem comes from using the fast tokenizer, so I turned it off with the flag --use_fast_tokenizer=False and it works. Though this is not the solution I want.
Hey guys,
At the moment, it seems like we will have to fall back to the slow tokenizer for this one:
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the slow (Python) tokenizer instead of the Rust-backed fast one
tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base", use_fast=False)
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

# Saving succeeds with the slow tokenizer
tokenizer.save_pretrained('./')
works.
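For completeness, a quick round-trip check (a sketch, not from the thread) to confirm that the saved slow tokenizer loads back correctly:

# Save the slow tokenizer to a local directory and reload it from disk
tokenizer.save_pretrained('./robeczech-slow')
reloaded = AutoTokenizer.from_pretrained('./robeczech-slow', use_fast=False)
assert reloaded.tokenize("Ahoj světe") == tokenizer.tokenize("Ahoj světe")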
Hi all,
I just committed a working fast tokenizer to the HF ufal/robeczech-base repository, in case it helps someone (but loading a fast tokenizer from the previous repository content was working too).
The reason why it cannot be saved is our own mistake (as the authors of the ufal/robeczech-base model). During training, the subwords not present in the training data were left out of the dictionary, but ByteBPE requires the basic 256 subwords representing the 256 byte values, and some of them were left out. We therefore have multiple subwords mapped to id 3 (the id of the [UNK] token), which seems to work fine during loading, but not during saving (only one subword with id 3 is saved).
Sorry for the trouble...
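For anyone who wants to check whether their tokenizer is affected by the same problem, here is a minimal sketch (my own, assuming only the standard get_vocab() API) that lists vocabulary ids shared by more than one subword:

from collections import defaultdict
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base", use_fast=False)

# Group subwords by their id; an id owned by several subwords cannot survive
# a save/load round trip, since only one of those subwords gets written out
ids_to_tokens = defaultdict(list)
for token, token_id in tokenizer.get_vocab().items():
    ids_to_tokens[token_id].append(token)

shared = {i: toks for i, toks in ids_to_tokens.items() if len(toks) > 1}
print(f"{len(shared)} id(s) are shared by more than one subword")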
Information
Unable to save the 'ufal/robeczech-base' fast tokenizer, which is a variant of RoBERTa. I have tried the same minimal example (see below) with the non-fast tokenizer and it worked fine.
Error message with RUST_BACKTRACE=1:

Environment info
transformers version: 4.10.0

Who can help
@patrickvonplaten, @LysandreJik

To reproduce
Import model and tokenizer:
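The code for this step was not preserved above; presumably it mirrors the workaround shown earlier in the thread, but with the default fast tokenizer:

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Default settings load the fast (Rust-backed) tokenizer
tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")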
Save the tokenizer:
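Again presumably, matching the call in the workaround above; this is the step that fails with the fast tokenizer:

tokenizer.save_pretrained('./')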