Hi @themanojkumar,
I was trying to use the BioGPT model for fine-tuning on my QA task. I would like to use a fast tokenizer, so that I can use offset_mapping to know which words the tokens originate from. Unfortunately, when creating a BioGptTokenizerFast from PreTrainedTokenizerFast via convert_slow_tokenizer, the following error occurs: Error while initializing BPE: Token `-@</w>` out of vocabulary.
According to this issue https://github.com/huggingface/transformers/issues/9290, this problem might be caused by some missing tokens. Could you please check it? Thank you very much!
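For context, here is a minimal pure-Python sketch of what an offset mapping provides, which is why I need the fast tokenizer (this only illustrates the concept; the `whitespace_tokenize_with_offsets` helper is made up for demonstration and is not the transformers API):

```python
import re

def whitespace_tokenize_with_offsets(text):
    """Return (tokens, offsets) where offsets[i] is the (start, end)
    character span of tokens[i] in the original text."""
    tokens, offsets = [], []
    for m in re.finditer(r"\S+", text):
        tokens.append(m.group())
        offsets.append((m.start(), m.end()))
    return tokens, offsets

text = "BioGPT answers questions"
tokens, offsets = whitespace_tokenize_with_offsets(text)
# Each token can be mapped back to the span of text it came from:
for tok, (start, end) in zip(tokens, offsets):
    assert text[start:end] == tok
```

A fast tokenizer returns exactly this kind of per-token span via `return_offsets_mapping=True`, which slow (pure-Python) tokenizers do not support.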
Environment
transformers version: 4.25.0
Error trace
Traceback (most recent call last):
  File "run.py", line 124, in <module>
    trainer, predict_dataset = get_trainer(args)
  File "***/tasks/qa/get_trainer.py", line 31, in get_trainer
    tokenizer = BioGptTokenizerFast.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 1801, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 1956, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "***/model/biogpt/tokenization_biogpt_fast.py", line 117, in __init__
    super().__init__(
  File "***/model/biogpt/tokenization_utils_fast.py", line 114, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
  File "***/model/biogpt/convert_slow_tokenizer.py", line 1198, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
  File "***/model/biogpt/convert_slow_tokenizer.py", line 273, in converted
    BPE(
Exception: Error while initializing BPE: Token `-@</w>` out of vocabulary
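In case it helps with debugging: the error suggests that a token produced by a merge rule (here `-@</w>`) is absent from the vocabulary. A sanity check over a BPE vocab/merges pair might look like the sketch below (this assumes the usual layout where each merge rule `a b` should produce a token `ab` present in the vocab; the sample data is made up to demonstrate the check, not taken from the real BioGPT files):

```python
def missing_merge_tokens(vocab, merges):
    """Return merge results (e.g. '-@</w>') that are not present in vocab.

    vocab:  dict mapping token -> id
    merges: list of (left, right) merge-rule pairs
    """
    missing = []
    for left, right in merges:
        merged = left + right
        if merged not in vocab:
            missing.append(merged)
    return missing

# Made-up sample data: the merge ("-", "@</w>") produces "-@</w>",
# which is deliberately absent from this toy vocab.
sample_vocab = {"-</w>": 0, "@</w>": 1, "the</w>": 2, "-": 3}
sample_merges = [("-", "@</w>")]

print(missing_merge_tokens(sample_vocab, sample_merges))  # ['-@</w>']
```

Running such a check against the published vocab.json/merges.txt would confirm whether the converter is failing because of tokens genuinely missing from the released files.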