microsoft / BioGPT

MIT License

Unable to convert BioGpt slow tokenizer to fast: token out of vocabulary #75

Open seantaud opened 1 year ago

seantaud commented 1 year ago

Hi @themanojkumar, I am trying to fine-tune the BioGPT model for a QA task. I would like to use a fast tokenizer so that I can use the offset mapping to determine which words the tokens originate from. Unfortunately, when a BioGptTokenizerFast is created from PreTrainedTokenizerFast via convert_slow_tokenizer, the following error occurs: Error while initializing BPE: Token -@</w> out of vocabulary. According to this issue, https://github.com/huggingface/transformers/issues/9290, the problem might be caused by missing tokens. Could you please check it? Thank you very much!
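One way to check the "missing tokens" hypothesis is to look for merge rules that refer to tokens absent from the vocabulary. The sketch below is only an illustration under assumptions: the helper name `find_missing_merge_tokens` is hypothetical, the vocabulary and merges are toy data rather than the real BioGPT files, and it assumes that a BPE model needs both halves of every merge and their concatenation to be present in the vocabulary.

```python
from typing import Dict, Iterable, List, Tuple


def find_missing_merge_tokens(
    vocab: Dict[str, int], merges: Iterable[Tuple[str, str]]
) -> List[str]:
    """Return tokens used by merge rules that are absent from vocab.

    Hypothetical helper: for each merge (left, right) it checks the two
    parts and their concatenation, in the order they are first seen.
    """
    missing: List[str] = []
    seen = set()
    for left, right in merges:
        for token in (left, right, left + right):
            if token not in vocab and token not in seen:
                seen.add(token)
                missing.append(token)
    return missing


# Toy data for illustration only, not the real BioGPT vocab or merges.
toy_vocab = {"-": 0, "@</w>": 1, "a": 2, "b</w>": 3, "ab</w>": 4}
toy_merges = [("a", "b</w>"), ("-", "@</w>")]

missing = find_missing_merge_tokens(toy_vocab, toy_merges)
print(missing)
```

With this toy data the only token reported is `-@</w>`, because the merge of `-` and `@</w>` produces a token that the vocabulary does not contain. Running the same check on the merge pairs and vocabulary loaded from the slow tokenizer's files (the loading step is not shown here and depends on their format) should list any tokens that need to be added before conversion.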

Environment

transformers version: 4.25.0

Error trace

Traceback (most recent call last):
  File "run.py", line 124, in <module>
    trainer, predict_dataset = get_trainer(args)
  File "***/tasks/qa/get_trainer.py", line 31, in get_trainer
    tokenizer = BioGptTokenizerFast.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 1801, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 1956, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "***/model/biogpt/tokenization_biogpt_fast.py", line 117, in __init__
    super().__init__(
  File "***/model/biogpt/tokenization_utils_fast.py", line 114, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
  File "***/model/biogpt/convert_slow_tokenizer.py", line 1198, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
  File "***/model/biogpt/convert_slow_tokenizer.py", line 273, in converted
    BPE(
Exception: Error while initializing BPE: Token `-@</w>` out of vocabulary

seantaud commented 1 year ago

Colab code for reproduction:

https://colab.research.google.com/drive/1IMhiDz45GiarBLgXG9B2rA_u0ZOmmjJS?usp=sharing

TekeshwarHirwani commented 1 year ago

I am also facing the same problem. Do you have any update?