huggingface / transformers

đŸ¤— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ValueError: Couldn't instantiate the backend tokenizer while loading model tokenizer #9750

Closed. rsanjaykamath closed this issue 3 years ago.

rsanjaykamath commented 3 years ago

Environment info

Who can help

@mfuntowicz @patrickvonplaten

Information

Model I am using (Bert, XLNet ...): T5

The problem arises when using:

To reproduce

Steps to reproduce the behavior:

  1. Follow the instructions here https://github.com/allenai/unifiedqa to get the sample code
  2. Copy-paste it into Colab and run it.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "allenai/unifiedqa-t5-small" # you can specify the model size here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def run_model(input_string, **generator_args):
    # Tokenize the input, generate with the T5 model, and decode the answer
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    res = model.generate(input_ids, **generator_args)
    return tokenizer.batch_decode(res, skip_special_tokens=True)
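
For reference, the helper can then be called like this (a hypothetical invocation; the question string and expected output are my own illustration, not from the original report):

run_model("which is the capital of france?")
# should return the decoded answer as a list of strings, e.g. something like ['paris']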

Expected behavior

The code above should load the model without errors.

Error

Instead, the following error is raised:

ValueError                                Traceback (most recent call last)
<ipython-input-4-ee10e1c1c77e> in <module>()
      2 
      3 model_name = "allenai/unifiedqa-t5-small" # you can specify the model size here
----> 4 tokenizer = AutoTokenizer.from_pretrained(model_name)
      5 model = T5ForConditionalGeneration.from_pretrained(model_name)
      6 

(4 intermediate frames hidden)
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_fast.py in __init__(self, *args, **kwargs)
     94         else:
     95             raise ValueError(
---> 96                 "Couldn't instantiate the backend tokenizer from one of: "
     97                 "(1) a `tokenizers` library serialization file, "
     98                 "(2) a slow tokenizer instance to convert or "

ValueError: Couldn't instantiate the backend tokenizer from one of: (1) a `tokenizers` library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
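
The last line of the traceback already points at the root cause. A quick way to confirm it in the same environment (my own diagnostic sketch, not part of the original report):

# If this import fails with ModuleNotFoundError, transformers cannot
# convert the slow T5 tokenizer into a fast one
import sentencepiece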
patrickvonplaten commented 3 years ago

Hey @rsanjaykamath,

I cannot reproduce the error on master. When running:

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "allenai/unifiedqa-t5-small" # you can specify the model size here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

I don't encounter any errors. Could you update transformers to the newest version and try again?
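
In a Colab notebook the upgrade would look roughly like this (a sketch, assuming a standard pip setup; restart the runtime afterwards so the new version is actually imported):

!pip install --upgrade transformers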

rsanjaykamath commented 3 years ago

Hi @patrickvonplaten ,

That's strange. I just tried it on Colab with transformers 4.2.2 and the same error occurred. Have you tried it on Colab or on a local machine?

patrickvonplaten commented 3 years ago

I see, it's the classic sentencepiece error - I should have read your error message more carefully ;-)

Here is a Colab showing that it works: https://colab.research.google.com/drive/1QybYdj-1bW0MHD0cutWBPWas5IFEhSjC?usp=sharing

patrickvonplaten commented 3 years ago

Also see: https://github.com/huggingface/transformers/issues/8963

rsanjaykamath commented 3 years ago

OK, got it. Installing sentencepiece and restarting the kernel did the trick for me.
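
For anyone else hitting this on Colab, the fix amounts to something like the following (a sketch; the kernel restart matters because modules already imported in the session won't see the newly installed package):

!pip install sentencepiece
# then restart the runtime and re-run the imports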

Thanks for your help :) Closing the issue.

NourEldin-Osama commented 1 year ago

I think the error message should be clearer.

trexanhvnn commented 9 months ago

> I see, it's the classic sentencepiece error - I should have read your error message more carefully ;-)
>
> Here is a Colab showing that it works: https://colab.research.google.com/drive/1QybYdj-1bW0MHD0cutWBPWas5IFEhSjC?usp=sharing


pb6192 commented 2 months ago

In case it helps someone: I got this error because a file in my Llama 3 model download was corrupted or missing. Downloading the model again fixed it.
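
One way to force a clean re-download instead of reusing a possibly corrupted cache (a sketch; force_download is a real from_pretrained option, but the model id here is only illustrative):

from transformers import AutoTokenizer

# force_download=True re-fetches the files rather than reading the local cache
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", force_download=True)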