huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Cannot load reformer-enwik8 tokenizer #4492

Closed: erickrf closed this issue 4 years ago

erickrf commented 4 years ago

🐛 Bug

Information

Model I am using (Bert, XLNet ...): Reformer tokenizer

To reproduce

Steps to reproduce the behavior:

  1. Try to load the pretrained reformer-enwik8 tokenizer with AutoTokenizer.from_pretrained("google/reformer-enwik8")

This is the error I got:

OSError                                   Traceback (most recent call last)
<ipython-input-51-ab9a64363cc0> in <module>
----> 1 AutoTokenizer.from_pretrained("google/reformer-enwik8")

~/.virtualenvs/sparseref/lib/python3.7/site-packages/transformers-2.9.0-py3.7.egg/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    198                     return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    199                 else:
--> 200                     return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    201 
    202         raise ValueError(

~/.virtualenvs/sparseref/lib/python3.7/site-packages/transformers-2.9.0-py3.7.egg/transformers/tokenization_utils.py in from_pretrained(cls, *inputs, **kwargs)
    896 
    897         """
--> 898         return cls._from_pretrained(*inputs, **kwargs)
    899 
    900     @classmethod

~/.virtualenvs/sparseref/lib/python3.7/site-packages/transformers-2.9.0-py3.7.egg/transformers/tokenization_utils.py in _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1001                     ", ".join(s3_models),
   1002                     pretrained_model_name_or_path,
-> 1003                     list(cls.vocab_files_names.values()),
   1004                 )
   1005             )

OSError: Model name 'google/reformer-enwik8' was not found in tokenizers model name list (google/reformer-crime-and-punishment). We assumed 'google/reformer-enwik8' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.

I tried with and without the google/ prefix, with the same result. However, the download progress bar did appear. Loading the crime-and-punishment Reformer tokenizer works fine.

BramVanroy commented 4 years ago

Hi. This is not a bug but expected behaviour: since the model works at the character level, a tokenizer is not required. The model card explains how to encode/decode your data.
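For reference, the model card describes a character-level scheme in which each byte of the input is shifted by 2, reserving the lowest IDs for padding/special tokens. A minimal pure-Python sketch of that encode/decode round trip (the helper names `encode`/`decode` and the +2 offset follow the model card; this is an illustration of the scheme, not a library API):

```python
# Character-level encoding for reformer-enwik8: no tokenizer object is needed.
# Each UTF-8 byte is mapped to (byte value + 2); IDs below 2 are reserved,
# e.g. for padding. This mirrors the scheme shown in the model card.

def encode(text, pad_token_id=0):
    """Map a string to a list of token IDs, one per byte."""
    return [b + 2 for b in text.encode("utf-8")]

def decode(ids):
    """Map token IDs back to a string; reserved IDs (< 2) are dropped."""
    return bytes(i - 2 for i in ids if i >= 2).decode("utf-8", errors="ignore")

round_trip = decode(encode("Hello, enwik8!"))
print(round_trip)  # Hello, enwik8!
```

In the real setup the ID lists are padded to a common length and wrapped in tensors together with an attention mask before being fed to the model, as shown in the model card.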

bratao commented 4 years ago

@erickrf Can you share how you trained the Reformer model? I'm trying to use "google/reformer-enwik8" to train a Portuguese model, but I get the same error: Model name 'google/reformer-enwik8' was not found in tokenizers.

BramVanroy commented 4 years ago

@bratao I answered this in my comment above: open the link that I posted and scroll down. It explains how to do the tokenisation. There is no need to load a tokenizer as usual.

LeopoldACC commented 3 years ago

@BramVanroy

My command is below:

python examples/seq2seq/finetune_trainer.py --model_name_or_path google/reformer-enwik8 --do_train --do_eval --task translation_en_to_de --data_dir /lustre/dataset/wmt17_en_de/  --output_dir /home2/zhenggo1/checkpoint/reformer_translation --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate

and the error is below. What is the reason? Thanks!

Traceback (most recent call last):
  File "examples/seq2seq/finetune_trainer.py", line 367, in <module>
    main()
  File "examples/seq2seq/finetune_trainer.py", line 206, in main
    cache_dir=model_args.cache_dir,
  File "/home2/zhenggo1/LowPrecisionInferenceTool/examples/pytorch/huggingface_transformers/src/transformers/models/auto/tokenization_auto.py", line 385, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home2/zhenggo1/LowPrecisionInferenceTool/examples/pytorch/huggingface_transformers/src/transformers/tokenization_utils_base.py", line 1760, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'google/reformer-enwik8'. Make sure that:

- 'google/reformer-enwik8' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'google/reformer-enwik8' is the correct path to a directory containing relevant tokenizer files
BramVanroy commented 3 years ago

@LeopoldACC Please post a new issue so that someone can have a look.