rspreafico-absci closed this issue 3 years ago
The config it asks for is the model config, not the tokenizer config. The fact that the tokenizer could not be loaded independently of the model has been fixed recently, so you should try a source install.
I will try with a source install; however, the error message says that the `config.json` file is missing from the file path specified with the `tokenizer` parameter, not from the file path specified with the `model` argument. My bad that I didn't report the full error message before, here it is:
```
OSError: Can't load config for '/nfs/home/rspreafico/workspace/models/v1/tokenizer/roberta'. Make sure that:

- '/nfs/home/rspreafico/workspace/models/v1/tokenizer/roberta' is a correct model identifier listed on 'https://huggingface.co/models'

- or '/nfs/home/rspreafico/workspace/models/v1/tokenizer/roberta' is the correct path to a directory containing a config.json file
```
Yes, that was the bug: the tokenizer required the model to be saved in the same directory in order to be reloaded in a pipeline.
Gotcha, thank you!
I cloned the `transformers` repo as of 5 min ago and installed from source, but I am getting the same error message. `transformers-cli env` confirms that I am using the dev version of `transformers`:
- `transformers` version: 4.8.0.dev0
- Platform: Linux-5.4.0-74-generic-x86_64-with-glibc2.31
- Python version: 3.9.5
- PyTorch version (GPU?): 1.9.0+cu111 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
I'm trying to reproduce but it all works fine on my end. Since I don't have your model and tokenizer, here is the code I execute:
```python
from transformers import RobertaTokenizerFast, RobertaForMaskedLM, pipeline

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
tokenizer.save_pretrained("test-tokenizer")  # Only the tokenizer files are saved here

model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.save_pretrained("test-model")  # Only the model files are saved there

fill_mask = pipeline(
    "fill-mask",
    model="test-model",
    tokenizer="test-tokenizer",
)
fill_mask("My <mask> is Sylvain.")
```
Ok, found it. I was merely re-running `fill_mask = pipeline(...)` after installing the dev version of `transformers`, which is insufficient to get rid of the error. Instead, I needed to re-run the whole notebook, most crucially `tokenizer.save_pretrained(...)`. In `4.8.0.dev0` this adds a field to `tokenizer_config.json` which is missing in `4.7.0`, namely `"tokenizer_class": "RobertaTokenizer"`. Without this field (either because the tokenizer was saved with `4.7.0` or because one manually removes it from a file generated with `4.8.0.dev0`), the error message pops up.
Thanks for looking into this!
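To illustrate the difference between the two versions, here is a minimal stdlib-only sketch that patches a 4.7.0-style `tokenizer_config.json` by hand, adding the field that `4.8.0.dev0`'s `save_pretrained` writes. The directory and the pre-existing keys are made up for the example:

```python
import json
import os
import tempfile

# Stand-in for a tokenizer directory saved with transformers 4.7.0
tok_dir = tempfile.mkdtemp()
path = os.path.join(tok_dir, "tokenizer_config.json")

# A 4.7.0-style config lacks the "tokenizer_class" key
with open(path, "w") as f:
    json.dump({"model_max_length": 512}, f)

# Add the key that 4.8.0.dev0 writes, so the pipeline can resolve
# which tokenizer class to instantiate from the directory alone
with open(path) as f:
    config = json.load(f)
config["tokenizer_class"] = "RobertaTokenizer"
with open(path, "w") as f:
    json.dump(config, f, indent=2)

print(json.load(open(path))["tokenizer_class"])  # RobertaTokenizer
```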
Ah yes, good analysis!
@rspreafico-absci FYI there was an issue with the fill-mask pipeline's `targets` argument on `master` recently, so if you're running on a source installation I suggest updating to a more recent version.
Thanks @LysandreJik ! I saw that the official 4.8.0 was released yesterday, so I switched to using the PyPI version now. Can you confirm that 4.8.0 on PyPI is ok to use? Thank you.
Version v4.8.0 on PyPI is indeed ok to use and should work perfectly well for the fill-mask pipeline. :)
In my program the `fill-mask` pipeline requires the tokenizer_config.json file. However, when I run `tokenizer.save_model` I only get 2 files, vocab.json and merges.txt, for my own `ByteLevelBPETokenizer`. How can I generate the tokenizer_config.json file automatically?
For anyone stumbling here because their tokenizer only saved vocab.json and merges.txt: you need to load your tokenizer and pass the object instead of the path.

```
pipeline(args..., tokenizer=TokenizerClass.from_pretrained('path_to_saved_files'))
```
Environment info

- `transformers` version: 4.7.0

Who can help

@sgugger @LysandreJik

Information

Model I am using: RoBERTa

The problem arises when using:

The tasks I am working on is:

To reproduce

Following the official notebook to train RoBERTa from scratch (tokenizer and model alike). The only addition is saving the RoBERTa tokenizer.

Saving outputs the following:

Note that there is no `config.json` file, only `tokenizer_config.json`.

Then try to load the tokenizer:

Errors out, complaining that `config.json` is missing. Symlinking `tokenizer_config.json` to `config.json` solves the issue.

Expected behavior
File name match between tokenizer save output and pipeline input.