huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

`fill-mask` pipeline cannot load tokenizer's `config.json` (fixed in 4.8.0) #12319

Closed · rspreafico-absci closed this issue 3 years ago

rspreafico-absci commented 3 years ago

Who can help

@sgugger @LysandreJik

Information

Model I am using: RoBERTa

To reproduce

Following the official notebook to train RoBERTa from scratch (tokenizer and model alike). The only addition is saving the RoBERTa tokenizer:

# BPE tokenizer previously trained with the tokenizers library, as per the docs;
# its vocab and merges are then loaded into transformers' RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("/path/to/BPE/tokenizer", return_special_tokens_mask=True, model_max_length=32)

tokenizer.save_pretrained("/path/to/roberta_tk")  # re-saving the tokenizer, now with the full set of tokenizer files

Saving outputs the following:

('/path/to/roberta_tk/tokenizer_config.json',
 '/path/to/roberta_tk/special_tokens_map.json',
 '/path/to/roberta_tk/vocab.json',
 '/path/to/roberta_tk/merges.txt',
 '/path/to/roberta_tk/added_tokens.json',
 '/path/to/roberta_tk/tokenizer.json')

Note that there is no config.json file, only tokenizer_config.json.
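
For context, the BPE tokenizer referenced above can be trained roughly as follows (a minimal sketch using the tokenizers library; the corpus path, vocabulary size, and output directory are hypothetical):

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a plain-text corpus (hypothetical path and sizes)
bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train(
    files=["corpus.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# save_model writes only vocab.json and merges.txt, which RobertaTokenizerFast can load
bpe_tokenizer.save_model("/path/to/BPE/tokenizer")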

Then try to load the tokenizer:

fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer="/path/to/roberta_tk"
)

This errors out, complaining that config.json is missing. Symlinking tokenizer_config.json to config.json solves the issue.
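
The workaround, spelled out as a sketch (assuming the hypothetical tokenizer directory from above):

import os

# Create config.json as an alias for tokenizer_config.json in the tokenizer directory
os.symlink("tokenizer_config.json", "/path/to/roberta_tk/config.json")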

Expected behavior

The file names produced by the tokenizer's save_pretrained should match what the pipeline expects to load.

sgugger commented 3 years ago

The config it asks for is the model config, not the tokenizer config. Not being able to load the tokenizer independently of the model was fixed recently, so you should try a source install.

rspreafico-absci commented 3 years ago

I will try with a source install. However, the error message says that the config.json file is missing from the path specified with the tokenizer parameter, not from the path specified with the model argument. My bad that I didn't report the full error message before; here it is:

OSError: Can't load config for '/nfs/home/rspreafico/workspace/models/v1/tokenizer/roberta'. Make sure that:

- '/nfs/home/rspreafico/workspace/models/v1/tokenizer/roberta' is a correct model identifier listed on 'https://huggingface.co/models'

- or '/nfs/home/rspreafico/workspace/models/v1/tokenizer/roberta' is the correct path to a directory containing a config.json file
sgugger commented 3 years ago

Yes, that was the bug: the tokenizer required the model to be saved in the same directory in order to be reloaded in a pipeline.

rspreafico-absci commented 3 years ago

Gotcha, thank you!

rspreafico-absci commented 3 years ago

I cloned the transformers repo as of 5 min ago and installed from source, but I am getting the same error message. transformers-cli env confirms that I am using the dev version of transformers:

- `transformers` version: 4.8.0.dev0
- Platform: Linux-5.4.0-74-generic-x86_64-with-glibc2.31
- Python version: 3.9.5
- PyTorch version (GPU?): 1.9.0+cu111 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
sgugger commented 3 years ago

I'm trying to reproduce but it all works fine on my end. Since I don't have your model and tokenizer, here is the code I execute:

from transformers import RobertaTokenizerFast, RobertaForMaskedLM, pipeline

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
tokenizer.save_pretrained("test-tokenizer") # Only the tokenizer files are saved here

model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.save_pretrained("test-model") # Only the model files are saved there

fill_mask = pipeline(
    "fill-mask",
    model="test-model",
    tokenizer="test-tokenizer",
)

fill_mask("My <mask> is Sylvain.")
rspreafico-absci commented 3 years ago

Ok, found it.

I was merely re-running fill_mask = pipeline(...) upon installing the dev version of transformers. This is insufficient to get rid of the error.

Instead, I needed to re-run the whole notebook, most crucially tokenizer.save_pretrained(...). In 4.8.0.dev0 this adds a field to tokenizer_config.json that is missing in 4.7.0, namely "tokenizer_class": "RobertaTokenizer". Without this field (either because the tokenizer was saved with 4.7.0 or because it was manually removed from a file generated with 4.8.0.dev0), the error message pops up.
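
For a tokenizer that was saved with 4.7.0, re-saving with 4.8.0 is the clean fix; as a minimal sketch, one could also patch the existing file directly (assuming, per the above, that only the tokenizer_class field is missing; the path is the hypothetical one from earlier):

import json

config_path = "/path/to/roberta_tk/tokenizer_config.json"

# Add the tokenizer_class field that transformers >= 4.8.0 writes on save
with open(config_path) as f:
    tokenizer_config = json.load(f)
tokenizer_config.setdefault("tokenizer_class", "RobertaTokenizer")
with open(config_path, "w") as f:
    json.dump(tokenizer_config, f, indent=2)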

Thanks for looking into this!

sgugger commented 3 years ago

Ah yes, good analysis!

LysandreJik commented 3 years ago

@rspreafico-absci FYI, there was an issue with the fill-mask pipeline's targets argument on master recently, so if you're running on a source installation I suggest updating it to a more recent version.

rspreafico-absci commented 3 years ago

Thanks @LysandreJik! I saw that the official 4.8.0 was released yesterday, so I have switched to the PyPI version. Can you confirm that 4.8.0 on PyPI is OK to use? Thank you.

LysandreJik commented 3 years ago

Version v4.8.0 on PyPI is indeed OK to use and should work perfectly well for the fill-mask pipeline. :)

rayenebech commented 3 years ago

In my program, the fill-mask pipeline requires the tokenizer_config.json file. However, when I run tokenizer.save_model I only get two files, vocab.json and merges.txt, for my own ByteLevelBPETokenizer. How can I automatically generate the tokenizer_config.json file?
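
(Per the reproduction earlier in this thread, one way is to load the vocab.json and merges.txt into a transformers fast tokenizer and re-save it; save_pretrained writes tokenizer_config.json. A minimal sketch with hypothetical paths:)

from transformers import RobertaTokenizerFast

# Load the vocab.json / merges.txt written by ByteLevelBPETokenizer.save_model
tokenizer = RobertaTokenizerFast.from_pretrained("/path/to/BPE/tokenizer")

# save_pretrained writes tokenizer_config.json (and tokenizer.json)
# alongside vocab.json and merges.txt
tokenizer.save_pretrained("/path/to/full_tokenizer")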

Ehsan1997 commented 2 years ago

For anyone stumbling here because their tokenizer only saved vocab.json and merges.txt: you need to load your tokenizer and pass the tokenizer object itself instead of a path.

# Pass the loaded tokenizer object directly, e.g. for the RoBERTa case above:
fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer=RobertaTokenizerFast.from_pretrained("/path/to/saved/files"),
)