huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

TypeError: __init__() got an unexpected keyword argument 'vocab_file' in transformers/tokenization_gpt2.py, line 380 #7916

Closed · memray closed this issue 3 years ago

memray commented 3 years ago

Environment info

Who can help: @mfuntowicz

Information

Model I am using (Bert, XLNet ...): RoBERTa-base

The problem arises when using:

The task I am working on is:

To reproduce

I use RobertaTokenizerFast and there seems to be an argument-name mismatch. Steps to reproduce the behavior (a self-contained sketch follows):

  1. self.tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', cache_dir=args.cache_dir)
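
For reference, here is a minimal, self-contained sketch of that step; the cache directory path is a hypothetical stand-in for `args.cache_dir` in the original code:

```python
# Minimal reproduction sketch; "/tmp/hf_cache" is a placeholder for args.cache_dir.
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained(
    "roberta-base",
    cache_dir="/tmp/hf_cache",
)
# With transformers v3.3.1 installed alongside tokenizers 0.9.x this raises:
# TypeError: __init__() got an unexpected keyword argument 'vocab_file'
```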

In `transformers.tokenization_gpt2.py` (L376) the call is:

```python
ByteLevelBPETokenizer(
    vocab_file=vocab_file,
    merges_file=merges_file,
    add_prefix_space=add_prefix_space,
    trim_offsets=trim_offsets,
)
```

But `tokenizers.implementations.ByteLevelBPETokenizer` expects that argument to be named `vocab`.
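
To make the mismatch concrete, a hedged sketch (assuming tokenizers 0.9.x, where the parameters are `vocab`/`merges` rather than `vocab_file`/`merges_file`); the file names are placeholders and the TypeError is raised before any file is opened:

```python
# Sketch of the mismatch under tokenizers 0.9.x (assumption: vocab_file was renamed to vocab).
from tokenizers.implementations import ByteLevelBPETokenizer

try:
    # Mirrors the call made in transformers v3.3.1's tokenization_gpt2.py:
    ByteLevelBPETokenizer(
        vocab_file="vocab.json",   # not accepted by tokenizers 0.9.x
        merges_file="merges.txt",
    )
except TypeError as err:
    print(err)  # __init__() got an unexpected keyword argument 'vocab_file'

# Under tokenizers 0.9.x the equivalent call would be:
# ByteLevelBPETokenizer(vocab="vocab.json", merges="merges.txt")
```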

Expected behavior

File "/zfs1/hdaqing/rum20/kp/fairseq-kpg/fairseq/data/encoders/hf_bpe.py", line 31, in __init__ self.tokenizer = RobertaTokenizerFast.from_pretrained(args.pretrained_model, cache_dir=args.cache_dir) File "/ihome/hdaqing/rum20/anaconda3/envs/kp/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1428, in from_pretrained return cls._from_pretrained(*inputs, **kwargs) File "/ihome/hdaqing/rum20/anaconda3/envs/kp/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1575, in _from_pretrained tokenizer = cls(*init_inputs, **init_kwargs) File "/ihome/hdaqing/rum20/anaconda3/envs/kp/lib/python3.7/site-packages/transformers/tokenization_roberta.py", line 380, in __init__ **kwargs, File "/ihome/hdaqing/rum20/anaconda3/envs/kp/lib/python3.7/site-packages/transformers/tokenization_gpt2.py", line 380, in __init__ trim_offsets=trim_offsets, TypeError: __init__() got an unexpected keyword argument 'vocab_file'

azamatolegen commented 3 years ago

same issue

LysandreJik commented 3 years ago

Hello! I think this is due to a mismatch between your transformers and tokenizers versions. transformers version v3.3.1 expects tokenizers == 0.8.1.rc2.

If you want to use tokenizers == 0.9.2, you should work on the current master branch or wait for version v3.4.0, which should be released sometime today.
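
For anyone checking whether their environment has this mismatch, a small diagnostic sketch (not part of the original thread):

```python
# Diagnostic sketch: print the installed versions to spot the mismatch.
# transformers v3.3.1 is paired with tokenizers 0.8.1.rc2; tokenizers 0.9.x
# needs transformers master / v3.4.0 or later.
import tokenizers
import transformers

print("transformers:", transformers.__version__)
print("tokenizers:  ", tokenizers.__version__)
```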

memray commented 3 years ago

Thank you! I upgraded both and it works.