huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Non-JSON-serializable tokenizer config with `save_pretrained` #10108

Closed: vinbo8 closed this issue 3 years ago

vinbo8 commented 3 years ago

Environment info

Who can help

@LysandreJik

Information

Model I am using (Bert, XLNet ...): Bert

The problem arises when using:

A minimal example that loads and saves a tokenizer.

The task I am working on is:

Again, this is just a minimal example.

To reproduce

Steps to reproduce the behavior:

  1. Instantiate a BertConfig and a BertTokenizer based on the config.
  2. Try to save the tokenizer with `save_pretrained`.

Minimal example:

from transformers import BertConfig, BertTokenizer

config = BertConfig.from_pretrained("./configs/bert-small.json", cache_dir=".")
# Passing the model config here is what triggers the bug: it ends up in the
# tokenizer's init_kwargs and later in tokenizer_config.json on save.
tokenizer = BertTokenizer.from_pretrained("vocab/", cache_dir=".", config=config)
tokenizer.save_pretrained("new_save")

Error:

Traceback (most recent call last):
  File "test.py", line 5, in <module>
    tokenizer.save_pretrained('new_save')
  File "/cluster/envs/mult/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1979, in save_pretrained
    f.write(json.dumps(tokenizer_config, ensure_ascii=False))
  File "/cluster/envs/mult/lib/python3.7/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/cluster/envs/mult/lib/python3.7/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/cluster/envs/mult/lib/python3.7/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/cluster/envs/mult/lib/python3.7/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type BertConfig is not JSON serializable
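
For reference, the TypeError itself is stock json behavior: json.dumps has no encoder for arbitrary Python objects, while the config class ships its own serializers. A minimal standalone sketch of both:

import json
from transformers import BertConfig

config = BertConfig()  # a default config is enough to demonstrate

# json.dumps cannot encode an arbitrary object...
try:
    json.dumps({"config": config})
except TypeError as err:
    print(err)  # Object of type BertConfig is not JSON serializable

# ...while PretrainedConfig provides its own JSON serialization:
print(config.to_json_string())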

Expected behavior

The tokenizer should be saveable. I'm guessing this happens because the code that writes the tokenizer config uses the json library directly instead of calling `to_json_file` on the BertConfig, but I'm not sure.
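
Until that is fixed, one possible user-side workaround is sketched below. It assumes the stray config was captured in the tokenizer's init_kwargs, an internal attribute that save_pretrained copies into tokenizer_config.json, so it may break across versions:

from transformers import BertConfig, BertTokenizer

config = BertConfig.from_pretrained("./configs/bert-small.json", cache_dir=".")
tokenizer = BertTokenizer.from_pretrained("vocab/", cache_dir=".", config=config)

# Workaround sketch: save_pretrained serializes a copy of init_kwargs into
# tokenizer_config.json, so drop the non-JSON-serializable BertConfig first.
tokenizer.init_kwargs.pop("config", None)  # init_kwargs is internal, not a public API
tokenizer.save_pretrained("new_save")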

patil-suraj commented 3 years ago

Hi @vin-ivar

The tokenizer does not need the model config, so there is no need to pass it when initializing the tokenizer. A minimal corrected version of the repro, using the same local paths, is shown below.
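
from transformers import BertTokenizer

# The tokenizer is built from its vocabulary files alone; the model
# config belongs to the model, so it is not passed here.
tokenizer = BertTokenizer.from_pretrained("vocab/", cache_dir=".")
tokenizer.save_pretrained("new_save")  # now serializes cleanly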

vinbo8 commented 3 years ago

That fixes it. I was using an older script and hadn't taken that bit out.