huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Non-JSON-serializable tokenizer config with `save_pretrained` #10108

Closed: vinbo8 closed this issue 3 years ago

vinbo8 commented 3 years ago

Environment info

Who can help

@LysandreJik

Information

Model I am using (Bert, XLNet ...): Bert

The problem arises when using:

A minimal example that loads and saves a tokenizer.

The task I am working on is:

Again, this is just a minimal example.

To reproduce

Steps to reproduce the behavior:

  1. Instantiate a BertConfig and a BertTokenizer based on the config.
  2. Try to save the tokenizer with `save_pretrained`.

Minimal example:

from transformers import BertConfig, BertTokenizer

config = BertConfig.from_pretrained("./configs/bert-small.json", cache_dir=".")
# Passing the model config here is what triggers the bug: it ends up in the
# tokenizer's init_kwargs and later in tokenizer_config.json on save.
tokenizer = BertTokenizer.from_pretrained("vocab/", cache_dir=".", config=config)
tokenizer.save_pretrained("new_save")

Error:

Traceback (most recent call last):
  File "test.py", line 5, in <module>
    tokenizer.save_pretrained('new_save')
  File "/cluster/envs/mult/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1979, in save_pretrained
    f.write(json.dumps(tokenizer_config, ensure_ascii=False))
  File "/cluster/envs/mult/lib/python3.7/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/cluster/envs/mult/lib/python3.7/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/cluster/envs/mult/lib/python3.7/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/cluster/envs/mult/lib/python3.7/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type BertConfig is not JSON serializable
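
For reference, the TypeError itself is stock json behavior: json.dumps has no encoder for arbitrary Python objects, while the config class ships its own serializers. A minimal standalone sketch of both:

import json
from transformers import BertConfig

config = BertConfig()  # a default config is enough to demonstrate

# json.dumps cannot encode an arbitrary object...
try:
    json.dumps({"config": config})
except TypeError as err:
    print(err)  # Object of type BertConfig is not JSON serializable

# ...while PretrainedConfig provides its own JSON serialization:
print(config.to_json_string())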

Expected behavior

The tokenizer should be saveable. I'm guessing this happens because the code that writes the tokenizer config uses the json library directly instead of calling `to_json_file` on the BertConfig, but I'm not sure.
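
Until that is fixed, one possible user-side workaround is sketched below. It assumes the stray config was captured in the tokenizer's init_kwargs, an internal attribute that save_pretrained copies into tokenizer_config.json, so it may break across versions:

from transformers import BertConfig, BertTokenizer

config = BertConfig.from_pretrained("./configs/bert-small.json", cache_dir=".")
tokenizer = BertTokenizer.from_pretrained("vocab/", cache_dir=".", config=config)

# Workaround sketch: save_pretrained serializes a copy of init_kwargs into
# tokenizer_config.json, so drop the non-JSON-serializable BertConfig first.
tokenizer.init_kwargs.pop("config", None)  # init_kwargs is internal, not a public API
tokenizer.save_pretrained("new_save")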

patil-suraj commented 3 years ago

Hi @vin-ivar

The tokenizer does not need the model config, so there is no need to pass it when initializing the tokenizer. A minimal corrected version of the repro, using the same local paths, is shown below.
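
from transformers import BertTokenizer

# The tokenizer is built from its vocabulary files alone; the model
# config belongs to the model, so it is not passed here.
tokenizer = BertTokenizer.from_pretrained("vocab/", cache_dir=".")
tokenizer.save_pretrained("new_save")  # now serializes cleanly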

vinbo8 commented 3 years ago

That fixes it. I was using an older script and hadn't taken that bit out.