huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 49: invalid start byte #8757

Closed: singhsidhukuldeep closed this issue 3 years ago

singhsidhukuldeep commented 3 years ago

Environment info

Who can help

I think: @patrickvonplaten @LysandreJik @VictorSanh

Anyone is welcome!

Information

I am using examples/language-modeling/run_mlm_wwm.py to train my own Tiny BERT model.

To reproduce

I am using Tiny BERT from Google (https://github.com/google-research/bert/blob/master/README.md) and examples/language-modeling/run_mlm_wwm.py from HuggingFace to train a language model on raw text.

The files in my google-bert-tiny directory are: bert_config.json, bert_model.ckpt.data-00000-of-00001, bert_model.ckpt.index, and vocab.txt

Steps to reproduce the behavior:

  1. Install transformers, torch, and TensorFlow using pip
  2. Get examples/language-modeling/run_mlm_wwm.py from HuggingFace > Transformers (link)
  3. Run the following command:
    python run_mlm_wwm.py \
    --model_name_or_path google-bert-tiny/bert_model.ckpt.index \
    --config_name google-bert-tiny/bert_config.json \
    --train_file train.txt \
    --validation_file val.txt \
    --do_train \
    --do_eval \
    --output_dir test-mlm-wwm \
    --cache_dir cache

Error:

Traceback (most recent call last):
  File "run_mlm_wwm.py", line 340, in <module>
    main()
  File "run_mlm_wwm.py", line 236, in main
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/site-packages/transformers/tokenization_auto.py", line 306, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/site-packages/transformers/configuration_auto.py", line 333, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/site-packages/transformers/configuration_utils.py", line 391, in get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/site-packages/transformers/configuration_utils.py", line 474, in _dict_from_json_file
    text = reader.read()
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 49: invalid start byte
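
For what it's worth, the loader seems to be parsing a binary file as UTF-8 JSON: with no separate tokenizer name given, AutoTokenizer falls back to model_name_or_path, so AutoConfig tries to read bert_model.ckpt.index as a config file. A minimal reproduction of the same failure, assuming the paths above:

import json

# bert_model.ckpt.index is a binary TensorFlow checkpoint file, not text,
# so reading it as UTF-8 fails on the first non-UTF-8 byte.
with open("google-bert-tiny/bert_model.ckpt.index", encoding="utf-8") as reader:
    config = json.load(reader)  # raises UnicodeDecodeError: invalid start byte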

Expected behavior

I want it to train.

LysandreJik commented 3 years ago

You should first convert your checkpoint to a Hugging Face checkpoint using the conversion script. You can check the docs here.
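
Roughly, it looks like this (a sketch, assuming the long-standing BERT conversion script lives under this module path in your version; note the TF checkpoint path is the ckpt prefix, without the .index suffix):

from transformers.convert_bert_original_tf_checkpoint_to_pytorch import (
    convert_tf_checkpoint_to_pytorch,
)

# Writes pytorch_model.bin next to the original TF checkpoint files.
convert_tf_checkpoint_to_pytorch(
    tf_checkpoint_path="google-bert-tiny/bert_model.ckpt",
    bert_config_file="google-bert-tiny/bert_config.json",
    pytorch_dump_path="google-bert-tiny/pytorch_model.bin",
)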

singhsidhukuldeep commented 3 years ago

Hi @LysandreJik, thank you so much for the response. After training I will get a PyTorch checkpoint, right? What is the procedure to get a TF checkpoint?

singhsidhukuldeep commented 3 years ago

> You should first convert your checkpoint to a Hugging Face checkpoint using the conversion script. You can check the docs here.

Hi @LysandreJik, I tried the above approach and converted it to a Hugging Face checkpoint.

Now when I run the command below:

python run_mlm_wwm.py \
    --model_name_or_path google-bert-tiny/pytorch_model.bin \
    --config_name google-bert-tiny/bert_config.json \
    --train_file train.txt \
    --validation_file val.txt \
    --do_train \
    --do_eval \
    --output_dir test-mlm-wwm \
    --cache_dir cache

I am getting this error:

Traceback (most recent call last):
  File "run_mlm_wwm.py", line 340, in <module>
    main()
  File "run_mlm_wwm.py", line 236, in main
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/site-packages/transformers/tokenization_auto.py", line 306, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/site-packages/transformers/configuration_auto.py", line 333, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/site-packages/transformers/configuration_utils.py", line 391, in get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/site-packages/transformers/configuration_utils.py", line 474, in _dict_from_json_file
    text = reader.read()
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

@thomwolf

LysandreJik commented 3 years ago

I believe the model_name_or_path should point to a directory containing both the configuration and model files, with their appropriate names (config.json, pytorch_model.bin).

directory 
    - config.json
    - pytorch_model.bin
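
As a quick sanity check that the layout resolves (a sketch; it assumes the tokenizer files such as vocab.txt sit in the same directory):

from transformers import AutoConfig, AutoModelForMaskedLM, AutoTokenizer

# If these three load, the training script should find the same files.
config = AutoConfig.from_pretrained("directory")
tokenizer = AutoTokenizer.from_pretrained("directory")
model = AutoModelForMaskedLM.from_pretrained("directory")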

Regarding your question about converting a model to TensorFlow: you can first convert your model to PyTorch and then load it in TensorFlow.

Let's say you saved the model in the directory directory:

from transformers import TFBertForPreTraining

# Load the PyTorch weights directly into the TensorFlow architecture.
tf_model = TFBertForPreTraining.from_pretrained("directory", from_pt=True)

You can then save it as any other TensorFlow model.
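
For instance, continuing the sketch above, save_pretrained writes out a TensorFlow checkpoint that can later be loaded without from_pt:

# Saves config.json and tf_model.h5 into the target directory.
tf_model.save_pretrained("directory-tf")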

singhsidhukuldeep commented 3 years ago

Hi @LysandreJik

After putting the config and model in one folder, I ran the conversion:

from transformers import convert_pytorch_checkpoint_to_tf2
convert_pytorch_checkpoint_to_tf2.convert_pt_checkpoint_to_tf(
    model_type = "bert", 
    pytorch_checkpoint_path="model/", 
    config_file="model/config.json", 
    tf_dump_path="TFmodel", 
    compare_with_pt_model=False, 
    use_cached_models=False
)

I am getting this error:

Loading PyTorch weights from /home/3551351/bert-mlm/model
Traceback (most recent call last):
  File "pt2tf.py", line 7, in <module>
    convert_pytorch_checkpoint_to_tf2.convert_pt_checkpoint_to_tf(
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/site-packages/transformers/convert_pytorch_checkpoint_to_tf2.py", line 283, in convert_pt_checkpoint_to_tf
    tf_model = load_pytorch_checkpoint_in_tf2_model(tf_model, pytorch_checkpoint_path)
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/site-packages/transformers/modeling_tf_pytorch_utils.py", line 93, in load_pytorch_checkpoint_in_tf2_model
    pt_state_dict = torch.load(pt_path, map_location="cpu")
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/site-packages/torch/serialization.py", line 581, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/3551351/.conda/envs/kuldeepVenv/lib/python3.8/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
IsADirectoryError: [Errno 21] Is a directory: '/home/3551351/bert-mlm/model'
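
For the record, torch.load() expects a file rather than a directory, so pointing pytorch_checkpoint_path at the checkpoint file itself should get past this particular error (a sketch; the .h5 output name is a guess):

from transformers import convert_pytorch_checkpoint_to_tf2

convert_pytorch_checkpoint_to_tf2.convert_pt_checkpoint_to_tf(
    model_type="bert",
    pytorch_checkpoint_path="model/pytorch_model.bin",  # the file, not the "model/" directory
    config_file="model/config.json",
    tf_dump_path="TFmodel/tf_model.h5",
    compare_with_pt_model=False,
    use_cached_models=False,
)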

LysandreJik commented 3 years ago

I'm sorry, I think you misunderstood me. I was saying that about the way you launch your script, not the way you do the conversion:

python run_mlm_wwm.py \
    --model_name_or_path google-bert-tiny \
    --config_name google-bert-tiny \
    --train_file train.txt \
    --validation_file val.txt \
    --do_train \
    --do_eval \
    --output_dir test-mlm-wwm \
    --cache_dir cache

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale and closed because it has not had recent activity. Thank you for your contributions.

If you think this still needs to be addressed please comment on this thread.