huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Cannot use custom roberta tokenizer with run_mlm_wwm.py #10720

Closed · avacaondata closed this issue 3 years ago

avacaondata commented 3 years ago

Environment info

Who can help

@patrickvonplaten @LysandreJik

Information

When I load the BPE tokenizer trained with huggingface/tokenizers directly with RobertaTokenizer, it works:


tok = RobertaTokenizer.from_pretrained("bpe_tokenizer_0903", use_fast=True)

However, when I use the same tokenizer to train a language model with run_mlm_wwm.py, it fails:

python -u  transformers/examples/language-modeling/run_mlm_wwm.py \
    --model_type deberta \
    --config_name ./bpe_tokenizer_0903/config.json \
    --tokenizer_name ./bpe_tokenizer_0903 \
    --train_file ./prueba_tr.txt \
    --validation_file ./final_valid.txt  \
    --output_dir ./roberta_1102 \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --evaluation_strategy steps \
    --per_device_train_batch_size 1  \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2  \
    --learning_rate 6e-4  \
    --save_steps 10  \
    --logging_steps 10 \
    --overwrite_cache \
    --max_seq_length 128 \
    --eval_accumulation_steps 10 \
    --load_best_model_at_end \
    --run_name deberta_0902 \
    --save_total_limit 10 \
    --warmup_steps 1750 \
    --adam_beta2 0.98 \
    --adam_epsilon 1e-6 \
    --weight_decay 0.01 \
    --num_train_epochs 1

The error message is the following:

Traceback (most recent call last):
  File "transformers/examples/language-modeling/run_mlm_wwm.py", line 399, in <module>
    main()
  File "transformers/examples/language-modeling/run_mlm_wwm.py", line 286, in main
    use_fast=model_args.use_fast_tokenizer,
  File "/home/alejandro.vaca/data_rigoberta/transformers/src/transformers/models/auto/tokenization_auto.py", line 401, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/alejandro.vaca/data_rigoberta/transformers/src/transformers/tokenization_utils_base.py", line 1719, in from_pretrained
    resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
  File "/home/alejandro.vaca/data_rigoberta/transformers/src/transformers/tokenization_utils_base.py", line 1790, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/alejandro.vaca/data_rigoberta/transformers/src/transformers/models/roberta/tokenization_roberta_fast.py", line 173, in __init__
    **kwargs,
  File "/home/alejandro.vaca/data_rigoberta/transformers/src/transformers/models/gpt2/tokenization_gpt2_fast.py", line 145, in __init__
    **kwargs,
  File "/home/alejandro.vaca/data_rigoberta/transformers/src/transformers/tokenization_utils_fast.py", line 87, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum ModelWrapper at line 1 column 1138661

Why does loading the tokenizer with RobertaTokenizer.from_pretrained() work, but running run_mlm_wwm.py with the same tokenizer fail? @sgugger @patrickvonplaten @LysandreJik
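
For reference, a minimal way to narrow this down (my own diagnostic sketch, assuming the trained files live in ./bpe_tokenizer_0903): the slow RobertaTokenizer rebuilds the tokenizer from vocab.json and merges.txt, whereas the path shown in the traceback deserializes tokenizer.json with the tokenizers library. Loading that file directly repeats the failing step in isolation:

# Diagnostic sketch (assumption: the trained files live in ./bpe_tokenizer_0903).
# This repeats the step that raises in the traceback above: deserializing
# tokenizer.json with the tokenizers library.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("bpe_tokenizer_0903/tokenizer.json")
print(tok.get_vocab_size())

If this raises the same "untagged enum ModelWrapper" error, it is often a sign that tokenizer.json was written by a newer tokenizers version than the one installed alongside transformers.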

sgugger commented 3 years ago

That example only runs with BERT, which is why it has been moved to a separate research project.
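
For context, the whole-word-masking collation in that script groups sub-tokens by the BERT-style "##" continuation prefix, which a byte-level BPE tokenizer like the one trained here never emits. A minimal illustration, assuming bert-base-uncased and roberta-base can be downloaded:

# Illustration only (assumes bert-base-uncased and roberta-base are available):
# WordPiece marks word continuations with "##", which whole-word masking groups on;
# byte-level BPE produces no such marker.
from transformers import BertTokenizer, RobertaTokenizer

bert = BertTokenizer.from_pretrained("bert-base-uncased")
roberta = RobertaTokenizer.from_pretrained("roberta-base")
print(bert.tokenize("tokenization"))     # e.g. ['token', '##ization']
print(roberta.tokenize("tokenization"))  # e.g. ['token', 'ization'] (no "##" marker)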

avacaondata commented 3 years ago

I tried this script with ALBERT and it worked. Which script should I use to train a RoBERTa model from scratch with whole-word masking?
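
As a hedged aside rather than an official answer: RoBERTa itself was pretrained with standard token-level masking, not whole-word masking, so the plain run_mlm.py example (which uses DataCollatorForLanguageModeling) works with a byte-level BPE tokenizer because it does not depend on "##" prefixes. A minimal sketch, using roberta-base as a stand-in for the custom tokenizer:

# Sketch of token-level MLM collation (roberta-base stands in for the custom
# BPE tokenizer); this is the masking scheme used by run_mlm.py.
from transformers import DataCollatorForLanguageModeling, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
features = [tokenizer("Una frase de prueba.", truncation=True, max_length=128)]
batch = collator(features)
print(batch["input_ids"].shape, batch["labels"].shape)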

cronoik commented 3 years ago

Is --model_type deberta intended? @alexvaca0

avacaondata commented 3 years ago

Sorry, that was left over from the previous launch script; it is roberta now. @cronoik

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.