huggingface / transformers


Cannot train model from scratch using `run_mlm.py`. #8590

Closed: GuillemGSubies closed this issue 3 years ago

GuillemGSubies commented 3 years ago

Looks like the trainer does not like getting a None: when we train from scratch, a None ends up in this if and the script crashes:

https://github.com/huggingface/transformers/blob/a6cf9ca00b74a8b2244421a6101b83d8cf43cd6b/examples/language-modeling/run_mlm.py#L357

I worked around it by deleting that line, but I guess that could affect other use cases.
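
For reference, a minimal sketch of the guard I would expect instead of deleting the line, assuming the failing check is an os.path.isdir call on model_args.model_name_or_path (the exact code at that line may differ):

import os

model_name_or_path = None  # what the script gets when training from scratch

# os.path.isdir(None) raises TypeError, so guard the None case explicitly:
model_path = (
    model_name_or_path
    if model_name_or_path is not None and os.path.isdir(model_name_or_path)
    else None
)
print(model_path)  # None -> no checkpoint directory, so train from scratch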

To reproduce, call run_mlm.py like this (there is probably a simpler way to reproduce it, but this should be enough):

python run_mlm.py \
    --model_type bert \
    --train_file ./data/oscar_1000.txt \
    --validation_file ./data/oscar_1000_valid.txt \
    --output_dir testing_model \
    --tokenizer_name bert-base-spanish-wwm-cased  \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --evaluation_strategy steps \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --max_steps 500 \
    --save_steps 2000 \
    --save_total_limit 15 \
    --overwrite_cache \
    --max_seq_length 512 \
    --eval_accumulation_steps 10 \
    --logging_steps 1000

I don't think the exact dataset I'm using is relevant; any plain-text corpus should do.
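
Since the contents don't matter, here is a throwaway snippet to create placeholder files at the paths used above (the filenames just match my command; this helper is not part of the repo):

import os

os.makedirs("data", exist_ok=True)
for name in ("oscar_1000.txt", "oscar_1000_valid.txt"):
    with open(os.path.join("data", name), "w", encoding="utf-8") as f:
        # Any plain text works; repeat a line so the tokenizer has input.
        f.write("Este es un texto de ejemplo para el corpus.\n" * 200)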

@sgugger

sgugger commented 3 years ago

Mmm, that is weird, as None is the default for that argument. Will investigate when I'm finished with the v4 work. Thanks for flagging!
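
For what it's worth, if the failing check is indeed an os.path.isdir call on a None, the default never comes into play: the TypeError is raised while evaluating the argument expression, before train is even called. Rough sketch:

import os

model_name_or_path = None  # training from scratch

try:
    # The crash happens here, while computing the argument, not inside
    # Trainer.train itself (whose model_path already defaults to None).
    model_path = model_name_or_path if os.path.isdir(model_name_or_path) else None
except TypeError as err:
    print(err)  # e.g. "stat: path should be string, bytes, os.PathLike or integer, not NoneType"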