huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

xlm-roberta (large/base): run_language_modeling.py cannot start training #3919

Closed ratthachat closed 4 years ago

ratthachat commented 4 years ago

Hi HuggingFace, thank you very much for your great contribution.

❓ Questions & Help

My problem is that run_language_modeling.py takes an abnormally long time for xlm-roberta-large and xlm-roberta-base **before training starts**. It got stuck at the following step for 7 hours (so I eventually gave up):

transformers.data.datasets.language_modeling - Creating features from dataset file at ./

I have successfully run gpt2-large and distilbert-base-multilingual-cased using exactly the same command below (only changing the model), and both start training within just 2-3 minutes. At first I thought it was because of the large size of XLM-RoBERTa; however, since gpt2-large is of a similar size, is there perhaps a problem with fine-tuning XLM-RoBERTa (i.e. maybe a bug in the current version)?

I also tried rerunning the same command on another machine, but it got stuck in the same way (which is not the case for gpt2-large or distilbert-base-multilingual-cased).

Update: the same thing happens with xlm-roberta-base.

Command details

Machine: AWS p3.2xlarge (V100, 64 GB RAM). Training file size is around 60 MB.

```
!python transformers/examples/run_language_modeling.py \
    --model_type=xlm-roberta \
    --model_name_or_path=xlm-roberta-large \
    --do_train \
    --mlm \
    --per_gpu_train_batch_size=1 \
    --gradient_accumulation_steps=8 \
    --train_data_file={TRAIN_FILE} \
    --num_train_epochs=2 \
    --block_size=225 \
    --output_dir=output_lm \
    --save_total_limit=1 \
    --save_steps=10000 \
    --cache_dir=output_lm \
    --overwrite_cache \
    --overwrite_output_dir
```

julien-c commented 4 years ago

Have you tried launching a debugger to see exactly what takes a long time?

I would use VS Code remote debugging.
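
For instance, here is a minimal sketch of attaching a remote debugger, assuming the `debugpy` package (which VS Code's Python remote debugging attaches to); the port number is arbitrary. Dropping something like this near the top of the script lets you attach from VS Code and step through the dataset-creation code:

```python
# Hypothetical sketch: pause the script until a VS Code debugger attaches,
# so you can step through dataset creation and see where the time goes.
import debugpy

debugpy.listen(("0.0.0.0", 5678))   # arbitrary port; must be reachable on the remote machine
print("Waiting for debugger to attach...")
debugpy.wait_for_client()
```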

mfilipav commented 4 years ago

I would guess that your tokenization process takes too long. If you're training a new LM from scratch, I would recommend using the fast Tokenizers library written in Rust. You can initialize a new ByteLevelBPETokenizer instance in your LineByLineTextDataset class and encode_batch your text with it.
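
A minimal sketch of that suggestion, assuming the standalone `tokenizers` package is installed; the file path and hyperparameters below are illustrative, not taken from this thread:

```python
# Sketch only: train a fast byte-level BPE tokenizer and batch-encode the corpus.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["train.txt"],   # assumed path to the raw training text
    vocab_size=50_000,     # illustrative hyperparameters
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# encode_batch tokenizes all lines in one call, which is far faster than
# feeding them one by one to a slow Python tokenizer.
with open("train.txt", encoding="utf-8") as f:
    lines = [line for line in f.read().splitlines() if line.strip()]
encodings = tokenizer.encode_batch(lines)
input_ids = [enc.ids for enc in encodings]
```

The resulting `input_ids` can then be stored as the dataset's examples instead of re-tokenizing inside the dataset class.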

ratthachat commented 4 years ago

Thank you guys, I finally managed to fine-tune XLM-RoBERTa-Large, but I had to wait 11 hours before training started!

Since I did not want to train from scratch, I took @mfilipav's tip and converted the pretrained tokenizer to a fast tokenizer (and since it is SentencePiece-based, I had to use sentencepiece_extractor.py), then set use_fast = True in run_language_modeling.py ... However, since it still took 11 hours of waiting, maybe this doesn't help.

UPDATE: by adding the --line_by_line option, training starts very quickly, so I'm closing the issue!
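
For context on why that flag helps: --line_by_line makes the example script build one example per line with a single batched encode call, instead of first tokenizing the entire file as one huge string and then chunking it into block_size pieces. The sketch below is a simplified approximation of the two code paths, not the exact source of run_language_modeling.py, and the class names are illustrative:

```python
# Simplified approximation of the two dataset code paths (class names are illustrative).
from torch.utils.data import Dataset


class FullFileDataset(Dataset):
    """Default path: tokenize the whole file as one string, then chunk into blocks.
    With a slow Python/SentencePiece tokenizer, that single tokenize call can stall for hours."""

    def __init__(self, tokenizer, file_path, block_size):
        with open(file_path, encoding="utf-8") as f:
            text = f.read()
        ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
        self.examples = [
            tokenizer.build_inputs_with_special_tokens(ids[i : i + block_size])
            for i in range(0, len(ids) - block_size + 1, block_size)
        ]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return self.examples[i]


class LineByLineDataset(Dataset):
    """--line_by_line path: one batched encode over the lines, one example per line."""

    def __init__(self, tokenizer, file_path, block_size):
        with open(file_path, encoding="utf-8") as f:
            lines = [l for l in f.read().splitlines() if l.strip()]
        self.examples = tokenizer.batch_encode_plus(
            lines, add_special_tokens=True, max_length=block_size
        )["input_ids"]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return self.examples[i]
```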

zaowad commented 4 years ago

@ratthachat, how fast did it become after enabling "--line_by_line true"? I have been waiting for almost 1 hour. My training set is 11 GB, and here are my parameters:

```
export TRAIN_FILE=/hdd/sifat/NLP/intent_classification/bert_train.txt
export TEST_FILE=/hdd/sifat/NLP/intent_classification/data_corpus/test.txt

python examples/run_language_modeling.py \
    --output_dir ./bert_output \
    --model_type=bert \
    --model_name_or_path=bert-base-multilingual-cased \
    --mlm \
    --line_by_line true \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --learning_rate 1e-4 \
    --num_train_epochs 3 \
    --save_total_limit 2 \
    --save_steps 2000 \
    --per_gpu_train_batch_size 5 \
    --evaluate_during_training \
    --seed 42
```

ratthachat commented 4 years ago

Zaowad, your training file is much bigger than mine, so I guess 1 hour is not bad ;) You can also try the --fp16 option.