Closed: ratthachat closed this issue 4 years ago.
Have you tried launching a debugger to see exactly what takes so long? I would use VS Code remote debugging.
I would guess that your tokenization process takes too long. If you're training a new LM from scratch, I would recommend using the fast `tokenizers` library written in Rust. You can initialize a new `ByteLevelBPETokenizer` instance in your `LineByLineTextDataset` class and `encode_batch` your text with it.
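Something like this is what I have in mind, as a minimal sketch, assuming you have already trained a byte-level BPE model with the `tokenizers` package (the class name `FastLineByLineDataset` and the vocab/merges file arguments are placeholders, not part of the library):

```python
import torch
from torch.utils.data import Dataset
from tokenizers import ByteLevelBPETokenizer


class FastLineByLineDataset(Dataset):
    """Illustrative line-by-line dataset backed by the Rust tokenizers library."""

    def __init__(self, file_path, vocab_file, merges_file, block_size=512):
        # vocab_file / merges_file come from a ByteLevelBPETokenizer trained beforehand
        tokenizer = ByteLevelBPETokenizer(vocab_file, merges_file)
        tokenizer.enable_truncation(max_length=block_size)

        with open(file_path, encoding="utf-8") as f:
            lines = [line for line in f.read().splitlines() if line.strip()]

        # encode_batch tokenizes all lines in Rust, which is the fast part
        self.examples = [enc.ids for enc in tokenizer.encode_batch(lines)]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return torch.tensor(self.examples[i], dtype=torch.long)
```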
Thank you guys, I finally managed to fine-tune XLM-RoBERTa-Large, but had to wait for 11 hours before training started!
Since I did not want to train from scratch, I took a tip from @mfilipav and converted the pretrained tokenizer to a fast tokenizer (since it is SentencePiece, I had to use `sentencepiece_extractor.py`), then set `use_fast = True` in `run_language_modeling.py`... However, since it still took 11 hours of waiting, maybe this doesn't help.
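(For reference, the fast-tokenizer part boils down to something like the following; whether a fast version actually gets loaded depends on the installed transformers/tokenizers versions.)

```python
from transformers import AutoTokenizer

# use_fast=True requests the Rust-backed tokenizer when one is available for the checkpoint
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large", use_fast=True)
print(type(tokenizer))  # a "...Fast" tokenizer class means the fast implementation is in use
```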
UPDATE: by adding the `--line_by_line` option, training starts very quickly, so I'm closing the issue!
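For anyone hitting the same issue: as far as I can tell, `--line_by_line` simply switches which dataset class the script builds, roughly like this (a simplified sketch; the exact code depends on the transformers version):

```python
from transformers import LineByLineTextDataset, TextDataset


def get_dataset(args, tokenizer, file_path):
    if args.line_by_line:
        # Each non-empty line becomes one example; lines are batch-encoded in a single call
        return LineByLineTextDataset(
            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size
        )
    # Default path: tokenize the whole file, then cut it into block_size chunks
    return TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=args.block_size,
        overwrite_cache=args.overwrite_cache,
    )
```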
@ratthachat and how fast did it become after enabling `--line_by_line true`? I have been waiting for almost 1 hour. My training set is 11 GB, and here are my parameters:

```
export TRAIN_FILE=/hdd/sifat/NLP/intent_classification/bert_train.txt
export TEST_FILE=/hdd/sifat/NLP/intent_classification/data_corpus/test.txt

python examples/run_language_modeling.py \
    --output_dir ./bert_output \
    --model_type=bert \
    --model_name_or_path=bert-base-multilingual-cased \
    --mlm \
    --line_by_line true \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --learning_rate 1e-4 \
    --num_train_epochs 3 \
    --save_total_limit 2 \
    --save_steps 2000 \
    --per_gpu_train_batch_size 5 \
    --evaluate_during_training \
    --seed 42
```
Zaowad, your training file is much bigger than mine, so I guess 1 hour is not bad ;) You can also try the `--fp16` option as well.
Hi HuggingFace, thank you very much for your great contribution.
❓ Questions & Help
My problem is: `run_language_modeling.py` takes an abnormally long time for `xlm-roberta-large` and `xlm-roberta-base` **before training starts**. It got stuck at the following step for 7 hours (so I gave up eventually):

`transformers.data.datasets.language_modeling - Creating features from dataset file at ./`
I have successfully run `gpt2-large` and `distilbert-base-multilingual-cased` using exactly the same command below (only changing the model), and they start training within just 2-3 minutes. At first I thought it was because of the large size of XLM-RoBERTa. However, as `gpt2-large` has a similar size, is there perhaps a problem with fine-tuning XLM-RoBERTa (so maybe a bug in the current version)? I also tried rerunning the same command on another machine, but it got stuck the same way (which is not the case for `gpt2-large` and `distilbert-base-multilingual-cased`).

Update: the same thing happens with `xlm-roberta-base`.
Command details I used

Machine: AWS p3.2xlarge (V100, 64 GB RAM). Training file size is around 60 MB.
```
!python transformers/examples/run_language_modeling.py \
    --model_type=xlm-roberta \
    --model_name_or_path=xlm-roberta-large \
    --do_train \
    --mlm \
    --per_gpu_train_batch_size=1 \
    --gradient_accumulation_steps=8 \
    --train_data_file={TRAIN_FILE} \
    --num_train_epochs=2 \
    --block_size=225 \
    --output_dir=output_lm \
    --save_total_limit=1 \
    --save_steps=10000 \
    --cache_dir=output_lm \
    --overwrite_cache \
    --overwrite_output_dir
```