Closed. conan1024hao closed this issue 2 years ago.
@conan1024hao
Sorry, I don't know much about the details of multi-GPU training, but you should first make sure your code works well on a single GPU.
Then you could try a command like this:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node 8 run_mlm_wwm.py \
--model_type bert \
--tokenizer_name tokenizer.json \
--train_file mrph_train.txt \
--validation_file mrph_test.txt \
--train_ref_file ref_train.txt \
--validation_ref_file ref_test.txt \
--config_overrides="pad_token_id=2,hidden_size=512,num_attention_heads=8,num_hidden_layers=4" \
--max_seq_length 128 \
--fp16 \
--per_device_train_batch_size 256 \
--per_device_eval_batch_size 256 \
--gradient_accumulation_steps 2 \
--max_steps 500000 \
--save_steps 1000 \
--save_total_limit 5 \
--do_train \
--do_eval \
@wlhgtc Thank you for your advice. There is indeed some error output that does not get printed in multi-GPU mode. However, after making sure the script runs on a single GPU, this error still occurs. I will keep this issue open until a solution is found.
@wlhgtc An update. I found that multi-GPU training crashes when running add_chinese_references(). I ran the whole script successfully after making the dataset much smaller. A temporary solution is to preprocess and save the tokenized dataset locally on CPU, and then start multi-GPU training from the saved dataset; a sketch of this is shown below.
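As a rough, untested sketch of that workaround (the tokenizer name, file names, and output directory below are placeholders, not the exact setup from this issue), the offline preprocessing step could look like this:

# preprocess.py - run once on CPU before launching the multi-GPU job
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # placeholder; the issue uses a local tokenizer.json

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

raw = load_dataset("text", data_files={"train": "mrph_train.txt", "validation": "mrph_test.txt"})
tokenized = raw.map(tokenize_function, batched=True, remove_columns=["text"])
# Any ref-file merging (e.g. the add_chinese_references() step) would also be done
# here, offline, so it never runs inside the distributed training job.
tokenized.save_to_disk("tokenized_wiki")  # hypothetical output directory

The training script would then only call load_from_disk("tokenized_wiki") in every process instead of tokenizing the corpus again.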
Yeah, I met the same problem. The add_column operation needs huge memory; it is related to an existing issue in datasets (this one).
There are two ways: tokenize and save the dataset offline (as you did), or use datasets.set_transform(tokenize_function) to lazily load your dataset. Hope it could help.
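A minimal sketch of the set_transform idea (again with a placeholder tokenizer and file name, and a simplified tokenize_function compared to the one in run_mlm_wwm.py):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # placeholder tokenizer
raw_datasets = load_dataset("text", data_files={"train": "mrph_train.txt"})

def tokenize_function(examples):
    # Called lazily on each batch the DataLoader requests, so the full corpus
    # is never tokenized (or copied into new columns) up front.
    return tokenizer(examples["text"], truncation=True, max_length=128)

# Unlike dataset.map(...), which materializes the tokenized columns,
# set_transform applies tokenize_function on the fly at access time.
raw_datasets["train"].set_transform(tokenize_function)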
System Info
Who can help?
@wlhgtc Sorry to bother you again, please check this issue if you have time.
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Dataset example
My dataset is the Chinese, Japanese, and Korean Wikipedia. I generate ref files not only for Chinese but for all whole words.
Command
Change in run_mlm_wwm.py to
Expected behavior
Have tried gloo for torch's backend instead of nccl.
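For reference, "gloo instead of nccl" refers to the process-group backend. run_mlm_wwm.py normally lets the Trainer initialize the process group, so this plain-PyTorch sketch is only an illustration of what that switch means:

import torch.distributed as dist

# torch.distributed.launch sets RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT,
# so init_method="env://" picks them up; swapping "nccl" for "gloo" changes
# only the communication backend, not the launch command.
dist.init_process_group(backend="gloo", init_method="env://")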