huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Multi GPU training crashes when running run_mlm_wwm.py #17033

Closed: conan1024hao closed this 2 years ago

conan1024hao commented 2 years ago

System Info

I am running this script on a cluster node with 8 A100 cards.

gcc/11.2.0
python/3.8/3.8.13
cuda/11.3/11.3.1
cudnn/8.2/8.2.4
nccl/2.9/2.9.9-1

accelerate         0.7.1
datasets           2.1.0
huggingface-hub    0.5.1
protobuf           3.20.1
sentencepiece      0.1.96
tokenizers         0.12.1
torch              1.11.0+cu113
torchaudio         0.11.0+cu113
torchvision        0.12.0+cu113
transformers       4.18.0

Who can help?

@wlhgtc Sorry to bother you again; please check this issue if you have time 🙏.

Information

Tasks

Reproduction

Dataset example

My dataset is Chinese, Japanese, and Korean Wikipedia, and I generate ref files not only for Chinese but for all whole words (see the sketch after the examples below).

mrph_train.txt
็ตฑไธ€ ็„ไธญ ่€… ็ต„ๅˆ
็ตฑไธ€ ็„ไธญ ่€… ็ต„ๅˆ ๏ผˆ ใจใ†ใ„ใค ใ”ใใกใ‚…ใ† ใ—ใ‚ƒใ ใฟใ‚ใ„ ๏ผ‰ ใฏ ใ€ ๆ—ฅๆœฌ ใฎ ๅˆ‘ๅ‹™ๆ‰€ ใซ ๅœจ็›ฃ ใ—ใฆ ใ„ใ‚‹ ๅ—ๅˆ‘ ่€… ใซ ใ‚ˆใฃใฆ ็ตๆˆ ใ• ใ‚ŒใŸ ็ต„็น” ใ€‚ ็พๅœจ ใ€ ๆ—ฅๆœฌ ใง ๅ”ฏไธ€ ใฎ ใ€Œ ๅ›šไบบ ็ต„ๅˆ ใ€ ็ต„็น” ใงใ‚ใ‚‹ ใ€‚
ๆฒฟ้ฉ ๏ผŽ
ๆ˜Žๆฒป ๆ™‚ไปฃ ไปฅ้™ ใ€ ๆ—ฅๆœฌ ใฎ ๅˆ‘ๅ‹™ๆ‰€ ใง ใฏ ๅ—ๅˆ‘ ่€… ่‡ช่บซ ใŒ ่กŒ ๅˆ‘ ใฎ ้‹ๅ–ถ ใซ ใ‚ใŸใ‚‹ ใ€Œ ๅ›šไบบ ่‡ชๆฒป ใ€ ใ‚’ ่ชใ‚ใฆ ใ„ ใชใ„ ใ€‚ ใ“ใ‚Œ ใฏ ๆฑŸๆˆธ ๆ™‚ไปฃ ใฎ ไผ้ฆฌ ็”บ ็‰ข ๅฑ‹ๆ•ท ใฎ ใ‚ˆใ†ใซ ๅ—ๅˆ‘ ่€… ใฎ ไปฃ่กจ ใงใ‚ใ‚‹ ็‰ข ๅไธป ใŒ ็‰ข็„ ใ‚’ ไป•ๅˆ‡ใ‚‹ ใ“ใจ ใง ใ€ ็ตๆžœ ใจ ใ—ใฆ ๅ—ๅˆ‘ ่€… ใฎ ๅ‡ฆ้‡ ใŒ ๅŠฃๆ‚ช ๅŒ– ใ—ใŸ ใ“ใจ ใซ ๅฏพใ™ใ‚‹ ๅ็œ ใ‹ใ‚‰ ๆฅใฆ ใ„ใ‚‹ ใ€‚
ref_train.txt
[2, 4, 7]
[2, 4, 7, 10, 11, 12, 14, 15, 16, 17, 19, 20, 22, 23, 28, 31, 32, 35, 37, 39, 41, 45, 46, 48, 51, 53, 56, 59, 62, 66, 68, 71, 73, 74]
[2]
[2, 4, 6, 9, 12, 13, 17, 20, 26, 29, 30, 33, 35, 39, 40, 43, 46, 49, 51, 54, 58, 61, 62, 64, 68, 70, 71, 74, 77, 80, 81, 83, 87, 90, 92, 96, 99, 102, 104, 107, 108, 110, 112, 114, 116]
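
For readers unfamiliar with the ref format: a rough sketch of how such indices can be produced is below. It assumes, based on the example above (where [2, 4, 7] marks 一, 中 and 合 in "統一 獄中 者 組合"), that each index is the position of a sub-token that continues a whole word, counting the [CLS] token as position 0. The helper name make_ref_ids and the multilingual checkpoint are purely illustrative, not part of run_mlm_wwm.py or run_chinese_ref.py.

from transformers import AutoTokenizer

def make_ref_ids(segmented_line, tokenizer):
    """Positions of sub-tokens that continue a whole word ([CLS] counts as 0)."""
    ref_ids = []
    position = 1  # position 0 is the [CLS] token
    for word in segmented_line.split():
        for i, _ in enumerate(tokenizer.tokenize(word)):
            if i > 0:  # not the first piece of this whole word
                ref_ids.append(position)
            position += 1
    return ref_ids

# Placeholder checkpoint (the issue uses a custom tokenizer.json); BERT-style
# tokenizers split CJK characters, so this should print [2, 4, 7] for the
# first example line above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(make_ref_ids("統一 獄中 者 組合", tokenizer))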

Command

torchrun --nproc_per_node 8 run_mlm_wwm.py \
    --model_type bert \
    --tokenizer_name tokenizer.json \
    --train_file mrph_train.txt \
    --validation_file mrph_test.txt \
    --train_ref_file ref_train.txt \
    --validation_ref_file ref_test.txt \
    --config_overrides="pad_token_id=2,hidden_size=512,num_attention_heads=8,num_hidden_layers=4" \
    --max_seq_length 128 \
    --fp16 \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 256 \
    --gradient_accumulation_steps 2 \
    --max_steps 500000 \
    --save_steps 1000 \
    --save_total_limit 5 \
    --do_train \
    --do_eval

Change in run_mlm_wwm.py

Expected behavior

Bug info

After loading the dataset, training should begin, but PyTorch crashes at this point.

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2380593 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2380595 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2380596 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2380597 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2380598 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2380599 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2380600 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 1 (pid: 2380594) of binary: /local/9884269.1.gpua/work/bin/python3
Traceback (most recent call last):
  File "/local/9884269.1.gpua/work/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/local/9884269.1.gpua/work/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/local/9884269.1.gpua/work/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/local/9884269.1.gpua/work/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/local/9884269.1.gpua/work/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/local/9884269.1.gpua/work/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Have tried

wlhgtc commented 2 years ago

@conan1024hao
Sorry, I don't know much about multi-GPU training, but you should make sure your code works well on a single GPU first. Then you could try something like this:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

python -m torch.distributed.launch --nproc_per_node 8 run_mlm_wwm.py \
    --model_type bert \
    --tokenizer_name tokenizer.json \
    --train_file mrph_train.txt \
    --validation_file mrph_test.txt \
    --train_ref_file ref_train.txt \
    --validation_ref_file ref_test.txt \
    --config_overrides="pad_token_id=2,hidden_size=512,num_attention_heads=8,num_hidden_layers=4" \
    --max_seq_length 128 \
    --fp16 \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 256 \
    --gradient_accumulation_steps 2 \
    --max_steps 500000 \
    --save_steps 1000 \
    --save_total_limit 5 \
    --do_train \
    --do_eval

conan1024hao commented 2 years ago

@wlhgtc Thank you for your advice. There is indeed some error output that is not printed in multi-GPU mode. However, even after making sure the script runs on a single GPU, this error still occurs. I will keep this issue open until there is a solution.

conan1024hao commented 2 years ago

@wlhgtc An update: I found that the multi-GPU run crashes inside add_chinese_references(). I ran the whole script successfully after making the dataset much smaller. A temporary workaround is to preprocess and save the tokenized dataset locally on CPU, and then start multi-GPU training.
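
For reference, a minimal sketch of that workaround using only public datasets/transformers APIs. The file names are the ones from this issue; the tokenization arguments and the "chinese_ref" column handling are illustrative, not a copy of run_mlm_wwm.py's internals.

import json
from datasets import load_dataset
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

# Load the whitespace-segmented text and the per-line ref indices.
raw = load_dataset("text", data_files={"train": "mrph_train.txt"})["train"]
with open("ref_train.txt", encoding="utf-8") as f:
    refs = [json.loads(line) for line in f if line.strip()]

def tokenize_function(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Do the expensive work once, on CPU, in a single process.
tokenized = raw.map(tokenize_function, batched=True, remove_columns=["text"])
tokenized = tokenized.add_column("chinese_ref", refs)
tokenized.save_to_disk("tokenized_train")
# The multi-GPU job then only needs datasets.load_from_disk("tokenized_train").

run_mlm_wwm.py would still need a small edit to load the saved dataset instead of tokenizing again, which is the "preprocess on CPU first" part of the workaround.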

wlhgtc commented 2 years ago

Yeah, I met the same problem. The "add_column" operation needs a huge amount of memory; it is related to this issue in datasets. There are two ways around it:

  1. Preprocess the ref files and merge all the info ("input_ids", ..., "chinese_ref") into a JSON file, so you avoid re-tokenizing the dataset every time.
  2. Use datasets' set_transform(tokenize_function) to lazily tokenize your dataset (see the sketch after this list).
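
A minimal sketch of option 2, assuming a datasets.Dataset named raw_dataset with a "text" column and an already-loaded tokenizer (set_transform is a Dataset method; the "chinese_ref" column would still have to be supplied separately, for example via the merged JSON file from option 1):

def tokenize_function(batch):
    # Applied lazily on small batches whenever rows are accessed.
    return tokenizer(batch["text"], truncation=True, max_length=128)

raw_dataset.set_transform(tokenize_function)
# raw_dataset[0] now yields input_ids etc. computed on the fly, so no full
# tokenized copy of the corpus has to sit in memory.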

Hope this helps.