huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Multi GPU training crashes when running run_mlm_wwm.py #17033

Closed: conan1024hao closed this 2 years ago

conan1024hao commented 2 years ago

System Info

I am running this script on a cluster node with 8 A100 cards.

gcc/11.2.0
python/3.8/3.8.13
cuda/11.3/11.3.1
cudnn/8.2/8.2.4
nccl/2.9/2.9.9-1

accelerate         0.7.1
datasets           2.1.0
huggingface-hub    0.5.1
protobuf           3.20.1
sentencepiece      0.1.96
tokenizers         0.12.1
torch              1.11.0+cu113
torchaudio         0.11.0+cu113
torchvision        0.12.0+cu113
transformers       4.18.0

Who can help?

@wlhgtc Sorry to bother you again; please check this issue if you have time 🙏.

Information

Tasks

Reproduction

Dataset example

My dataset is Chinese, Japanese, and Korean Wikipedia, and I generate ref files not only for Chinese but for all whole words (see the sketch after the examples below).

mrph_train.txt
็ตฑไธ€ ็„ไธญ ่€… ็ต„ๅˆ
็ตฑไธ€ ็„ไธญ ่€… ็ต„ๅˆ ๏ผˆ ใจใ†ใ„ใค ใ”ใใกใ‚…ใ† ใ—ใ‚ƒใ ใฟใ‚ใ„ ๏ผ‰ ใฏ ใ€ ๆ—ฅๆœฌ ใฎ ๅˆ‘ๅ‹™ๆ‰€ ใซ ๅœจ็›ฃ ใ—ใฆ ใ„ใ‚‹ ๅ—ๅˆ‘ ่€… ใซ ใ‚ˆใฃใฆ ็ตๆˆ ใ• ใ‚ŒใŸ ็ต„็น” ใ€‚ ็พๅœจ ใ€ ๆ—ฅๆœฌ ใง ๅ”ฏไธ€ ใฎ ใ€Œ ๅ›šไบบ ็ต„ๅˆ ใ€ ็ต„็น” ใงใ‚ใ‚‹ ใ€‚
ๆฒฟ้ฉ ๏ผŽ
ๆ˜Žๆฒป ๆ™‚ไปฃ ไปฅ้™ ใ€ ๆ—ฅๆœฌ ใฎ ๅˆ‘ๅ‹™ๆ‰€ ใง ใฏ ๅ—ๅˆ‘ ่€… ่‡ช่บซ ใŒ ่กŒ ๅˆ‘ ใฎ ้‹ๅ–ถ ใซ ใ‚ใŸใ‚‹ ใ€Œ ๅ›šไบบ ่‡ชๆฒป ใ€ ใ‚’ ่ชใ‚ใฆ ใ„ ใชใ„ ใ€‚ ใ“ใ‚Œ ใฏ ๆฑŸๆˆธ ๆ™‚ไปฃ ใฎ ไผ้ฆฌ ็”บ ็‰ข ๅฑ‹ๆ•ท ใฎ ใ‚ˆใ†ใซ ๅ—ๅˆ‘ ่€… ใฎ ไปฃ่กจ ใงใ‚ใ‚‹ ็‰ข ๅไธป ใŒ ็‰ข็„ ใ‚’ ไป•ๅˆ‡ใ‚‹ ใ“ใจ ใง ใ€ ็ตๆžœ ใจ ใ—ใฆ ๅ—ๅˆ‘ ่€… ใฎ ๅ‡ฆ้‡ ใŒ ๅŠฃๆ‚ช ๅŒ– ใ—ใŸ ใ“ใจ ใซ ๅฏพใ™ใ‚‹ ๅ็œ ใ‹ใ‚‰ ๆฅใฆ ใ„ใ‚‹ ใ€‚
ref_train.txt
[2, 4, 7]
[2, 4, 7, 10, 11, 12, 14, 15, 16, 17, 19, 20, 22, 23, 28, 31, 32, 35, 37, 39, 41, 45, 46, 48, 51, 53, 56, 59, 62, 66, 68, 71, 73, 74]
[2]
[2, 4, 6, 9, 12, 13, 17, 20, 26, 29, 30, 33, 35, 39, 40, 43, 46, 49, 51, 54, 58, 61, 62, 64, 68, 70, 71, 74, 77, 80, 81, 83, 87, 90, 92, 96, 99, 102, 104, 107, 108, 110, 112, 114, 116]
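
For readers unfamiliar with the ref format: a rough sketch of how such indices can be produced is below. It assumes, based on the example above (where [2, 4, 7] marks 一, 中 and 合 in "統一 獄中 者 組合"), that each index is the position of a sub-token that continues a whole word, counting the [CLS] token as position 0. The helper name make_ref_ids and the multilingual checkpoint are purely illustrative, not part of run_mlm_wwm.py or run_chinese_ref.py.

from transformers import AutoTokenizer

def make_ref_ids(segmented_line, tokenizer):
    """Positions of sub-tokens that continue a whole word ([CLS] counts as 0)."""
    ref_ids = []
    position = 1  # position 0 is the [CLS] token
    for word in segmented_line.split():
        for i, _ in enumerate(tokenizer.tokenize(word)):
            if i > 0:  # not the first piece of this whole word
                ref_ids.append(position)
            position += 1
    return ref_ids

# Placeholder checkpoint (the issue uses a custom tokenizer.json); BERT-style
# tokenizers split CJK characters, so this should print [2, 4, 7] for the
# first example line above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(make_ref_ids("統一 獄中 者 組合", tokenizer))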

Command

torchrun --nproc_per_node 8 run_mlm_wwm.py \
    --model_type bert \
    --tokenizer_name tokenizer.json \
    --train_file mrph_train.txt \
    --validation_file mrph_test.txt \
    --train_ref_file ref_train.txt \
    --validation_ref_file ref_test.txt \
    --config_overrides="pad_token_id=2,hidden_size=512,num_attention_heads=8,num_hidden_layers=4" \
    --max_seq_length 128 \
    --fp16 \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 256 \
    --gradient_accumulation_steps 2 \
    --max_steps 500000 \
    --save_steps 1000 \
    --save_total_limit 5 \
    --do_train \
    --do_eval

Change in run_mlm_wwm.py

Expected behavior

Bug info

After loading the dataset, training should begin, but PyTorch crashes at this point.

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2380593 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2380595 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2380596 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2380597 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2380598 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2380599 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2380600 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 1 (pid: 2380594) of binary: /local/9884269.1.gpua/work/bin/python3
Traceback (most recent call last):
  File "/local/9884269.1.gpua/work/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/local/9884269.1.gpua/work/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/local/9884269.1.gpua/work/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/local/9884269.1.gpua/work/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/local/9884269.1.gpua/work/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/local/9884269.1.gpua/work/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Have tried

wlhgtc commented 2 years ago

@conan1024hao
Sorry, I don't know much about multi-GPU training, but you should make sure your code works well on a single GPU first. Then you could try something like this:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

python -m torch.distributed.launch --nproc_per_node 8 run_mlm_wwm.py \
    --model_type bert \
    --tokenizer_name tokenizer.json \
    --train_file mrph_train.txt \
    --validation_file mrph_test.txt \
    --train_ref_file ref_train.txt \
    --validation_ref_file ref_test.txt \
    --config_overrides="pad_token_id=2,hidden_size=512,num_attention_heads=8,num_hidden_layers=4" \
    --max_seq_length 128 \
    --fp16 \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 256 \
    --gradient_accumulation_steps 2 \
    --max_steps 500000 \
    --save_steps 1000 \
    --save_total_limit 5 \
    --do_train \
    --do_eval

conan1024hao commented 2 years ago

@wlhgtc Thank you for your advice. There is indeed some error output that is not printed in multi-GPU mode. However, even after making sure the script runs on a single GPU, this error still occurs. I will keep this issue open until there is a solution.

conan1024hao commented 2 years ago

@wlhgtc An update: I found that the multi-GPU run crashes inside add_chinese_references(). I ran the whole script successfully after making the dataset much smaller. A temporary workaround is to preprocess and save the tokenized dataset locally on CPU, and then start multi-GPU training.
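
For reference, a minimal sketch of that workaround using only public datasets/transformers APIs. The file names are the ones from this issue; the tokenization arguments and the "chinese_ref" column handling are illustrative, not a copy of run_mlm_wwm.py's internals.

import json
from datasets import load_dataset
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

# Load the whitespace-segmented text and the per-line ref indices.
raw = load_dataset("text", data_files={"train": "mrph_train.txt"})["train"]
with open("ref_train.txt", encoding="utf-8") as f:
    refs = [json.loads(line) for line in f if line.strip()]

def tokenize_function(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Do the expensive work once, on CPU, in a single process.
tokenized = raw.map(tokenize_function, batched=True, remove_columns=["text"])
tokenized = tokenized.add_column("chinese_ref", refs)
tokenized.save_to_disk("tokenized_train")
# The multi-GPU job then only needs datasets.load_from_disk("tokenized_train").

run_mlm_wwm.py would still need a small edit to load the saved dataset instead of tokenizing again, which is the "preprocess on CPU first" part of the workaround.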

wlhgtc commented 2 years ago

Yeah, I met the same problem. The "add_column" operation needs a huge amount of memory; it is related to this issue in datasets. There are two ways around it:

  1. Preprocess the ref files and merge all the info ("input_ids", ..., "chinese_ref") into a JSON file, so you avoid re-tokenizing the dataset every time.
  2. Use datasets' set_transform(tokenize_function) to lazily tokenize your dataset (see the sketch after this list).
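
A minimal sketch of option 2, assuming a datasets.Dataset named raw_dataset with a "text" column and an already-loaded tokenizer (set_transform is a Dataset method; the "chinese_ref" column would still have to be supplied separately, for example via the merged JSON file from option 1):

def tokenize_function(batch):
    # Applied lazily on small batches whenever rows are accessed.
    return tokenizer(batch["text"], truncation=True, max_length=128)

raw_dataset.set_transform(tokenize_function)
# raw_dataset[0] now yields input_ids etc. computed on the fly, so no full
# tokenized copy of the corpus has to sit in memory.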

Hope this helps.