philschmid / deep-learning-pytorch-huggingface

Error (return code -7) when fine-tuning FLAN-T5-xxl on 8x A100 #7

Open scofield7419 opened 1 year ago

scofield7419 commented 1 year ago

Hi Philipp @philschmid ,

Thank you for the wonderful tutorial. Over the past few days I've been using your code to fine-tune Flan-T5-xxl on my own corpus with DeepSpeed, as you show here. I'm using the same setup as described in your code, i.e. 8x A100 GPUs, 100 CPU cores, and 500 GB of memory, and I've prepared my datasets and have everything ready.
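For reference, the DeepSpeed config I point the script at is a ZeRO stage-3 setup with CPU offload, roughly along these lines (a sketch built from standard DeepSpeed keys, not an exact copy of the file in your repo, and the filename is made up):

```python
import json

# Rough sketch of a ZeRO-3 config with CPU offload for optimizer states and
# parameters; the "auto" values are resolved at runtime by the Hugging Face
# Trainer's DeepSpeed integration.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

# Hypothetical filename; the training script is then pointed at this file.
with open("ds_flan_t5_z3_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```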

Almost everything works, but I'm one step away from success: when I run `deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py`, everything seems fine at first, but as soon as the 8 GPUs start working (the data is loaded onto them), the process is immediately terminated (killed).

$ deepspeed --num_gpus=8 run_seq2seq_deepspeed.py
[2023-02-24 18:10:18,983] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-24 18:10:19,049] [INFO] [runner.py:548:main] cmd = /home/aiops/xxxx/.miniconda3/envs/torch-py39/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_seq2seq_deepspeed.py
[2023-02-24 18:10:22,043] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-02-24 18:10:22,043] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-02-24 18:10:22,043] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-02-24 18:10:22,043] [INFO] [launch.py:162:main] dist_world_size=8
[2023-02-24 18:10:22,043] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:34<00:00,  6.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:37<00:00,  7.49s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:40<00:00,  8.11s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:41<00:00,  8.37s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:43<00:00,  8.60s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:43<00:00,  8.74s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:44<00:00,  8.95s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 5/5 [00:45<00:00,  9.10s/it]
[2023-02-24 18:12:41,354] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using cuda_amp half precision backend
[2023-02-24 18:12:42,235] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-24 18:13:10,284] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17786
[2023-02-24 18:13:10,284] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17787
[2023-02-24 18:13:10,286] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17788
[2023-02-24 18:13:10,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17789
[2023-02-24 18:13:10,620] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17790
[2023-02-24 18:13:10,953] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17791
[2023-02-24 18:13:10,954] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17792
[2023-02-24 18:13:11,287] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17793
[2023-02-24 18:13:11,901] [ERROR] [launch.py:324:sigkill_handler] ['/home/aiops/xxxx/.miniconda3/envs/torch-py39/bin/python', '-u', 'run_seq2seq_deepspeed.py', '--local_rank=7'] exits with return code = -7

I'm pretty sure all the environment requirements are satisfied as you note (batch size 8), but I'm not sure where the problem is. I don't understand what return code -7 (the last line above) indicates, even after searching the whole internet, and I'm not very familiar with the inner workings of GPU parallelism. From your point of view, what are the most likely causes?
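(Side note on the return code itself: a negative return code from the launcher just means the worker process was killed by that signal number, so -7 corresponds to signal 7, which on Linux is SIGBUS and often points at host or shared memory rather than the GPUs. A minimal way to check the mapping, assuming a Linux host:)

```python
import signal

# A negative subprocess return code is the negated number of the signal that
# killed the process; on Linux, signal 7 resolves to SIGBUS.
print(signal.Signals(7).name)  # -> SIGBUS on Linux
```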

scofield7419 commented 1 year ago

I've also opened an issue with more details on the DeepSpeed repo: https://github.com/microsoft/DeepSpeed/issues/2897.

allanj commented 1 year ago

Most likely your CPU memory is not enough.
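A quick way to check is to watch host RAM and /dev/shm while the eight workers load the checkpoint shards, for example with psutil (a rough sketch; psutil is assumed to be installed):

```python
import psutil

# Snapshot of host memory headroom; run this while the checkpoint shards are
# loading and the ZeRO-3 CPU-offload buffers are being allocated.
vm = psutil.virtual_memory()
shm = psutil.disk_usage("/dev/shm")  # shared-memory size matters in containers

print(f"RAM       total={vm.total / 1e9:7.0f} GB   available={vm.available / 1e9:7.0f} GB")
print(f"/dev/shm  total={shm.total / 1e9:7.0f} GB   free={shm.free / 1e9:7.0f} GB")
```

If available RAM or /dev/shm drops to near zero right before the crash, it is the host running out of memory rather than the GPUs.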

luxuantao commented 1 year ago

I have the same problem.