tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 #245

Open seyyedaliayati opened 1 year ago

seyyedaliayati commented 1 year ago

I am trying to re-train alpaca on the following machine:


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   24C    P0    25W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   25C    P0    24W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   25C    P0    26W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   25C    P0    25W /  70W |      2MiB / 15360MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here is my command to start training:

#!/bin/bash

CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node=4 --master_port=9292 train.py \
    --model_name_or_path ./models/llama-7b \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir alpaca_out \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_offload_opt_param.json" \
    --tf32 False

But I got the following errors:

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41388 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41389 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 41390 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 41387) of binary: /home/ubuntu/ali/venv/bin/python3
Traceback (most recent call last):
  File "/home/ubuntu/ali/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/ali/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/ali/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/ubuntu/ali/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/ubuntu/ali/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/ali/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
train.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-25_18:25:09
  host      : *********************
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 41387)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 41387
========================================================

Could you please help me?

EnzoDeg40 commented 1 year ago

I don't know if this is related, but I was using CUDA 11.6 and got warnings telling me to use version 11.7; since upgrading, I have also been getting this error message.

codemaster17611 commented 1 year ago

I'm hitting the same problem.

seyyedaliayati commented 1 year ago

I don't know if this is related, but I was using CUDA 11.6 and got warnings telling me to use version 11.7; since upgrading, I have also been getting this error message.

Was it resolved with CUDA 11.7? @EnzoDeg40

EnzoDeg40 commented 1 year ago

I don't know if this is related, but I was using CUDA 11.6 and got warnings telling me to use version 11.7; since upgrading, I have also been getting this error message.

Was it resolved with CUDA 11.7? @EnzoDeg40

With CUDA 11.6 it didn't work (I can't remember the exact error) and I got a warning advising me to use version 11.7. Since upgrading to 11.7, I get this exit code -9 error, so I was wondering whether the CUDA version has anything to do with it.

qwjaskzxl commented 1 year ago

Maybe it's because of CPU OOM?

seyyedaliayati commented 1 year ago

Maybe it's because of CPU OOM?

Can you provide more details please?

qwjaskzxl commented 1 year ago

Maybe it's because of CPU OOM?

Can you provide more details please?

I watched my RAM reach 256GB/256GB, and then the error occurred.
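
Exit code -9 means the worker was killed with SIGKILL, which is what the Linux OOM killer sends when host RAM runs out, so that observation fits. A rough way to confirm it (a sketch, assuming a Linux host where you can read the kernel log; paths and permissions may differ on your machine):

# Look for OOM-killer entries right after the crash; the PID it reports
# should match the one in the torchrun error (41387 in the original post).
sudo dmesg -T | grep -i -E "out of memory|killed process" | tail -n 20

# Watch host RAM while training to see whether it climbs toward the limit
# (refreshes every 5 seconds).
free -h -s 5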

seyyedaliayati commented 1 year ago

Maybe it's because of CPU OOM?

Can you provide more details please?

I watched my RAM reach 256GB/256GB, and then the error occurred.

Oops! Do you know how much RAM is required?

seyyedaliayati commented 1 year ago

I have upgraded my GPUs to A100s with 40GB of memory, but I still have the same issue :( Could you please help me? @lxuechen @rtaori

freesouls commented 1 year ago

Same problem, I don't know how to fix it.

huawei-lin commented 1 year ago

Same problem, I don't know how to fix it.

It may be caused by running out of RAM or GPU memory. I hit the same problem with 100GB RAM and one A100 40GB GPU, but it was fixed by running with 300GB RAM and four A100 40GB GPUs.
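
A very rough back-of-the-envelope estimate of why host RAM matters here (an assumption-laden sketch, not an exact requirement): the default_offload_opt_param.json config used in this repo is a ZeRO-3 setup that, as the name suggests, offloads parameters and Adam optimizer state to the CPU, and mixed-precision Adam keeps on the order of 16 bytes of state per parameter, so a 7B model alone wants roughly 100GB of host RAM before activations, the dataset, and checkpoint loading are counted:

# Rough host-RAM estimate for ZeRO-3 with CPU offload.
# Assumption: ~16 bytes/param (fp16 weights + fp32 master weights
# + Adam momentum and variance), all of it ending up in host RAM.
PARAMS=7000000000        # approximate parameter count of LLaMA-7B
BYTES_PER_PARAM=16
echo "approx offloaded state: $(( PARAMS * BYTES_PER_PARAM / 1024 / 1024 / 1024 )) GiB"
# Prints ~104 GiB, which matches 100GB of RAM failing and 256-300GB working.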

huawei-lin commented 1 year ago

I have upgraded my GPUs to A100s with 40GB of memory, but I still have the same issue :( Could you please help me? @lxuechen @rtaori

In my case, four A100 40GB GPUs helped, but it still ran into GPU OOM at the 75th iteration.

freesouls commented 1 year ago

Same problem, I don't know how to fix it.

I solved it by upgrading from Python 3.7 to Python 3.9.

Minxiangliu commented 1 year ago

Same problem, I don't know how to fix it.

I solved it by upgrading from Python 3.7 to Python 3.9.

What is your hardware configuration? I'm encountering this issue even with Python 3.10.

codemaster17611 commented 1 year ago

I solved this problem by adding the config; with 4 × V100 32GB GPUs and 328GB of RAM, I can now run the 13B model.

Minxiangliu commented 1 year ago

Hi @codemaster17611, I am currently using one A100 GPU (40GB). While running the fine-tuning program, I continuously run free -h and nvidia-smi to monitor memory usage, and I noticed there is very little memory consumption. Is this normal?

Can you help me?

It crashed when there was still 281GB of free RAM available, and GPU memory utilization was almost zero.

log:

INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : FastChat/fastchat/train/train_mem.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:20001
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_xd6ijrh1/none_wczn4hm_
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:20001.
[I socket.cpp:787] [c10d] The client socket has connected to [localhost]:20001 on [localhost]:49826.
[I socket.cpp:787] [c10d] The client socket has connected to [localhost]:20001 on [localhost]:49828.
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=20001
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_xd6ijrh1/none_wczn4hm_/attempt_0/0/error.json
[I socket.cpp:787] [c10d] The client socket has connected to [localhost]:20001 on [localhost]:49832.
[I socket.cpp:787] [c10d] The client socket has connected to [localhost]:20001 on [localhost]:49834.
[I ProcessGroupNCCL.cpp:665] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:842] [Rank 0] NCCL watchdog thread started!
/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
Loading checkpoint shards:   0%|                                          | 0/2 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 12003) of binary: /root/miniconda3/envs/vicuna/bin/python
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0039861202239990234 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 0 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
  File "/root/miniconda3/envs/vicuna/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
FastChat/fastchat/train/train_mem.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-16_08:23:55
  host      : mx-69977d7b58-zrz6r
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 12003)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 12003
======================================================

codemaster17611 commented 1 year ago

Are you using DeepSpeed?

@Minxiangliu

Minxiangliu commented 1 year ago

Are you using DeepSpeed?

@Minxiangliu

I am not using DeepSpeed. Here are the commands I am running.

TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO LOGLEVEL=INFO \
torchrun --nproc_per_node=1 --master_port=20001 FastChat/fastchat/train/train_mem.py \
    --model_name_or_path /raid/minxiang83/Program/vicuna/llama-7b  \
    --data_path /raid/minxiang83/Program/vicuna/datasets/dummy.json \
    --bf16 True \
    --output_dir finetune_output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

codemaster17611 commented 1 year ago

Are you using DeepSpeed?

@Minxiangliu

I am not using DeepSpeed. Here are the commands I am running.

TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO LOGLEVEL=INFO \
torchrun --nproc_per_node=1 --master_port=20001 FastChat/fastchat/train/train_mem.py \
    --model_name_or_path /raid/minxiang83/Program/vicuna/llama-7b  \
    --data_path /raid/minxiang83/Program/vicuna/datasets/dummy.json \
    --bf16 True \
    --output_dir finetune_output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

Then 40GB of GPU memory is not enough... I tried my 4 × 32GB V100 setup without DeepSpeed and it went straight to OOM.

Minxiangliu commented 1 year ago

Are you using DeepSpeed?

@Minxiangliu

I am not using DeepSpeed. Here are the commands I am running.

TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO LOGLEVEL=INFO \
torchrun --nproc_per_node=1 --master_port=20001 FastChat/fastchat/train/train_mem.py \
    --model_name_or_path /raid/minxiang83/Program/vicuna/llama-7b  \
    --data_path /raid/minxiang83/Program/vicuna/datasets/dummy.json \
    --bf16 True \
    --output_dir finetune_output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

Then 40GB of GPU memory is not enough... I tried my 4 × 32GB V100 setup without DeepSpeed and it went straight to OOM.

So, would you recommend using the DeepSpeed approach for training?

LebronXierunfeng commented 1 year ago

I also have 4 × 32GB V100s but can't get it to run. Is 128GB of RAM not enough?

Minxiangliu commented 1 year ago

I also have 4 × 32GB V100s but can't get it to run. Is 128GB of RAM not enough?

I ended up using the following configuration to complete the fine-tuning process. https://github.com/lm-sys/FastChat/issues/1200#issuecomment-1556866764

LebronXierunfeng commented 1 year ago

I also have 4 × 32GB V100s but can't get it to run. Is 128GB of RAM not enough?

I ended up using the following configuration to complete the fine-tuning process. lm-sys/FastChat#1200 (comment)

Thx !

aqppe commented 1 year ago

I got the same error, but then I noticed empty records in my JSON data file; fixing those solved the problem for me.
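
If you want to check your own data file for that, here is a minimal sketch (assumptions: the usual alpaca_data.json layout with instruction/input/output fields, jq installed, and "empty" meaning an empty instruction or output string; adjust the checks to whatever "empty" means in your data):

# Count records with an empty instruction or output.
jq '[.[] | select(.instruction == "" or .output == "")] | length' alpaca_data.json

# Write a cleaned copy with those records dropped.
jq '[.[] | select(.instruction != "" and .output != "")]' alpaca_data.json > alpaca_data_clean.json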

lonngxiang commented 6 months ago

Same here.

kkk935208447 commented 6 months ago

This is probably caused by insufficient GPU or CPU RAM. I ran into this problem at first as well. My original setup was: torch 2.2.1, CUDA 12.1, cuDNN 8, Python 3.10, GPU: A40 48GB (DeepSpeed enabled, ZeRO-3, bf16), RAM: 52GB, model: Llama-2-7b-chat-hf.

After changing to the following configuration it worked: CPU count: 56, RAM: 256GB, GPU: 8 × V100 16GB.

I enabled DeepSpeed and turned off bf16 and TF32, using fp16 instead, so the official bash script and the DeepSpeed JSON need to be modified accordingly:

{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": "auto",
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": false
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 5,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

The bash command was modified correspondingly:

torchrun --nproc_per_node=8 train.py \
    --model_name_or_path /workspace/Llama-2-7b-chat-hf \
    --data_path ./alpaca_data.json \
    --output_dir ./alpaca_out \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_offload_opt_param.json"

Final result: (screenshot of the successful training run)

TYZY89 commented 5 months ago

I'm using Accelerate + DeepSpeed; changing "debug: True" to "debug: False" made it work!
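
For anyone else trying this, that flag normally lives in the Accelerate config YAML; a sketch, assuming the default config location and that the file already contains a debug entry (verify the path and key on your own setup):

# Flip the debug flag in the default Accelerate config
# (adjust the path if you launch with a custom --config_file).
sed -i 's/^debug: [Tt]rue/debug: false/' ~/.cache/huggingface/accelerate/default_config.yaml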