Open seyyedaliayati opened 1 year ago
I don't know if it's related, but I was using CUDA 11.6 and it showed warnings telling me to use version 11.7. Since then I also get this error message.
I'm hitting the same problem.
Was it resolved by using CUDA 11.7? @EnzoDeg40
With CUDA 11.6 it didn't work (I can't remember the exact error), and I got a warning advising me to use version 11.7. Since I upgraded to 11.7, I get this error -9. I was wondering whether the CUDA version has anything to do with it.
maybe because of CPU OOM ?
Can you provide more details please?
I watched my RAM reach 256 GB/256 GB, and then the error occurred.
Oops! Do you know how much RAM is required?
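One way to check whether host RAM really is the limit is to sample available memory at a short interval while the model loads, since a brief spike is easy to miss when checking by hand. The following is a minimal sketch, assuming a Linux host where /proc/meminfo is readable. The function names, the sample count, and the one-second interval are illustrative choices and not part of any training script in this thread.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict.

    Values are the integers as printed (kB for most fields).
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_gib(info):
    """Convert the MemAvailable field (kB) to GiB."""
    return info["MemAvailable"] / (1024 ** 2)


def lowest_available_gib(samples=60, interval_s=1.0, path="/proc/meminfo"):
    """Poll the file and return the lowest MemAvailable seen, in GiB."""
    lowest = float("inf")
    for _ in range(samples):
        with open(path) as f:
            lowest = min(lowest, available_gib(parse_meminfo(f.read())))
        time.sleep(interval_s)
    return lowest


if __name__ == "__main__":
    # Run this in a second terminal while the training job starts.
    print(f"lowest MemAvailable: {lowest_available_gib():.1f} GiB")
```

If the lowest value approaches zero just before the crash, host RAM is a likely culprit. If it stays high, the limit may be somewhere else.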
I have upgraded my GPUs to A100 40GB of memory but I still have the same issue :( Could you please help me? @lxuechen @rtaori
same problem, do not know how to fix
It may be caused by RAM or GPU memory. I got the same problem with 100 GB of RAM and one A100 40GB GPU, but it was fixed by running with 300 GB of RAM and 4 A100 40GB GPUs.
In my case, 4 A100 40GB GPUs helped, but I still hit a GPU OOM at the 75th iteration.
I solved it by upgrading from Python 3.7 to Python 3.9.
What is your hardware configuration? I encounter this issue even when using Python 3.10.
I solved this problem by adding a config. Now, with 4 × V100 32G and 328G of RAM, I can run 13B.
Hi @codemaster17611 ,
I am currently using one A100 GPU (40GB). While running the fine-tuning program, I repeatedly run the commands free -h and nvidia-smi to monitor memory usage, and I noticed that there is very little memory consumption. Is this normal?
Can you help me?
It crashed when there were still 281GB of free space available. There is almost no utilization of GPU RAM.
log:
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : FastChat/fastchat/train/train_mem.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:20001
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_xd6ijrh1/none_wczn4hm_
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:20001.
[I socket.cpp:787] [c10d] The client socket has connected to [localhost]:20001 on [localhost]:49826.
[I socket.cpp:787] [c10d] The client socket has connected to [localhost]:20001 on [localhost]:49828.
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=20001
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_xd6ijrh1/none_wczn4hm_/attempt_0/0/error.json
[I socket.cpp:787] [c10d] The client socket has connected to [localhost]:20001 on [localhost]:49832.
[I socket.cpp:787] [c10d] The client socket has connected to [localhost]:20001 on [localhost]:49834.
[I ProcessGroupNCCL.cpp:665] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:842] [Rank 0] NCCL watchdog thread started!
/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 12003) of binary: /root/miniconda3/envs/vicuna/bin/python
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0039861202239990234 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 0 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
File "/root/miniconda3/envs/vicuna/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
FastChat/fastchat/train/train_mem.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-05-16_08:23:55
host : mx-69977d7b58-zrz6r
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 12003)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 12003
======================================================
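The exitcode: -9 in the log above follows the usual POSIX convention that a negative return code means the process was terminated by that signal number, which the log itself reports as SIGKILL. On Linux, the kernel's out-of-memory killer uses SIGKILL, which is why this error is so often traced back to host RAM. A minimal sketch of decoding such codes, assuming a POSIX system (describe_exit_code is an illustrative name, not a torch API):

```python
import signal
import subprocess
import sys


def describe_exit_code(returncode):
    """Turn a subprocess-style return code into a readable description."""
    if returncode < 0:
        # Negative codes mean "terminated by signal number -returncode".
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"unknown signal {-returncode}"
        return f"killed by {name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Demo: a child process that sends SIGKILL to itself.
    child = subprocess.run(
        [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    )
    print(child.returncode, "->", describe_exit_code(child.returncode))
```

SIGKILL cannot be caught, so the worker leaves no Python traceback, which matches the "FAILED with no error file" line. If the job runs inside a container with its own memory limit (the thread does not say), the kill can happen while the host still shows plenty of free memory.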
@Minxiangliu, are you using DeepSpeed?
I am not using DeepSpeed. Here are the commands I am running.
TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO LOGLEVEL=INFO \
torchrun --nproc_per_node=1 --master_port=20001 FastChat/fastchat/train/train_mem.py \
--model_name_or_path /raid/minxiang83/Program/vicuna/llama-7b \
--data_path /raid/minxiang83/Program/vicuna/datasets/dummy.json \
--bf16 True \
--output_dir finetune_output \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 10 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True
Then 40 GB of GPU memory is not enough... On my 4 × 32 GB V100 setup, I tried running without DeepSpeed and it went OOM right away.
So, would you recommend using the DeepSpeed approach for training?
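As a rough sanity check on why a single 40 GB GPU falls short for full fine-tuning of a 7B model: a commonly cited rule of thumb for Adam-style training is about 16 bytes of state per parameter (weights, gradients, and the two optimizer moments, with some of these held in fp32), before counting activations. The sketch below applies that rule. The 16-byte figure and the nominal 7,000,000,000 parameter count are assumptions for illustration, not measurements from this thread.

```python
def training_state_gb(num_params, bytes_per_param=16):
    """Approximate weights + gradients + optimizer state, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def per_gpu_gb(num_params, num_gpus, bytes_per_param=16):
    """Same estimate when the state is fully sharded across GPUs."""
    return training_state_gb(num_params, bytes_per_param) / num_gpus


if __name__ == "__main__":
    params = 7_000_000_000  # nominal size of a "7B" model (assumption)
    print(f"total state: {training_state_gb(params):.0f} GB")
    for gpus in (1, 4, 8):
        print(f"{gpus} GPU(s): {per_gpu_gb(params, gpus):.0f} GB each")
```

Under these assumptions the training state alone is far above 40 GB on one GPU, which is consistent with the reports here that sharding across several GPUs or offloading to CPU RAM with DeepSpeed was needed. Offloading moves that state into host RAM, so the RAM requirement grows accordingly.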
I also have 4 × 32 GB V100s but can't get it to run. Is 128 GB of RAM not enough?
I ended up using the following configuration to complete the fine-tuning process. https://github.com/lm-sys/FastChat/issues/1200#issuecomment-1556866764
Thx !
I got the same error, then noticed empty records in my JSON data. Removing them solved the problem for me.
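For anyone wanting to check their own dataset for this, here is a minimal sketch that drops empty records from a JSON file. It assumes the file holds a list of objects and treats a record as empty when every field is blank. The file names data.json and data.cleaned.json are hypothetical, and the exact definition of "empty" in the comment above is not stated, so adjust the check to your data.

```python
import json


def is_blank(value):
    """True for None, empty or whitespace-only strings, and empty containers."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False


def is_empty_record(record):
    """A record is empty if it is not a dict or every field is blank."""
    if not isinstance(record, dict):
        return True
    return all(is_blank(v) for v in record.values())


def drop_empty_records(records):
    """Return (kept_records, number_dropped)."""
    kept = [r for r in records if not is_empty_record(r)]
    return kept, len(records) - len(kept)


if __name__ == "__main__":
    # Hypothetical paths; substitute your own dataset files.
    with open("data.json", encoding="utf-8") as f:
        records = json.load(f)
    kept, dropped = drop_empty_records(records)
    with open("data.cleaned.json", "w", encoding="utf-8") as f:
        json.dump(kept, f, ensure_ascii=False, indent=2)
    print(f"kept {len(kept)} records, dropped {dropped}")
```

A record with only some blank fields, such as an Alpaca-style entry with an empty input, is kept on purpose.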
same
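It may be caused by insufficient GPU memory or RAM. I ran into this problem at first too, with the following configuration: torch: 2.2.1, cuda: 12.1, cudnn: 8, python: 3.10, GPU: A40 48G (DeepSpeed enabled, using ZeRO-3 and bf16), RAM: 52G, model: Llama-2-7b-chat-hf
After changing the configuration it worked. The new configuration: CPU num: 56, RAM size: 256G, GPU: V100 16G * 8
I enabled DeepSpeed, disabled bf16 and TF32, and used fp16 instead.
So the official bash script and the DeepSpeed JSON need to be modified: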
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": "auto",
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": false
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 5,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
The bash script is modified accordingly:
torchrun --nproc_per_node=8 train.py \
--model_name_or_path /workspace/Llama-2-7b-chat-hf \
--data_path ./alpaca_data.json \
--output_dir ./alpaca_out \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--deepspeed "./configs/default_offload_opt_param.json"
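The switch from bf16 and TF32 to fp16 fits the hardware: bf16 and TF32 need GPUs of the Ampere generation or newer (CUDA compute capability 8.0 and up), while the V100 is compute capability 7.0. A minimal sketch of choosing the flags from the detected capability follows. The precision_flags helper is hypothetical, and the torch import is optional so the script still runs on a machine without PyTorch.

```python
def precision_flags(compute_capability):
    """Suggest precision settings from a (major, minor) compute capability."""
    major, _minor = compute_capability
    ampere_or_newer = major >= 8
    return {
        "bf16": ampere_or_newer,
        "tf32": ampere_or_newer,
        "fp16": not ampere_or_newer,
    }


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        # Query the first visible GPU.
        print(precision_flags(torch.cuda.get_device_capability(0)))
    else:
        # Fall back to the V100 value (7, 0) for illustration.
        print(precision_flags((7, 0)))
```

On an A100 or A40 this suggests bf16 and TF32, and on a V100 it suggests fp16.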
Final result:
I'm using Accelerate + DeepSpeed, change "debug: True" -> "debug: False" and it works!
I am trying to re-train alpaca on the following machine:
Here is my command to start training:
But I got the following errors:
Could you please help me?