huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

DeepSpeed full fine-tuning gets stuck. #21

Closed ChenDRAG closed 11 months ago

ChenDRAG commented 11 months ago

I ran this command:

CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --main_process_port 0 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_full.yaml

INFO:root:Using nproc_per_node=8.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
INFO:torch.distributed.elastic.rendezvous.static_tcp_rendezvous:Creating TCPStore as the c10d::Store implementation
[2023-11-14 08:13:14,714] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-14 08:13:14,899] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-14 08:13:14,914] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
......
/git/trl/trl/trainer/ppo_config.py:141: UserWarning: The `optimize_cuda_cache` arguement will be deprecated soon, please use `optimize_device_cache` instead.
  warnings.warn(
[2023-11-14 08:13:19,132] [INFO] [comm.py:637:init_distributed] cdb=None
/git/trl/trl/trainer/ppo_config.py:141: UserWarning: The `optimize_cuda_cache` arguement will be deprecated soon, please use `optimize_device_cache` instead.
  warnings.warn(
[2023-11-14 08:13:19,235] [INFO] [comm.py:637:init_distributed] cdb=None

The program gets stuck after printing the log output above. I don't know what is wrong, since there is no error message. Can you help me with this?

ChenDRAG commented 11 months ago

I found the problem: --main_process_port 0 causes a communication error. Thanks!
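For reference, a minimal sketch of the corrected launch. The port 29500 here is just an example of an unused port, not a value prescribed by the handbook; any free port above 1024 should work:

CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --main_process_port 29500 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_full.yaml

Alternatively, dropping --main_process_port entirely should let accelerate fall back to its default port.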