hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Fine-tuning: DeepSpeed ZeRO-3 splits GPU memory evenly, but multiple threads each load a duplicate copy of the model onto a different GPU #4285

Closed scotlandowl closed 4 months ago

scotlandowl commented 4 months ago

Reminder

System Info

Reproduction

# Launch command
CUDA_VISIBLE_DEVICES=0,1 FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml

# llama3_lora_sft_ds3.yaml
### model
model_name_or_path: /gemini/Qwen1.5-14B-Chat

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: llama3_law
template: qwen
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/Qwen-14B/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
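The referenced `examples/deepspeed/ds_z3_config.json` is not shown above. For readers unfamiliar with it, a typical ZeRO-3 DeepSpeed configuration looks roughly like the following (an illustrative sketch based on DeepSpeed's documented config schema, not necessarily the repository's exact file):

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

The `"auto"` values are filled in by the HuggingFace Trainer from the YAML settings above; `"stage": 3` is what enables partitioning of parameters, gradients, and optimizer states across the two GPUs.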

# Terminal output
[2024-06-14 03:48:29,187] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
06/14/2024 03:48:35 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:24175
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2024-06-14 03:48:51,302] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
[2024-06-14 03:48:51,999] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
[2024-06-14 03:48:58,181] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-14 03:48:58,784] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-14 03:48:58,784] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using config file: /etc/orion/env/env.conf
Using config file: /etc/orion/env/env.conf

# GPU status
+--------------------------------------------------------------------------------------------+
| ORION-SMI 1.0             Time: 2024-06-14 03:59:46            CUDA Version: N/A           |
+-----------------------------------------------+----------------------+---------------------+
| IP               vGPU Name       Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC|
| pGPU  vGPU       Physical GPU Name            |         Memory-Usage | GPU-Util  Compute M.|
|===============================================+======================+=====================|
| 10.169.5.3       Orion vGPU              Off  |   N/A            Off |                 N/A |
|  2     0         B1.gpu.xlarge                |  20006MiB / 24258MiB |     99%     Default |
+--------------------------------------------------------------------------------------------+
| 10.169.5.3       Orion vGPU              Off  |   N/A            Off |                 N/A |
|  6     0         B1.gpu.xlarge                |  20006MiB / 24258MiB |      0%     Default |
+--------------------------------------------------------------------------------------------+

+--------------------------------------------------------------------------------------------+
| Processes:                                                                     vGPU Memory |
| IP               pGPU  vGPU   PID    Type   Process name                          Usage    |
|============================================================================================|
|  10.169.5.3         2     0   3397      C   python                                20006MiB |
|  10.169.5.3         6     0   3396      C   python                                20006MiB |
+--------------------------------------------------------------------------------------------+

Expected behavior

No response

Others

No response

hiyouga commented 4 months ago

This is normal.
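For context on why roughly 20 GiB per GPU is plausible here: under ZeRO-3 the fp16 base-model weights are partitioned across ranks, so each of the two GPUs holds about half of the 14B-parameter model, plus activations, CUDA buffers, and the LoRA optimizer states. A back-of-the-envelope estimate (assumed round numbers, not measured values):

```python
# Rough ZeRO-3 memory estimate for the setup in this issue (illustrative only).
NUM_PARAMS = 14e9        # Qwen1.5-14B, approximate parameter count
BYTES_PER_PARAM = 2      # fp16 weights (fp16: true in the YAML)
NUM_GPUS = 2             # CUDA_VISIBLE_DEVICES=0,1

# ZeRO-3 shards the weights evenly across the data-parallel ranks.
total_weight_bytes = NUM_PARAMS * BYTES_PER_PARAM
shard_gib = total_weight_bytes / NUM_GPUS / 2**30  # per-GPU weight shard

print(f"fp16 weight shard per GPU: ~{shard_gib:.1f} GiB")
```

A ~13 GiB weight shard plus activations and buffers is consistent with the ~20 GiB shown by `ORION-SMI` on each vGPU, i.e. each process is holding its shard, not a full duplicate copy, once initialization finishes.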