modelscope / swift

ms-swift: Use PEFT or Full-parameter to finetune 250+ LLMs or 35+ MLLMs. (Qwen2, GLM4, Internlm2, Yi, Llama3, Llava, MiniCPM-V, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://github.com/modelscope/swift/blob/main/docs/source/LLM/index.md
Apache License 2.0

Problems with multi-node multi-GPU training #1222

Open · Jamly7 opened this issue 4 days ago

Jamly7 commented 4 days ago

The setup is two Linux machines, each with two A100 40G GPUs: A100 (40G) * 2. The training commands are as follows. Master node:

```shell
CUDA_VISIBLE_DEVICES=0,1 NNODES=2 NODE_RANK=0 NPROC_PER_NODE=2 MASTER_ADDR=127.0.0.1 \
swift sft \
    --model_type qwen1half-7b-chat \
    --model_id_or_path /mnt/model_repository/Qwen1.5-7B-Chat/ \
    --dataset /root/lh/data2.jsonl \
    --output_dir /root/lh/output/ \
    --add_output_dir_suffix false \
    --deepspeed default-zero3 \
    --ddp_backend=nccl
```

The worker node runs the same command, except NODE_RANK is 1 and MASTER_ADDR is the master node's IP. After launching there are no errors, but the program is unresponsive and stays stuck at loading the model.

Terminal output:

```
run sh: torchrun --nproc_per_node 2 --nnodes 2 --node_rank 0 --master_addr 127.0.0.1 /root/lh/swift/swift/cli/sft.py --model_type qwen1half-7b-chat --model_id_or_path /mnt/model_repository/Qwen1.5-7B-Chat/ --dataset /root/lh/data2.jsonl --output_dir /root/lh/output/ --add_output_dir_suffix false --deepspeed default-zero3 --ddp_backend=nccl
W0625 08:54:53.020000 139898249515648 torch/distributed/run.py:757]
W0625 08:54:53.020000 139898249515648 torch/distributed/run.py:757]
W0625 08:54:53.020000 139898249515648 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0625 08:54:53.020000 139898249515648 torch/distributed/run.py:757]
2024-06-25 08:55:01,191 - modelscope - INFO - PyTorch version 2.3.0 Found.
2024-06-25 08:55:01,191 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-06-25 08:55:01,218 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 59204b9b662d35f89093211a27a896f8 and a total number of 980 components indexed
2024-06-25 08:55:01,493 - modelscope - INFO - PyTorch version 2.3.0 Found.
2024-06-25 08:55:01,493 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-06-25 08:55:01,520 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 59204b9b662d35f89093211a27a896f8 and a total number of 980 components indexed
[INFO:swift] Successfully registered /root/lh/swift/swift/llm/data/dataset_info.json
[INFO:swift] Start time of running main: 2024-06-25 08:55:01.966367
[2024-06-25 08:55:02,060] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.0), only 1.0.0 is known to be compatible
[INFO:swift] Setting template_type: qwen
[INFO:swift] Using deepspeed: {'fp16': {'enabled': 'auto', 'loss_scale': 0, 'loss_scale_window': 1000, 'initial_scale_power': 16, 'hysteresis': 2, 'min_loss_scale': 1}, 'bf16': {'enabled': 'auto'}, 'optimizer': {'type': 'AdamW', 'params': {'lr': 'auto', 'betas': 'auto', 'eps': 'auto', 'weight_decay': 'auto'}}, 'scheduler': {'type': 'WarmupDecayLR', 'params': {'total_num_steps': 'auto', 'warmup_min_lr': 'auto', 'warmup_max_lr': 'auto', 'warmup_num_steps': 'auto'}}, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'none', 'pin_memory': True}, 'offload_param': {'device': 'none', 'pin_memory': True}, 'overlap_comm': True, 'contiguous_gradients': True, 'sub_group_size': 1000000000.0, 'reduce_bucket_size': 'auto', 'stage3_prefetch_bucket_size': 'auto', 'stage3_param_persistence_threshold': 'auto', 'stage3_max_live_parameters': 1000000000.0, 'stage3_max_reuse_distance': 1000000000.0, 'stage3_gather_16bit_weights_on_model_save': True}, 'gradient_accumulation_steps': 'auto', 'gradient_clipping': 'auto', 'steps_per_print': 2000, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'wall_clock_breakdown': False}
[INFO:swift] Setting args.lazy_tokenize: False
[INFO:swift] Setting args.dataloader_num_workers: 1
[2024-06-25 08:55:02,247] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-25 08:55:02,310] [INFO] [comm.py:637:init_distributed] cdb=None
device_count: 2
rank: 1, local_rank: 1, world_size: 3, local_world_size: 2
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.0), only 1.0.0 is known to be compatible
[2024-06-25 08:55:02,495] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-25 08:55:02,495] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[INFO:swift] args: SftArguments(model_type='qwen1half-7b-chat', model_id_or_path='/mnt/model_repository/Qwen1.5-7B-Chat', model_revision='master', sft_type='lora', freeze_parameters=0.0, additional_trainable_parameters=[], tuner_backend='peft', template_type='qwen', output_dir='/root/lh/output', add_output_dir_suffix=False, ddp_backend='nccl', ddp_find_unused_parameters=None, ddp_broadcast_buffers=None, seed=42, resume_from_checkpoint=None, resume_only_model=False, ignore_data_skip=False, dtype='bf16', packing=False, dataset=['/root/lh/data2.jsonl'], val_dataset=[], dataset_seed=42, dataset_test_ratio=0.01, use_loss_scale=False, loss_scale_config_path='/root/lh/swift/swift/llm/agent/default_loss_scale_config.json', system=None, tools_prompt='react_en', max_length=2048, truncation_strategy='delete', check_dataset_strategy='none', model_name=[None, None], model_author=[None, None], quant_method=None, quantization_bit=0, hqq_axis=0, hqq_dynamic_config_path=None, bnb_4bit_comp_dtype='bf16', bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, lora_target_modules=['q_proj', 'k_proj', 'v_proj'], lora_rank=8, lora_alpha=32, lora_dropout_p=0.05, lora_bias_trainable='none', lora_modules_to_save=[], lora_dtype='AUTO', lora_lr_ratio=None, use_rslora=False, use_dora=False, init_lora_weights='true', rope_scaling=None, boft_block_size=4, boft_block_num=0, boft_n_butterfly_factor=1, boft_target_modules=['DEFAULT'], boft_dropout=0.0, boft_modules_to_save=[], vera_rank=256, vera_target_modules=['DEFAULT'], vera_projection_prng_key=0, vera_dropout=0.0, vera_d_initial=0.1, vera_modules_to_save=[], adapter_act='gelu', adapter_length=128, use_galore=False, galore_rank=128, galore_target_modules=None, galore_update_proj_gap=50, galore_scale=1.0, galore_proj_type='std', galore_optim_per_parameter=False, galore_with_embedding=False, adalora_target_r=8, adalora_init_r=12, adalora_tinit=0, adalora_tfinal=0, adalora_deltaT=1, adalora_beta1=0.85, adalora_beta2=0.85, adalora_orth_reg_weight=0.5, ia3_target_modules=['DEFAULT'], ia3_feedforward_modules=[], ia3_modules_to_save=[], llamapro_num_new_blocks=4, llamapro_num_groups=None, neftune_noise_alpha=None, neftune_backend='transformers', lisa_activated_layers=0, lisa_step_interval=20, gradient_checkpointing=True, deepspeed={'fp16': {'enabled': 'auto', 'loss_scale': 0, 'loss_scale_window': 1000, 'initial_scale_power': 16, 'hysteresis': 2, 'min_loss_scale': 1}, 'bf16': {'enabled': 'auto'}, 'optimizer': {'type': 'AdamW', 'params': {'lr': 'auto', 'betas': 'auto', 'eps': 'auto', 'weight_decay': 'auto'}}, 'scheduler': {'type': 'WarmupDecayLR', 'params': {'total_num_steps': 'auto', 'warmup_min_lr': 'auto', 'warmup_max_lr': 'auto', 'warmup_num_steps': 'auto'}}, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'none', 'pin_memory': True}, 'offload_param': {'device': 'none', 'pin_memory': True}, 'overlap_comm': True, 'contiguous_gradients': True, 'sub_group_size': 1000000000.0, 'reduce_bucket_size': 'auto', 'stage3_prefetch_bucket_size': 'auto', 'stage3_param_persistence_threshold': 'auto', 'stage3_max_live_parameters': 1000000000.0, 'stage3_max_reuse_distance': 1000000000.0, 'stage3_gather_16bit_weights_on_model_save': True}, 'gradient_accumulation_steps': 'auto', 'gradient_clipping': 'auto', 'steps_per_print': 2000, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'wall_clock_breakdown': False}, batch_size=1, eval_batch_size=1, num_train_epochs=1, max_steps=-1, optim='adamw_torch', adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, learning_rate=0.0001, weight_decay=0.1, gradient_accumulation_steps=6, max_grad_norm=0.5, predict_with_generate=False, lr_scheduler_type='linear', warmup_ratio=0.05, eval_steps=50, save_steps=50, save_only_model=False, save_total_limit=2, logging_steps=5, acc_steps=1, dataloader_num_workers=1, dataloader_pin_memory=True, dataloader_drop_last=False, push_to_hub=False, hub_model_id=None, hub_token=None, hub_private_repo=False, push_hub_strategy='push_best', test_oom_error=False, disable_tqdm=False, lazy_tokenize=False, preprocess_num_proc=1, use_flash_attn=None, ignore_args_error=False, check_model_is_latest=True, logging_dir='/root/lh/output/runs', report_to=['tensorboard'], acc_strategy='token', save_on_each_node=True, evaluation_strategy='steps', save_strategy='steps', save_safetensors=True, gpu_memory_fraction=None, include_num_input_tokens_seen=False, local_repo_path=None, custom_register_path=None, custom_dataset_info=None, device_map_config_path=None, max_new_tokens=2048, do_sample=True, temperature=0.3, top_k=20, top_p=0.7, repetition_penalty=1.0, num_beams=1, fsdp='', fsdp_config=None, sequence_parallel_size=1, model_layer_cls_name=None, metric_warmup_step=0, fsdp_num=1, per_device_train_batch_size=None, per_device_eval_batch_size=None, eval_strategy=None, self_cognition_sample=0, train_dataset_mix_ratio=0.0, train_dataset_mix_ds=['ms-bench'], train_dataset_sample=-1, val_dataset_sample=None, safe_serialization=None, only_save_model=None, neftune_alpha=None, deepspeed_config_path=None, model_cache_dir=None, custom_train_dataset_path=[], custom_val_dataset_path=[])
[INFO:swift] Global seed set to 42
device_count: 2
rank: 0, local_rank: 0, world_size: 3, local_world_size: 2
[INFO:swift] Loading the model using model_dir: /mnt/model_repository/Qwen1.5-7B-Chat
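```

One detail worth checking in the commands above: in a two-machine setup, both nodes are usually given the same MASTER_ADDR, pointing at an address the worker can actually reach. Using 127.0.0.1 on the master can work, since rank 0 listens on all interfaces, but giving both sides the real IP removes one variable. Below is a hedged sketch of such a launch, assuming (from the NCCL log later in this thread) that the master's LAN IP is 192.168.1.43 and that MASTER_PORT is honored the same way as the other torchrun environment variables:

```shell
# Master node (NODE_RANK=0). Assumption: 192.168.1.43 is the master's LAN IP,
# taken from the NCCL error log later in this thread.
CUDA_VISIBLE_DEVICES=0,1 NNODES=2 NODE_RANK=0 NPROC_PER_NODE=2 \
MASTER_ADDR=192.168.1.43 MASTER_PORT=29500 \
swift sft \
    --model_type qwen1half-7b-chat \
    --model_id_or_path /mnt/model_repository/Qwen1.5-7B-Chat/ \
    --dataset /root/lh/data2.jsonl \
    --output_dir /root/lh/output/ \
    --add_output_dir_suffix false \
    --deepspeed default-zero3 \
    --ddp_backend=nccl

# Worker node: identical, except NODE_RANK=1.
```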

Jamly7 commented 4 days ago

Sorry, the terminal output above was pasted incorrectly. The correct output is as follows:

Jamly7 commented 4 days ago

```
run sh: torchrun --nproc_per_node 2 --nnodes 2 --node_rank 0 --master_addr 127.0.0.1 /root/lh/swift/swift/cli/sft.py --model_type qwen1half-7b-chat --model_id_or_path /mnt/model_repository/Qwen1.5-7B-Chat/ --dataset /root/lh/data2.jsonl --output_dir /root/lh/output/ --add_output_dir_suffix false --deepspeed default-zero3 --ddp_backend=nccl
W0625 08:58:08.766000 140195703509632 torch/distributed/run.py:757]
W0625 08:58:08.766000 140195703509632 torch/distributed/run.py:757]
W0625 08:58:08.766000 140195703509632 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0625 08:58:08.766000 140195703509632 torch/distributed/run.py:757]
2024-06-25 08:58:43,813 - modelscope - INFO - PyTorch version 2.3.0 Found.
2024-06-25 08:58:43,813 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-06-25 08:58:43,829 - modelscope - INFO - PyTorch version 2.3.0 Found.
2024-06-25 08:58:43,829 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2024-06-25 08:58:43,839 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 59204b9b662d35f89093211a27a896f8 and a total number of 980 components indexed
2024-06-25 08:58:43,855 - modelscope - INFO - Loading done! Current index file version is 1.15.0, with md5 59204b9b662d35f89093211a27a896f8 and a total number of 980 components indexed
[INFO:swift] Successfully registered /root/lh/swift/swift/llm/data/dataset_info.json
[INFO:swift] Start time of running main: 2024-06-25 08:58:44.285717
[INFO:swift] Setting template_type: qwen
[INFO:swift] Using deepspeed: {'fp16': {'enabled': 'auto', 'loss_scale': 0, 'loss_scale_window': 1000, 'initial_scale_power': 16, 'hysteresis': 2, 'min_loss_scale': 1}, 'bf16': {'enabled': 'auto'}, 'optimizer': {'type': 'AdamW', 'params': {'lr': 'auto', 'betas': 'auto', 'eps': 'auto', 'weight_decay': 'auto'}}, 'scheduler': {'type': 'WarmupDecayLR', 'params': {'total_num_steps': 'auto', 'warmup_min_lr': 'auto', 'warmup_max_lr': 'auto', 'warmup_num_steps': 'auto'}}, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'none', 'pin_memory': True}, 'offload_param': {'device': 'none', 'pin_memory': True}, 'overlap_comm': True, 'contiguous_gradients': True, 'sub_group_size': 1000000000.0, 'reduce_bucket_size': 'auto', 'stage3_prefetch_bucket_size': 'auto', 'stage3_param_persistence_threshold': 'auto', 'stage3_max_live_parameters': 1000000000.0, 'stage3_max_reuse_distance': 1000000000.0, 'stage3_gather_16bit_weights_on_model_save': True}, 'gradient_accumulation_steps': 'auto', 'gradient_clipping': 'auto', 'steps_per_print': 2000, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'wall_clock_breakdown': False}
[INFO:swift] Setting args.lazy_tokenize: False
[INFO:swift] Setting args.dataloader_num_workers: 1
[2024-06-25 08:58:44,541] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.0), only 1.0.0 is known to be compatible
[2024-06-25 08:58:44,729] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-25 08:58:44,785] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-25 08:58:44,785] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[INFO:swift] args: SftArguments(model_type='qwen1half-7b-chat', model_id_or_path='/mnt/model_repository/Qwen1.5-7B-Chat', model_revision='master', sft_type='lora', freeze_parameters=0.0, additional_trainable_parameters=[], tuner_backend='peft', template_type='qwen', output_dir='/root/lh/output', add_output_dir_suffix=False, ddp_backend='nccl', ddp_find_unused_parameters=None, ddp_broadcast_buffers=None, seed=42, resume_from_checkpoint=None, resume_only_model=False, ignore_data_skip=False, dtype='bf16', packing=False, dataset=['/root/lh/data2.jsonl'], val_dataset=[], dataset_seed=42, dataset_test_ratio=0.01, use_loss_scale=False, loss_scale_config_path='/root/lh/swift/swift/llm/agent/default_loss_scale_config.json', system=None, tools_prompt='react_en', max_length=2048, truncation_strategy='delete', check_dataset_strategy='none', model_name=[None, None], model_author=[None, None], quant_method=None, quantization_bit=0, hqq_axis=0, hqq_dynamic_config_path=None, bnb_4bit_comp_dtype='bf16', bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, lora_target_modules=['q_proj', 'k_proj', 'v_proj'], lora_rank=8, lora_alpha=32, lora_dropout_p=0.05, lora_bias_trainable='none', lora_modules_to_save=[], lora_dtype='AUTO', lora_lr_ratio=None, use_rslora=False, use_dora=False, init_lora_weights='true', rope_scaling=None, boft_block_size=4, boft_block_num=0, boft_n_butterfly_factor=1, boft_target_modules=['DEFAULT'], boft_dropout=0.0, boft_modules_to_save=[], vera_rank=256, vera_target_modules=['DEFAULT'], vera_projection_prng_key=0, vera_dropout=0.0, vera_d_initial=0.1, vera_modules_to_save=[], adapter_act='gelu', adapter_length=128, use_galore=False, galore_rank=128, galore_target_modules=None, galore_update_proj_gap=50, galore_scale=1.0, galore_proj_type='std', galore_optim_per_parameter=False, galore_with_embedding=False, adalora_target_r=8, adalora_init_r=12, adalora_tinit=0, adalora_tfinal=0, adalora_deltaT=1, adalora_beta1=0.85, adalora_beta2=0.85, adalora_orth_reg_weight=0.5, ia3_target_modules=['DEFAULT'], ia3_feedforward_modules=[], ia3_modules_to_save=[], llamapro_num_new_blocks=4, llamapro_num_groups=None, neftune_noise_alpha=None, neftune_backend='transformers', lisa_activated_layers=0, lisa_step_interval=20, gradient_checkpointing=True, deepspeed={'fp16': {'enabled': 'auto', 'loss_scale': 0, 'loss_scale_window': 1000, 'initial_scale_power': 16, 'hysteresis': 2, 'min_loss_scale': 1}, 'bf16': {'enabled': 'auto'}, 'optimizer': {'type': 'AdamW', 'params': {'lr': 'auto', 'betas': 'auto', 'eps': 'auto', 'weight_decay': 'auto'}}, 'scheduler': {'type': 'WarmupDecayLR', 'params': {'total_num_steps': 'auto', 'warmup_min_lr': 'auto', 'warmup_max_lr': 'auto', 'warmup_num_steps': 'auto'}}, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'none', 'pin_memory': True}, 'offload_param': {'device': 'none', 'pin_memory': True}, 'overlap_comm': True, 'contiguous_gradients': True, 'sub_group_size': 1000000000.0, 'reduce_bucket_size': 'auto', 'stage3_prefetch_bucket_size': 'auto', 'stage3_param_persistence_threshold': 'auto', 'stage3_max_live_parameters': 1000000000.0, 'stage3_max_reuse_distance': 1000000000.0, 'stage3_gather_16bit_weights_on_model_save': True}, 'gradient_accumulation_steps': 'auto', 'gradient_clipping': 'auto', 'steps_per_print': 2000, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'wall_clock_breakdown': False}, batch_size=1, eval_batch_size=1, num_train_epochs=1, max_steps=-1, optim='adamw_torch', adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, learning_rate=0.0001, weight_decay=0.1, gradient_accumulation_steps=4, max_grad_norm=0.5, predict_with_generate=False, lr_scheduler_type='linear', warmup_ratio=0.05, eval_steps=50, save_steps=50, save_only_model=False, save_total_limit=2, logging_steps=5, acc_steps=1, dataloader_num_workers=1, dataloader_pin_memory=True, dataloader_drop_last=False, push_to_hub=False, hub_model_id=None, hub_token=None, hub_private_repo=False, push_hub_strategy='push_best', test_oom_error=False, disable_tqdm=False, lazy_tokenize=False, preprocess_num_proc=1, use_flash_attn=None, ignore_args_error=False, check_model_is_latest=True, logging_dir='/root/lh/output/runs', report_to=['tensorboard'], acc_strategy='token', save_on_each_node=True, evaluation_strategy='steps', save_strategy='steps', save_safetensors=True, gpu_memory_fraction=None, include_num_input_tokens_seen=False, local_repo_path=None, custom_register_path=None, custom_dataset_info=None, device_map_config_path=None, max_new_tokens=2048, do_sample=True, temperature=0.3, top_k=20, top_p=0.7, repetition_penalty=1.0, num_beams=1, fsdp='', fsdp_config=None, sequence_parallel_size=1, model_layer_cls_name=None, metric_warmup_step=0, fsdp_num=1, per_device_train_batch_size=None, per_device_eval_batch_size=None, eval_strategy=None, self_cognition_sample=0, train_dataset_mix_ratio=0.0, train_dataset_mix_ds=['ms-bench'], train_dataset_sample=-1, val_dataset_sample=None, safe_serialization=None, only_save_model=None, neftune_alpha=None, deepspeed_config_path=None, model_cache_dir=None, custom_train_dataset_path=[], custom_val_dataset_path=[])
[INFO:swift] Global seed set to 42
device_count: 2
rank: 0, local_rank: 0, world_size: 4, local_world_size: 2
[INFO:swift] Loading the model using model_dir: /mnt/model_repository/Qwen1.5-7B-Chat
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.0), only 1.0.0 is known to be compatible
[2024-06-25 08:58:44,981] [INFO] [comm.py:637:init_distributed] cdb=None
device_count: 2
rank: 1, local_rank: 1, world_size: 4, local_world_size: 2
```

Jintao-Huang commented 4 days ago

Try loading the model into memory on a single GPU first, then move to multi-node multi-GPU.
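
One way to read this suggestion (a sketch and my own interpretation, not an official command): run the same job on a single GPU with no distributed settings, to confirm the checkpoint itself loads and to warm the filesystem cache before the multi-node launch:

```shell
# Single-GPU sanity check using the same arguments as the distributed run,
# minus the multi-node env vars and the ZeRO-3 config. If this also hangs at
# "Loading the model", the problem is not the cross-node network.
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen1half-7b-chat \
    --model_id_or_path /mnt/model_repository/Qwen1.5-7B-Chat/ \
    --dataset /root/lh/data2.jsonl \
    --output_dir /root/lh/output/ \
    --add_output_dir_suffix false
```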

Jamly7 commented 3 days ago

Is there a command for that? I don't see it in the tutorial.

Jamly7 commented 2 days ago

Does "loading the model into memory on a single GPU" mean deployment (swift deploy)? I deployed first and then launched the fine-tuning commands on the master and worker nodes, but it still hangs at loading the model (`[INFO:swift] Loading the model using model_dir: /mnt/model_repository/Qwen1.5-7B-Chat`).

Jamly7 commented 2 days ago

Does multi-node multi-GPU training place any requirements on the hardware or the network?

Jamly7 commented 1 day ago

After waiting 30 minutes until the timeout, I found the socket connection was failing. It later turned out to be a network-interface configuration problem: after adding two settings, `export NCCL_SOCKET_IFNAME=eth0` and `export NCCL_IB_DISABLE=1`, the run proceeded into the model-loading stage. Not yet sure whether other problems will follow.
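
For later readers, the two settings as a block to run on every node before launching the job; the `NCCL_DEBUG` line is an optional extra for diagnosis, not part of the original fix:

```shell
export NCCL_SOCKET_IFNAME=eth0   # pin NCCL traffic to the Ethernet interface
export NCCL_IB_DISABLE=1         # disable the InfiniBand transport (none present here)
export NCCL_DEBUG=INFO           # optional: log which interfaces/transports NCCL picks
```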

changqingla commented 1 day ago

> After waiting 30 minutes until the timeout, I found the socket connection was failing. It later turned out to be a network-interface configuration problem: after adding two settings, `export NCCL_SOCKET_IFNAME=eth0` and `export NCCL_IB_DISABLE=1`, the run proceeded into the model-loading stage. Not yet sure whether...

Did you hit any errors after that? I have a similar problem; it looks like an issue on the NCCL side.

Jamly7 commented 1 day ago

Yes, there was a later error. After setting NCCL_SOCKET_IFNAME, when the run entered the training stage, NCCL tried to connect to the master node's socket on port 36791, and the connection failed.

Jamly7 commented 1 day ago

```
NCCL WARN socketProgressOpt: Call to recv from 192.168.1.43<36791> failed : Broken pipe
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: NCCL error: remote process exited or there was a network error, NCCL version 2.20.5
ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely. Last error: socketProgressOpt: Call to recv from 192.168.1.43<36791> failed : Broken pipe
```
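
A hedged debugging checklist for this kind of broken pipe: port 36791 in the log is an ephemeral NCCL socket that changes every run, so test general reachability and the fixed rendezvous port rather than that exact number (torchrun's default master port is 29500; substitute yours if you set MASTER_PORT):

```shell
# Run from the worker node; 192.168.1.43 is the master's IP from the log above.
ping -c 3 192.168.1.43       # basic reachability between the two machines
nc -zv 192.168.1.43 29500    # is the rendezvous port open? (assumes MASTER_PORT=29500)
# If individual connections die mid-run, check for firewalls dropping the
# ephemeral ports NCCL opens between nodes (e.g. firewalld/ufw rules) and for
# MTU mismatches on eth0 between the two machines.
```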