Closed lxww302 closed 6 months ago
Is that the full error trace? It looks like part of it may be cut off.
I can run single-node training without any problem, but when I switch to multi-node training, it fails immediately.
My command:
accelerate launch --config_file=/opt/tiger/alignment/accelerate_configs/deepspeed_zero3.yaml --num_machines 2 --machine_rank 0 --num_processes 16 --main_process_ip 10.124.167.213 --main_process_port 9686 sft/sft.py --model_name_or_path=model_dir --dataset_name=data_dir --per_device_train_batch_size=1 --output_dir=model_save --bf16 --save_total_limit=5 --warmup_steps=500 --save_steps=1000 --max_seq_length=2048 --attn_implementation=flash_attention_2 --neftune_noise_alpha=5
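For anyone reproducing this: with `accelerate launch`, a multi-node job is normally started by running the same command on every machine, changing only `--machine_rank`, and `--num_processes` is the total across all machines (so 16 here would presumably mean 8 GPUs per node, which is an assumption about the hardware). A minimal sketch of that pattern follows. The `launch_cmd` helper is hypothetical and only prints the command; the IP, port, and config path are copied from the command above.

```shell
# Hypothetical helper: print the launch command for one node.
# First argument is the node rank (0 = main node); any remaining
# arguments are passed through to sft/sft.py unchanged.
launch_cmd() {
  local rank="$1"
  shift
  echo accelerate launch \
    --config_file=/opt/tiger/alignment/accelerate_configs/deepspeed_zero3.yaml \
    --num_machines 2 --machine_rank "$rank" --num_processes 16 \
    --main_process_ip 10.124.167.213 --main_process_port 9686 \
    sft/sft.py "$@"
}

# Node 0 (the machine at --main_process_ip) and node 1 differ only in rank.
launch_cmd 0 --bf16
launch_cmd 1 --bf16
```

Both nodes have to be able to reach `--main_process_ip` on `--main_process_port`, so it is worth confirming that the second machine was started with rank 1 and the same address before digging into the trace.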
Error messages: