Once I start the scripts, I see no activity on the GPUs; the ports are open, but nothing runs.
Error message:
```
nohup: ignoring input
run.sh: /home/epop/anaconda3/envs/msswift_transformers4.45/bin/python -m torch.distributed.run --nproc_per_node 2 --master_port 23456 --nnodes 2 --node_rank 0 --master_addr 192.118.2.17 /home/epop/ms-swift/swift/cli/sft.py --model_type got-ocr2 --model_id_or_path stepfun-ai/GOT-OCR2_0 --sft_type lora --dataset /home/epop/DATASET/ds_paddle_1mil_rec/ds_paddle_1mil_rec/train_structured_data.jsonl --output_dir output --deepspeed default-zero3 --save_on_each_node true
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[W1115 08:25:14.545635588 socket.cpp:462] [c10d] waitForInput: poll for socket SocketImpl(fd=11, addr=[circle]:42438, remote=[circle]:23456) returned 0, likely a timeout
[W1115 08:25:14.548684749 socket.cpp:487] [c10d] waitForInput: socket SocketImpl(fd=11, addr=[circle]:42438, remote=[circle]:23456) timed out after 900000ms
Traceback (most recent call last):
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/run.py", line 923, in <module>
    main()
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
    result = agent.run()
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run
    result = self._invoke_run(role)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 849, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 668, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 513, in _rendezvous
    workers = self._assign_worker_ranks(
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 590, in _assign_worker_ranks
    role_infos_bytes = store.multi_get(
torch.distributed.DistStoreError: wait timeout after 900000ms, keys: /none/torchelastic/role_info/0, /none/torchelastic/role_info/1
```
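The failing call, `store.multi_get`, waits on a `TCPStore` hosted by the master node: node 0 published `role_info/0` and then waited 900 s for `role_info/1`, which node 1 never delivered. The store handshake itself can be exercised in a single process with a minimal sketch (the port 29750 and the key names here are illustrative, not the exact keys torchrun uses):

```python
# Minimal sketch of the c10d TCPStore handshake behind torchrun's rendezvous.
# Port 29750 and the key names are arbitrary, for illustration only.
from datetime import timedelta
import torch.distributed as dist

# Master side: this is what --master_addr/--master_port point at on node 0.
master = dist.TCPStore("127.0.0.1", 29750, world_size=2, is_master=True,
                       timeout=timedelta(seconds=10), wait_for_workers=False)

# Worker side: node 1 connects to the same host:port as a client.
worker = dist.TCPStore("127.0.0.1", 29750, world_size=2, is_master=False,
                       timeout=timedelta(seconds=10), wait_for_workers=False)

master.set("role_info/0", "ready")
# get() blocks until the key appears or the timeout expires -- the
# DistStoreError above is exactly this wait hitting its 900000 ms limit.
print(worker.get("role_info/0").decode())
```

If this pattern works locally but the two-node launch still times out, the worker node is most likely never reaching the master's store (firewall, wrong interface, or asymmetric routing), not a problem in the training script itself.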
Fine-tuning the GOT-OCR2 model across two separate machines eventually results in this timeout.
Command on machine 1:
Command on machine 2:
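Since the log shows the rendezvous socket timing out rather than being refused, it may help to verify from node 1 that the master's rendezvous port is reachable at all before retrying the launch. A minimal sketch, assuming the same 192.118.2.17:23456 endpoint as in the command above:

```python
# Quick TCP reachability check for the rendezvous endpoint.
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Run this on node 1 while the launcher is already running on node 0;
    # host and port below are the master endpoint from the reported command.
    print(can_reach("192.118.2.17", 23456))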