Once I start the scripts, I see no activity on the GPUs; the ports are open, but nothing runs.
Error message:
```
nohup: ignoring input
run.sh: /home/epop/anaconda3/envs/msswift_transformers4.45/bin/python -m torch.distributed.run --nproc_per_node 2 --master_port 23456 --nnodes 2 --node_rank 0 --master_addr 192.118.2.17 /home/epop/ms-swift/swift/cli/sft.py --model_type got-ocr2 --model_id_or_path stepfun-ai/GOT-OCR2_0 --sft_type lora --dataset /home/epop/DATASET/ds_paddle_1mil_rec/ds_paddle_1mil_rec/train_structured_data.jsonl --output_dir output --deepspeed default-zero3 --save_on_each_node true
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[W1115 08:25:14.545635588 socket.cpp:462] [c10d] waitForInput: poll for socket SocketImpl(fd=11, addr=[circle]:42438, remote=[circle]:23456) returned 0, likely a timeout
[W1115 08:25:14.548684749 socket.cpp:487] [c10d] waitForInput: socket SocketImpl(fd=11, addr=[circle]:42438, remote=[circle]:23456) timed out after 900000ms
Traceback (most recent call last):
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/run.py", line 923, in <module>
    main()
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
    result = agent.run()
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run
    result = self._invoke_run(role)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 849, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 668, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 513, in _rendezvous
    workers = self._assign_worker_ranks(
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/epop/anaconda3/envs/msswift_transformers4.45/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 590, in _assign_worker_ranks
    role_infos_bytes = store.multi_get(
torch.distributed.DistStoreError: wait timeout after 900000ms, keys: /none/torchelastic/role_info/0, /none/torchelastic/role_info/1
```
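The failing call, `store.multi_get`, waits on a `TCPStore` hosted by the master node: node 0 published `role_info/0` and then waited 900 s for `role_info/1`, which node 1 never delivered. The store handshake itself can be exercised in a single process with a minimal sketch (the port 29750 and the key names here are illustrative, not the exact keys torchrun uses):

```python
# Minimal sketch of the c10d TCPStore handshake behind torchrun's rendezvous.
# Port 29750 and the key names are arbitrary, for illustration only.
from datetime import timedelta
import torch.distributed as dist

# Master side: this is what --master_addr/--master_port point at on node 0.
master = dist.TCPStore("127.0.0.1", 29750, world_size=2, is_master=True,
                       timeout=timedelta(seconds=10), wait_for_workers=False)

# Worker side: node 1 connects to the same host:port as a client.
worker = dist.TCPStore("127.0.0.1", 29750, world_size=2, is_master=False,
                       timeout=timedelta(seconds=10), wait_for_workers=False)

master.set("role_info/0", "ready")
# get() blocks until the key appears or the timeout expires -- the
# DistStoreError above is exactly this wait hitting its 900000 ms limit.
print(worker.get("role_info/0").decode())
```

If this pattern works locally but the two-node launch still times out, the worker node is most likely never reaching the master's store (firewall, wrong interface, or asymmetric routing), not a problem in the training script itself.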
Fine-tuning the GOT-OCR2 model across two separate machines eventually results in this timeout.
Command on machine 1:
Command on machine 2:
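Since the log shows the rendezvous socket timing out rather than being refused, it may help to verify from node 1 that the master's rendezvous port is reachable at all before retrying the launch. A minimal sketch, assuming the same 192.118.2.17:23456 endpoint as in the command above:

```python
# Quick TCP reachability check for the rendezvous endpoint.
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Run this on node 1 while the launcher is already running on node 0;
    # host and port below are the master endpoint from the reported command.
    print(can_reach("192.118.2.17", 23456))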