Hello,
I installed rtdetrv2_pytorch from its requirements.txt, but the command below did not work:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=9909 --nproc_per_node=4 tools/train.py -c path/to/config --use-amp --seed=0 &> log.txt 2>&1 &
It gave an error, so I changed it to:
torchrun --master_port=9909 --nproc_per_node=4 tools/train.py -c path/to/config --use-amp --seed=0
`(rdetr1) C:\Users\sdurmus>torchrun --master_port=9909 --nproc_per_node=4 tools/train.py -c path/to/config --use-amp --seed=0 > log.txt
W0803 15:25:11.983889 2408 torch\distributed\elastic\multiprocessing\redirects.py:27] NOTE: Redirects are currently not supported in Windows or MacOs.
W0803 15:25:12.013218 2408 torch\distributed\run.py:757]
W0803 15:25:12.013218 2408 torch\distributed\run.py:757]
W0803 15:25:12.013218 2408 torch\distributed\run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0803 15:25:12.013218 2408 torch\distributed\run.py:757]
[W socket.cpp:697] [c10d] The client socket has failed to connect to [SDT]:9909 (system error: 10049 - The requested address is not valid in its context.).
C:\Users\sdurmus\anaconda3\envs\rdetr1\python.exe: can't open file 'C:\Users\sdurmus\tools\train.py': [Errno 2] No such file or directory
C:\Users\sdurmus\anaconda3\envs\rdetr1\python.exe: can't open file 'C:\Users\sdurmus\tools\train.py': [Errno 2] No such file or directory
C:\Users\sdurmus\anaconda3\envs\rdetr1\python.exe: can't open file 'C:\Users\sdurmus\tools\train.py': [Errno 2] No such file or directory
C:\Users\sdurmus\anaconda3\envs\rdetr1\python.exe: can't open file 'C:\Users\sdurmus\tools\train.py': [Errno 2] No such file or directory
E0803 15:25:17.050683 2408 torch\distributed\elastic\multiprocessing\api.py:826] failed (exitcode: 2) local_rank: 0 (pid: 17064) of binary: C:\Users\sdurmus\anaconda3\envs\rdetr1\python.exe
Traceback (most recent call last):
File "C:\Users\sdurmus\anaconda3\envs\rdetr1\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\sdurmus\anaconda3\envs\rdetr1\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\sdurmus\anaconda3\envs\rdetr1\Scripts\torchrun.exe\__main__.py", line 7, in <module>
File "C:\Users\sdurmus\anaconda3\envs\rdetr1\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "C:\Users\sdurmus\anaconda3\envs\rdetr1\lib\site-packages\torch\distributed\run.py", line 879, in main
run(args)
File "C:\Users\sdurmus\anaconda3\envs\rdetr1\lib\site-packages\torch\distributed\run.py", line 870, in run
elastic_launch(
File "C:\Users\sdurmus\anaconda3\envs\rdetr1\lib\site-packages\torch\distributed\launcher\api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "C:\Users\sdurmus\anaconda3\envs\rdetr1\lib\site-packages\torch\distributed\launcher\api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED
Failures:
[1]:
  time : 2024-08-03_15:25:17
  host : SDT
  rank : 1 (local_rank: 1)
  exitcode : 2 (pid: 12492)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time : 2024-08-03_15:25:17
  host : SDT
  rank : 2 (local_rank: 2)
  exitcode : 2 (pid: 9616)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time : 2024-08-03_15:25:17
  host : SDT
  rank : 3 (local_rank: 3)
  exitcode : 2 (pid: 22120)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time : 2024-08-03_15:25:17
  host : SDT
  rank : 0 (local_rank: 0)
  exitcode : 2 (pid: 17064)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html`
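
For reference, the "can't open file" lines in the log suggest that the relative argument `tools/train.py` was resolved against the directory the command was launched from (`C:\Users\sdurmus`), not against the repository folder. Below is a minimal sketch of that check. The helper name `resolve_script` is hypothetical and is not part of rtdetrv2_pytorch or torchrun; it only illustrates how a relative script path maps onto the current working directory.

```python
from pathlib import Path


def resolve_script(script: str, cwd: Path) -> tuple[Path, bool]:
    """Return the path a relative script argument maps to under `cwd`,
    together with a flag saying whether a file exists at that path."""
    candidate = cwd / script
    return candidate, candidate.is_file()


if __name__ == "__main__":
    # Run this from the same prompt and directory used for torchrun.
    path, found = resolve_script("tools/train.py", Path.cwd())
    print(f"{path} -> {'found' if found else 'missing'}")
```

If this reports the file as missing, the launch directory is probably not the folder that contains `tools/`, which would match the `[Errno 2]` message above.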