hkchengrex / XMem

[ECCV 2022] XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
https://hkchengrex.com/XMem/
MIT License

An error occurred while using the training command #138

Closed. Air1000thsummer closed this issue 6 months ago.

Air1000thsummer commented 6 months ago

I encountered an error while running the training command you provided:

python -m torch.distributed.run --master_port 25764 --nproc_per_node=2 train.py --exp_id retrain-a6000 --stage 03

May I ask how to resolve this issue? The output is below.

WARNING:__main__:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


CUDA Device count: 4
CUDA Device count: 4
Traceback (most recent call last):
  File "/home/hyf/WorkPlace/XMemWorkShop/main-XMem/train.py", line 36, in <module>
    repo = git.Repo(".")
  File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/git/repo/base.py", line 276, in __init__
    raise InvalidGitRepositoryError(epath)
git.exc.InvalidGitRepositoryError: /home/hyf/WorkPlace/XMemWorkShop/main-XMem
Traceback (most recent call last):
  File "/home/hyf/WorkPlace/XMemWorkShop/main-XMem/train.py", line 36, in <module>
    repo = git.Repo(".")
  File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/git/repo/base.py", line 276, in __init__
    raise InvalidGitRepositoryError(epath)
git.exc.InvalidGitRepositoryError: /home/hyf/WorkPlace/XMemWorkShop/main-XMem
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1066960) of binary: /home/hyf/anaconda3/envs/xmem-repro/bin/python
Traceback (most recent call last):
  File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/run.py", line 728, in <module>
    main()
  File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
  time      : 2024-03-06_20:58:15
  host      : user-MD72-HB3-00
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1066961)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2024-03-06_20:58:15
  host      : user-MD72-HB3-00
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1066960)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
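
From the tracebacks, both worker processes crash at line 36 of train.py, where repo = git.Repo(".") is called: GitPython raises git.exc.InvalidGitRepositoryError because the working directory is not inside a git repository, so there is no .git metadata to open. One way to confirm this from the shell, using the path shown in the traceback (adjust it to your own checkout):

git -C /home/hyf/WorkPlace/XMemWorkShop/main-XMem rev-parse --is-inside-work-tree

If the directory is not part of a git work tree, this prints a "not a git repository" error, matching the failure above; otherwise it prints "true".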

hkchengrex commented 6 months ago

You would need to clone the repo (git clone) or initialize git in the directory.
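
For anyone who hits the same error, a minimal sketch of the two options above, using the directory path from the traceback (adjust paths to your setup):

# Option 1: obtain the code via git clone so the .git metadata comes with it
git clone https://github.com/hkchengrex/XMem.git

# Option 2: initialize git inside the existing copy of the code
cd /home/hyf/WorkPlace/XMemWorkShop/main-XMem
git init

After either step, git.Repo(".") in train.py can open the repository, which resolves the InvalidGitRepositoryError shown above.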

Air1000thsummer commented 6 months ago

Thank you for your reply; the problem has been solved.