Closed haoshuai714 closed 2 years ago
Hi, I haven't met this problem, and the error message does not seem to point to any part of the pretraining code.
Thanks! Maybe it's a Python/PyTorch version mismatch. Could you provide a requirements file, e.g. Python version, torch version, etc.? Thank you!
This code has been tested on Python 3.8 and PyTorch 1.9.
I have a problem in the pretraining phase, such as:
Traceback (most recent call last):
  File "Pretrain.py", line 215, in <module>
    main(args, config)
  File "Pretrain.py", line 93, in main
    utils.init_distributed_mode(args)
  File "/root/albef/ALBEF/utils.py", line 257, in init_distributed_mode
    torch.distributed.barrier()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Have you ever had a similar problem? Thanks!
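An ncclSystemError like the one above is usually an environment issue (NCCL binding the wrong network interface, or shared-memory limits inside a container) rather than a bug in the training code. A minimal first-diagnostic sketch, assuming a Linux machine; the interface name "eth0" is an assumption, check `ip addr` for yours:

```python
# Sketch: turn on NCCL's own diagnostics before torch.distributed initializes.
# NCCL_DEBUG=INFO makes NCCL log which subsystem failed; NCCL_SOCKET_IFNAME
# pins NCCL to a known-good network interface ("eth0" is an assumption).
import os

def enable_nccl_debug(ifname: str = "eth0") -> dict:
    """Set NCCL debug env vars; must run before init_process_group()."""
    os.environ["NCCL_DEBUG"] = "INFO"          # NCCL reports the failing step
    os.environ["NCCL_SOCKET_IFNAME"] = ifname  # avoid virtual/docker interfaces
    return {k: os.environ[k] for k in ("NCCL_DEBUG", "NCCL_SOCKET_IFNAME")}

print(enable_nccl_debug())
# then launch as before, e.g. torch.distributed.init_process_group(backend="nccl", ...)
```

With these set, the next run should print NCCL's own log lines before the error, which narrows down whether the failure is socket- or shared-memory-related.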
Check your Python and PyTorch versions!
My Python version is 3.7 and my torch version is 1.8.0. Is there anything wrong? Thanks for your answer.
The author's stated versions: "This code has been tested on Python 3.8 and PyTorch 1.9."
Could you provide an "environment.yml" file listing the install dependencies, such as: https://github.com/jonmun/EPIC-KITCHENS-100_UDA_TA3N/blob/main/environment.yml ? Thanks!
Thanks. Would you please tell me your e-mail, and I'll send the document to you.
haoxiaoshuai@iie.ac.cn
I have already sent the document to your e-mail. Looking forward to your reply. Thank you!
I have a problem in the pretraining phase: partway through the run, the program shuts down, such as:

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6593 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6593 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6594 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6595 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6596 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6597 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6598 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6599 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6600 closing signal SIGTERM
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 6523 got signal: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
    result = agent.run()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 716, in run
    self._shutdown(e.sigval)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 190, in _shutdown
    self._pcontext.close(death_sig)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 330, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 709, in _close
    if handler.proc.poll() is None:
  File "/usr/lib/python3.6/subprocess.py", line 875, in poll
    return self._internal_poll()
  File "/usr/lib/python3.6/subprocess.py", line 1403, in _internal_poll
    pid, sts = _waitpid(self.pid, _WNOHANG)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 6523 got signal: 1
Have you ever had a similar problem?
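For what it's worth, the "got signal: 1" above is SIGHUP, which the launcher typically receives when the controlling terminal (e.g. the SSH session that started training) disconnects. Running the job under nohup or tmux avoids this without code changes; an in-code alternative is sketched below (Linux-only assumption, since SIGHUP does not exist on Windows):

```python
# Sketch: signal 1 is SIGHUP, usually delivered when the terminal that
# launched training hangs up. Ignoring it lets the run outlive the terminal;
# running under nohup/tmux achieves the same effect without code changes.
import signal

signal.signal(signal.SIGHUP, signal.SIG_IGN)  # ignore terminal hang-up
print("SIGHUP value:", int(signal.SIGHUP))
```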