salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

Pretrain phase problem #45

Closed haoshuai714 closed 2 years ago

haoshuai714 commented 2 years ago

I have a problem in the pretrain phase: partway through the run, the program shuts down with the following output:

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6593 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6593 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6594 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6595 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6596 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6597 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6598 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6599 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6600 closing signal SIGTERM
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 6523 got signal: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
    result = agent.run()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/api.py", line 716, in run
    self._shutdown(e.sigval)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 190, in _shutdown
    self._pcontext.close(death_sig)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 330, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 709, in _close
    if handler.proc.poll() is None:
  File "/usr/lib/python3.6/subprocess.py", line 875, in poll
    return self._internal_poll()
  File "/usr/lib/python3.6/subprocess.py", line 1403, in _internal_poll
    pid, sts = _waitpid(self.pid, _WNOHANG)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 6523 got signal: 1

Have you ever had a similar problem?

LiJunnan1992 commented 2 years ago

Hi, I haven't encountered this problem, and the error message does not seem to point to any part of the pretraining code.
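
One thing the log does show: the "1 death signal" in the first warning is SIGHUP, which means the launcher process itself was hung up on (for example, the terminal or SSH session that started the job was closed), rather than something failing inside Pretrain.py. A purely illustrative check, not part of the ALBEF code:

```python
# Hypothetical snippet: decode the "death signal" number that
# torch.distributed.elastic reports in the warning above.
import signal

print(signal.Signals(1).name)  # prints "SIGHUP" on Linux
```

If that is indeed the cause, running the launch command under nohup, tmux, or screen usually keeps the elastic agent alive after the terminal goes away.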

haoshuai714 commented 2 years ago

Thanks! Maybe it is a Python/PyTorch version mismatch. Could you provide a requirements file listing the Python version, torch version, etc.? Thank you!

LiJunnan1992 commented 2 years ago

This code has been tested with Python 3.8 and PyTorch 1.9.
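
If it helps with comparing environments, here is a generic snippet (not part of the ALBEF repo) that prints the locally installed versions so they can be checked against the ones above:

```python
# Print the local Python / PyTorch / CUDA / NCCL versions to compare against
# the versions reported above (Python 3.8, PyTorch 1.9).
import sys
import torch

print("python:", sys.version.split()[0])
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)         # None for a CPU-only build
print("nccl  :", torch.cuda.nccl.version())  # only meaningful for a CUDA build
```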

Junjie-Ye commented 2 years ago

I have a problem at the pretrain phase. Each worker process prints the following traceback (the original output is interleaved across processes):

Traceback (most recent call last):
  File "Pretrain.py", line 215, in <module>
    main(args, config)
  File "Pretrain.py", line 93, in main
    utils.init_distributed_mode(args)
  File "/root/albef/ALBEF/utils.py", line 257, in init_distributed_mode
    torch.distributed.barrier()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

Have you ever had a similar problem? Thanks!
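
For what it's worth, an ncclSystemError raised at torch.distributed.barrier() usually points to the machine's network or shared-memory setup rather than to the training code. A hedged sketch of common first debugging steps (these are generic NCCL settings, not ALBEF-specific, and the interface name is a placeholder):

```python
# Generic NCCL troubleshooting sketch. Set these before torch.distributed is
# initialized (ALBEF does that inside utils.init_distributed_mode), or export
# the same variables in the shell that launches Pretrain.py.
import os

os.environ["NCCL_DEBUG"] = "INFO"          # make NCCL print which system call failed
os.environ["NCCL_IB_DISABLE"] = "1"        # fall back to TCP if InfiniBand is misconfigured
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # placeholder: set to the NIC your nodes actually share
```

With NCCL_DEBUG=INFO the library should report which system call failed, which narrows the problem down to sockets, shared memory, or InfiniBand.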

haoshuai714 commented 2 years ago

Check your Python and PyTorch versions!

Junjie-Ye commented 2 years ago

My Python version is 3.7 and my torch version is 1.8.0. Is there anything wrong? Thanks for your answer.

haoshuai714 commented 2 years ago

The author said above that this code has been tested with Python 3.8 and PyTorch 1.9.

haoshuai714 commented 2 years ago

Could you provide an "environment.yml" file that lists the install dependencies, for example like this one: https://github.com/jonmun/EPIC-KITCHENS-100_UDA_TA3N/blob/main/environment.yml? Thanks!
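
Until such a file exists in the repo, a working environment can at least be shared as a frozen package list. A minimal sketch (the output file name is arbitrary, and this is not the author's environment):

```python
# Dump the packages installed in the current (working) environment so they can
# be shared in place of an environment.yml. pkg_resources is available on the
# Python 3.6/3.7 setups mentioned in this thread.
import pkg_resources

with open("requirements-frozen.txt", "w") as f:
    for dist in sorted(pkg_resources.working_set, key=lambda d: d.project_name.lower()):
        f.write(f"{dist.project_name}=={dist.version}\n")
```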

Junjie-Ye commented 2 years ago

Thanks. Would you please tell me your e-mail address, and I'll send the document to you?

haoshuai714 commented 2 years ago

haoxiaoshuai@iie.ac.cn

Junjie-Ye commented 2 years ago

I have already sent the document to your e-mail. Looking forward to your reply. Thank you!