microsoft / UniVL

An official implementation for "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
https://arxiv.org/abs/2002.06353
MIT License
339 stars, 54 forks

Error message (torch.distributed.elastic.multiprocessing.errors.ChildFailedError:) #44

Closed tingchihc closed 2 years ago

tingchihc commented 2 years ago

I am trying to test on my own data, but I get the error message below. Can you help me fix it? Thanks.

Traceback (most recent call last):
  File "/home/tingchih/anaconda3/envs/py_univl/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/tingchih/anaconda3/envs/py_univl/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/tingchih/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/tingchih/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/tingchih/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/tingchih/.local/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/tingchih/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/tingchih/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main_task_caption.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-10-05_12:41:40
  host      : nlplab1
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 36124)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
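
Note that the summary above only says that the worker on rank 3 exited with code 1; the launcher does not show the exception that actually caused the exit. The page linked in the `traceback` field describes a `record` decorator from `torch.distributed.elastic.multiprocessing.errors` that lets the launcher report the worker's own traceback. The following is a minimal sketch of that idea, not the repository's code: the `main` function here is a hypothetical stand-in for the entry point of `main_task_caption.py`, and the `ImportError` fallback is only there so the sketch runs where PyTorch is not installed.

```python
# Sketch: wrap the script's entry point so the elastic launcher can report
# the worker's real exception instead of only "exitcode : 1".
try:
    # Provided by PyTorch releases that include torch.distributed.elastic.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Assumption for illustration: fall back to a no-op decorator so the
    # sketch still runs without PyTorch.
    def record(fn):
        return fn


@record
def main():
    # Hypothetical stand-in for the real training/evaluation entry point.
    # Any uncaught exception raised here is recorded for the launcher.
    return 0


if __name__ == "__main__":
    main()
```

With the decorator in place, rerunning the same launch command should fill in the `error_file` and `traceback` fields for the failing rank. Running the script on a single GPU without the distributed launcher is another way to see the underlying exception directly.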