ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 150437) of binary: /data1/anconda3/envs/otavatar/bin/python

wangxuanx commented 1 year ago

I ran into this error while deploying according to the README, and I'm not sure what caused it. The output and error log from my run are below. Could you please take a look and suggest a solution? @theEricMa

Loading ResNet ArcFace
loading id loss module: <All keys matched successfully>
Loading ResNet ArcFace
loading id loss module: <All keys matched successfully>
Loss perceptual_inverse_lr Weight 1.0
Loss perceptual_inverse_sr Weight 1.0
Loss perceptual_refine_lr Weight 1.0
Loss perceptual_refine_sr Weight 1.0
Loss monotonic            Weight 1.0
Loss TV                   Weight 1.0
Loss pixel                Weight 1
Loss a_norm               Weight 0.0
Loss a_mutual             Weight 0.0
Loss local                Weight 10.0
Loss local_s              Weight 10.0
Loss id                   Weight 1.0
Loss id_s                 Weight 1.0
We train Generator
load [net_Warp] and [net_Warp_ema] from result/otavatar/epoch_00005_iteration_000002000_checkpoint.pt
Done with loading the checkpoint.
  0%|          | 0/19 [00:00<?, ?it/s]
Setting up PyTorch plugin "bias_act_plugin"... Done.
  0%|          | 0/3537 [00:00<?, ?it/s]
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:22<00:00,  4.51it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:16<00:00,  6.12it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3537/3537 [06:13<00:00,  9.48it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 150437) of binary: /data1/anconda3/envs/otavatar/bin/python
Traceback (most recent call last):
  File "/data1/anconda3/envs/otavatar/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data1/anconda3/envs/otavatar/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data1/anconda3/envs/otavatar/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/data1/anconda3/envs/otavatar/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/data1/anconda3/envs/otavatar/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/data1/anconda3/envs/otavatar/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/data1/anconda3/envs/otavatar/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data1/anconda3/envs/otavatar/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
inference_refine_1D_cam.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-23_21:02:27
  host      : zss-Precision-5820-Tower-X-Series
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 150437)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 150437
=======================================================

theEricMa commented 1 year ago

Hi, to find a solution, please take a look at the related blog posts and issues for this error. Exit code -9 means the process received SIGKILL, and one likely cause is that the machine runs out of memory while loading a large amount of data. To check, you can watch memory usage with the htop command while the script is running.
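
If you would rather check this from Python than from htop, here is a minimal sketch under a few assumptions: it runs on Linux, where `/proc/meminfo` exists, and the helper names (`describe_exit_code`, `parse_meminfo`, `available_gib`) are made up for illustration and are not part of OTAvatar or PyTorch. It decodes the negative exit code from the log into a signal name and reports how much memory the kernel considers available.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Translate a child-process exit code into a readable description."""
    if exitcode < 0:
        # A negative code means the child was terminated by signal -exitcode.
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field_name: integer_value}.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_gib(meminfo_path: str = "/proc/meminfo") -> float:
    """Return the kernel's MemAvailable estimate in GiB (Linux only)."""
    with open(meminfo_path) as f:
        info = parse_meminfo(f.read())
    return info["MemAvailable"] / (1024 ** 2)


if __name__ == "__main__":
    # -9 is the exit code reported in the log above.
    print(describe_exit_code(-9))
    print(f"{available_gib():.1f} GiB available")
```

Printing `available_gib()` before and after the data-loading step would show whether memory drops sharply there. If the process was killed by the kernel's out-of-memory killer, `dmesg` usually contains an "Out of memory: Killed process" line with the same PID.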

904763189cy commented 11 months ago

@wangxuanx Hi, have you solved this problem? I ran into the same error.
