When I deployed according to the README, I encountered this issue. I'm not sure what caused it. Below is the error log from my run. Please take a look and suggest a solution. @theEricMa
Loading ResNet ArcFace
loading id loss module: <All keys matched successfully>
Loading ResNet ArcFace
loading id loss module: <All keys matched successfully>
Loss perceptual_inverse_lr Weight 1.0
Loss perceptual_inverse_sr Weight 1.0
Loss perceptual_refine_lr Weight 1.0
Loss perceptual_refine_sr Weight 1.0
Loss monotonic Weight 1.0
Loss TV Weight 1.0
Loss pixel Weight 1
Loss a_norm Weight 0.0
Loss a_mutual Weight 0.0
Loss local Weight 10.0
Loss local_s Weight 10.0
Loss id Weight 1.0
Loss id_s Weight 1.0
We train Generator
load [net_Warp] and [net_Warp_ema] from result/otavatar/epoch_00005_iteration_000002000_checkpoint.pt
Done with loading the checkpoint.
0%|          | 0/3537 [00:00<?, ?it/s]
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:22<00:00, 4.51it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:16<00:00, 6.12it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3537/3537 [06:13<00:00, 9.48it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 150437) of binary: /data1/anconda3/envs/otavatar/bin/python
Traceback (most recent call last):
File "/data1/anconda3/envs/otavatar/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data1/anconda3/envs/otavatar/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data1/anconda3/envs/otavatar/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/data1/anconda3/envs/otavatar/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/data1/anconda3/envs/otavatar/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/data1/anconda3/envs/otavatar/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/data1/anconda3/envs/otavatar/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data1/anconda3/envs/otavatar/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
inference_refine_1D_cam.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-05-23_21:02:27
host : zss-Precision-5820-Tower-X-Series
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 150437)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 150437
=======================================================
Hi, please take a look at related blog posts or issues on this error. Exit code -9 means the process received SIGKILL, which on Linux usually comes from the kernel's out-of-memory (OOM) killer — the machine is likely running out of RAM while loading a large amount of data. You can watch memory usage during the run with the `htop` command.
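To confirm the OOM hypothesis from inside the training or inference script, you can log the process's peak memory just before the point where it dies. Below is a minimal sketch using only the Python standard library (the helper name `peak_rss_mb` is mine, not part of this repo; `ru_maxrss` is reported in KiB on Linux, bytes on macOS):

```python
import resource


def peak_rss_mb():
    # Peak resident set size of this process so far.
    # On Linux, ru_maxrss is in KiB; divide by 1024 to get MiB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


# Call this periodically (e.g. once per epoch or per N batches)
# to see how close you are to the machine's RAM limit.
print(f"peak RSS: {peak_rss_mb():.1f} MiB")
```

If the process was already killed, you can also check the kernel log with something like `dmesg | grep -i "out of memory"` to see whether the OOM killer targeted your PID. Common mitigations are reducing the dataloader's `num_workers`, lowering the batch size, or loading the dataset lazily instead of all at once.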