theEricMa / OTAvatar

This is the official repository for OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering [CVPR2023].

Multi-GPU training error (RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3) #28

Closed szh-bash closed 6 months ago

szh-bash commented 6 months ago

@theEricMa Single-GPU training runs fine in my server environment, but multi-GPU training fails. After changing nproc_per_node from 1 to 4, the following error occurred.

(otavatar) ➜  OTAvatar git:(main) ✗ CUDA_VISIBLE_DEVICES=2,3,4,5 python -m torch.distributed.launch --nproc_per_node=4 --master_port 12222 train_inversion.py --config ./config/otavatar.yaml --name otavatar_gpu4
...
loading id loss module: <All keys matched successfully>
loading id loss module: <All keys matched successfully>
Loss perceptual_inverse_lr Weight 1.0
Loss perceptual_inverse_sr Weight 1.0
Loss perceptual_refine_lr Weight 1.0
Loss perceptual_refine_sr Weight 1.0
Loss monotonic            Weight 1.0
Loss TV                   Weight 1.0
Loss pixel                Weight 1
Loss a_norm               Weight 0.0
Loss a_mutual             Weight 0.0
Loss local                Weight 10.0
Loss local_s              Weight 10.0
Loss id                   Weight 1.0
Loss id_s                 Weight 1.0
loading id loss module: <All keys matched successfully>
loading id loss module: <All keys matched successfully>
Loading model from: /gpfsdata/home/x/OTAvatar/third_part/PerceptualSimilarity/weights/v0.1/alex.pth
We train Generator
Loading model from: /gpfsdata/home/x/OTAvatar/third_part/PerceptualSimilarity/weights/v0.1/alex.pth
We train Generator
No checkpoint found.
Epoch 0 ...
Loading model from: /gpfsdata/home/x/OTAvatar/third_part/PerceptualSimilarity/weights/v0.1/alex.pth
We train Generator
Loading model from: /gpfsdata/home/x/OTAvatar/third_part/PerceptualSimilarity/weights/v0.1/alex.pth
We train Generator

  0%|          | 0/2 [00:00<?, ?it/s]Setting up PyTorch plugin "bias_act_plugin"... Setting up PyTorch plugin "bias_act_plugin"... Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "upfirdn2d_plugin"... Setting up PyTorch plugin "upfirdn2d_plugin"... Done.

  0%|          | 0/100 [00:00<?, ?it/s]Done.
Setting up PyTorch plugin "upfirdn2d_plugin"... Setting up PyTorch plugin "upfirdn2d_plugin"... Done.
Done.
Done.
Done.
Traceback (most recent call last):
  File "/gpfsdata/home/x/OTAvatar/loss/identity.py", line 353, in forward
    loss = criterion(self.facenet(gt_align).detach(), self.facenet(pred_align))
  File "/gpfsdata/home/x/miniconda3/envs/otavatar/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpfsdata/home/x/miniconda3/envs/otavatar/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 154, in forward
    raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3
...

Full log attached: err.log. What could be the possible reasons?

szh-bash commented 6 months ago

The cause was my own incorrect handling of the 'module.xx.xxx'-prefixed parameters in arcface_resnet.pth, introduced by code I added a few days ago:

self.facenet = nn.DataParallel(self.facenet)  # wraps parameters as module.weight (loss/identity.py)

After deleting this line and handling the 'module.' prefix in arcface_resnet.pth correctly, everything works again now.
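For anyone hitting the same error, here is a minimal sketch of one way to handle the prefix, assuming the checkpoint was saved from an nn.DataParallel-wrapped model (the helper name and the checkpoint path are illustrative, not from the repository):

```python
import torch

def load_without_module_prefix(model, ckpt_path):
    # Illustrative helper: checkpoints saved from an nn.DataParallel model
    # store keys as 'module.<name>'; strip that prefix so the state dict
    # loads into the plain, unwrapped module.
    state_dict = torch.load(ckpt_path, map_location='cpu')
    cleaned = {k[len('module.'):] if k.startswith('module.') else k: v
               for k, v in state_dict.items()}
    model.load_state_dict(cleaned)
    return model

# Instead of wrapping self.facenet in nn.DataParallel:
# self.facenet = load_without_module_prefix(self.facenet, 'arcface_resnet.pth')
```

This keeps self.facenet a plain module, so distributed training does not collide with a stray nn.DataParallel wrapper that expects all parameters on cuda:0 (device_ids[0]).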