Closed by yuliangguo 1 year ago
Hi,
This is caused by the MinibatchStd layer in the StyleGAN2 discriminator. The batch size (per-GPU) must be divisible by 4. With 4 GPUs, you should aim for a total batch size of at least 16 samples. If you cannot reach it, you can try to comment out that layer.
Anyway, training might be unstable with such a small batch size. In that case, you might also want to increase the strength of the R1 regularization.
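The divisibility constraint described above can be checked before launching a run. The following is a minimal sketch, not code from the repository: the helper names are hypothetical, and it assumes that the data-parallel wrapper splits the batch along dimension 0 the way `torch.chunk` does, and that every replica's share must be a multiple of the MinibatchStd group size.

```python
import math


def dataparallel_chunk_sizes(total_batch, num_gpus):
    """Approximate the per-replica batch sizes (hypothetical helper).

    Assumption: the batch is split like torch.chunk, i.e. chunks of
    ceil(total_batch / num_gpus) samples, with a smaller last chunk if needed.
    """
    chunk = math.ceil(total_batch / num_gpus)
    sizes = []
    remaining = total_batch
    while remaining > 0:
        size = min(chunk, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


def mbstd_compatible(total_batch, num_gpus, group_size=4):
    """True if every replica's batch is a multiple of the group size."""
    sizes = dataparallel_chunk_sizes(total_batch, num_gpus)
    return all(n % group_size == 0 for n in sizes)


print(mbstd_compatible(8, 4))   # 2 samples per GPU, group size 4
print(mbstd_compatible(16, 4))  # 4 samples per GPU, group size 4
```

Under these assumptions, the first call reports an incompatible configuration and the second a compatible one, which matches the advice to use a total batch size of at least 16 on 4 GPUs.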
Thanks a lot for replying. Another option I found is to set mbstd_group_size=2. I will check whether this causes any training stability issues.
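For other batch sizes, one way to pick a workable `mbstd_group_size` is to take the largest value, up to the default of 4, that divides the per-GPU batch size. This is only a sketch: the helper name is hypothetical and not part of the repository, and it assumes the only requirement is that the group size divides the per-GPU batch.

```python
def largest_valid_group_size(per_gpu_batch, max_group_size=4):
    """Largest group size <= max_group_size that divides per_gpu_batch.

    Hypothetical helper; assumes per_gpu_batch is a positive integer.
    """
    if per_gpu_batch < 1:
        raise ValueError("per_gpu_batch must be a positive integer")
    for group_size in range(min(max_group_size, per_gpu_batch), 0, -1):
        if per_gpu_batch % group_size == 0:
            return group_size


# Total batch size 8 on 4 GPUs gives 2 samples per GPU.
print(largest_valid_group_size(8 // 4))
```

With 2 samples per GPU this yields a group size of 2, the value mentioned above. A group size of 1 always divides the batch, but whether the layer still provides a useful statistic at that size is a separate question.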
It would be helpful to confirm whether this is a common issue. Due to limited resources, I have to use batch_size 8. When I run
python run.py --dataset shapenet_cars --path_length_regularization --gpus 4 --batch_size 8
training starts and runs for a while, then fails with this error. It happens with both p3d_cars and shapenet_cars, so it is probably not caused by the data. The full error message is copied below.
DF pre-training done.
Traceback (most recent call last):
File "run.py", line 983, in
discriminated = target_discriminator(img_batch, i,
File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data_ssd/guo1syv/Projects/nerf-from-image/models/discriminator.py", line 80, in forward
return self.backbone(x, cond)
File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data_ssd/guo1syv/Projects/nerf-from-image/models/stylegan.py", line 672, in forward
x = self.b4(x, cmap)
File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data_ssd/guo1syv/Projects/nerf-from-image/models/stylegan.py", line 597, in forward
x = self.mbstd(x)
File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data_ssd/guo1syv/Projects/nerf-from-image/models/stylegan.py", line 556, in forward
y = x.reshape(ng, -1, f, nc, h, w)
RuntimeError: shape '[4, -1, 1, 512, 4, 4]' is invalid for input of size 16384
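The numbers in the final error line can be checked by hand. The short sketch below uses only the values printed in the message. It interprets the last three dimensions as channels, height and width, following the variable names `nc, h, w` in the failing `reshape` call.

```python
from math import prod

# Values copied from the error message.
target_shape = (4, -1, 1, 512, 4, 4)
input_size = 16384

# Elements in one sample of the feature map: channels * height * width.
per_sample = 512 * 4 * 4
samples_on_replica = input_size // per_sample

# reshape can infer the -1 dimension only if the product of the known
# dimensions divides the total number of elements.
known = prod(d for d in target_shape if d != -1)
fits = input_size % known == 0

print(samples_on_replica)  # 2
print(known, fits)         # 32768 False
```

So the replica received 2 samples (8 samples split over 4 GPUs), while the reshape needs a multiple of 4 samples to form groups of size 4.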