RuntimeError: shape '[4, -1, 1, 512, 4, 4]' is invalid for input of size 16384

yuliangguo commented 1 year ago

It would be helpful to confirm whether this is a common issue. Due to limited resources, I have to use a batch size of 8. When I run python run.py --dataset shapenet_cars --path_length_regularization --gpus 4 --batch_size 8, training starts and runs for a while, then fails with this error. The error occurs with both p3d_cars and shapenet_cars, so it is probably not a data issue.

The full error message is copied below.

DF pre-training done.
Traceback (most recent call last):
  File "run.py", line 983, in <module>
    discriminated = target_discriminator(img_batch, i,
  File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data_ssd/guo1syv/Projects/nerf-from-image/models/discriminator.py", line 80, in forward
    return self.backbone(x, cond)
  File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data_ssd/guo1syv/Projects/nerf-from-image/models/stylegan.py", line 672, in forward
    x = self.b4(x, cmap)
  File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data_ssd/guo1syv/Projects/nerf-from-image/models/stylegan.py", line 597, in forward
    x = self.mbstd(x)
  File "/data_ssd/guo1syv/anaconda3/envs/nerf-from-image/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data_ssd/guo1syv/Projects/nerf-from-image/models/stylegan.py", line 556, in forward
    y = x.reshape(ng, -1, f, nc, h, w)
RuntimeError: shape '[4, -1, 1, 512, 4, 4]' is invalid for input of size 16384

dariopavllo commented 1 year ago

Hi,

This is caused by the MinibatchStd layer in the StyleGAN2 discriminator. The batch size (per-GPU) must be divisible by 4. With 4 GPUs, you should aim for a total batch size of at least 16 samples. If you cannot reach it, you can try to comment out that layer.
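To make the arithmetic behind the error concrete, here is a minimal sketch of the shape check, assuming DataParallel splits the total batch evenly across GPUs and that the layer reshapes to (group_size, -1, f, nc, h, w) as in the traceback. The function name and its defaults are hypothetical and are not taken from the repository.

```python
def mbstd_group_shape(per_gpu_batch, channels=512, height=4, width=4,
                      group_size=4, num_features=1):
    """Return the shape a minibatch-stddev reshape would produce.

    Hypothetical helper: mirrors x.reshape(ng, -1, f, nc, h, w) from the
    traceback. The reshape is only valid when the per-GPU batch is a
    multiple of the group size.
    """
    total = per_gpu_batch * channels * height * width
    if per_gpu_batch % group_size != 0:
        raise ValueError(
            f"shape [{group_size}, -1, {num_features}, "
            f"{channels // num_features}, {height}, {width}] "
            f"is invalid for input of size {total}"
        )
    return (group_size, per_gpu_batch // group_size, num_features,
            channels // num_features, height, width)


# --batch_size 8 with --gpus 4 leaves 2 samples on each GPU.
per_gpu = 8 // 4
print(per_gpu * 512 * 4 * 4)  # 16384, the input size in the error message

# A total batch of 16 gives 4 samples per GPU, which fits one group of 4.
print(mbstd_group_shape(16 // 4))  # (4, 1, 1, 512, 4, 4)
```

With 2 samples per GPU the tensor holds 16384 elements, but a single group of 4 already needs 4 * 512 * 4 * 4 = 32768, so the reshape cannot succeed.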

Anyway, training might be unstable with such a small batch size. In that case, you might also want to increase the strength of the R1 regularization.

yuliangguo commented 1 year ago

Thanks a lot for replying. Another option I found is to set mbstd_group_size=2. I will check whether training remains stable with this setting.
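For other GPU counts or batch sizes, a compatible group size could be picked automatically. The sketch below is only an illustration: fit_group_size is a hypothetical helper, not part of the repository, and it assumes the only constraint is that the group size divides the per-GPU batch.

```python
def fit_group_size(per_gpu_batch, requested=4):
    """Largest group size <= requested that evenly divides the per-GPU batch.

    Hypothetical helper for choosing a value such as mbstd_group_size.
    """
    if per_gpu_batch < 1:
        raise ValueError("per-GPU batch must be at least 1")
    for g in range(min(requested, per_gpu_batch), 0, -1):
        if per_gpu_batch % g == 0:
            return g


# Total batch 8 on 4 GPUs: 2 samples per GPU, so the group size drops to 2.
print(fit_group_size(8 // 4))   # 2
# Total batch 16 on 4 GPUs: the default group size of 4 already fits.
print(fit_group_size(16 // 4))  # 4
```

A group size of 1 would make the reshape valid, but the standard deviation over a single sample is zero, so the layer would contribute no signal.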

google-research / nerf-from-image
