kwea123 / nerf_pl

NeRF (Neural Radiance Fields) and NeRF in the Wild using pytorch-lightning
https://www.youtube.com/playlist?list=PLDV2CyUo4q-K02pNEyDr7DYpTQuka3mbV
MIT License
2.74k stars 483 forks

Initialization issue of training both coarse and fine together #51

Closed wbjang closed 3 years ago

wbjang commented 3 years ago

Hi, my name is Wonbong Jang, I am working on NeRF based on your implementation. Thank you for sharing the great work!

I've tried with the tiny digger (From tiny_nerf implementation of the original repo - https://github.com/bmild/nerf) - it has the black background and 100 x 100.

When I train the coarse and fine networks together (N_importance = 128), both networks optimize only about 20% of the time. In another 40% of runs, only one of them (coarse or fine) trains, and in the remaining 40% neither is optimized. The learning rate is 5e-4 with the Adam optimizer, and training usually works well when I train the coarse model only.

I think this is probably an initialization issue. I am wondering if you have run into anything like this before; any insights would be appreciated.

Kind regards,

Wonbong

kwea123 commented 3 years ago

Yes, training on the blender dataset is tricky because it contains a high proportion of white background. You wrote black background, is that a typo? As I pointed out here, I suspect that this is due to the ReLU activation combined with bad initialization, which makes sigma always zero initially. A fix would be to replace torch.relu here https://github.com/kwea123/nerf_pl/blob/cff91ec73893b8dddf6e0e02bec104c2e6fe4bf5/models/rendering.py#L138 with torch.nn.Softplus()(sigmas+noise) so that sigma is always positive.
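The suggested change, as a minimal sketch (the dummy tensors below are illustrative, not the repo's actual variables):

```python
import torch

# Dummy raw density predictions and regularization noise
sigmas = torch.randn(4, 64)
noise = torch.randn_like(sigmas)

# ReLU zeroes out all negative pre-activations, so with a bad
# initialization sigma (and its gradient) can be stuck at zero.
relu_density = torch.relu(sigmas + noise)

# Softplus is a smooth approximation of ReLU that is strictly positive,
# so the density always receives a nonzero gradient.
softplus_density = torch.nn.Softplus()(sigmas + noise)

assert (softplus_density > 0).all()
```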

It is just a hypothesis. I'd like to have some statistics like the ones you collected. Please tell me if this change works and, if possible, how often it works, like you described above. Thanks in advance!

wbjang commented 3 years ago

Thank you for your reply! Actually, it was a black background - I am still looking into this issue.

I also think using Softplus may be a good idea - I have tried Softplus on the white background, and PSNR seems to increase a little (NeRF coarse only). I will test with the coarse and fine NeRF together and see whether this improves the results.

kwea123 commented 3 years ago

Oh, yes, in that example the data has a black background. It is just a matter of whether to add 1 later here: https://github.com/kwea123/nerf_pl/blob/df6810aa0e31b42269a9ce448e06cc09713173a5/models/rendering.py#L156-L157

If you use a black background, just pass white_back=False or leave out that line.
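What white_back does can be sketched as a simplified version of the volume-rendering compositing step (function and variable names here are assumptions, not the exact repo code):

```python
import torch

def composite(rgbs, alphas, white_back=True):
    """Alpha-composite per-sample colors along each ray.
    rgbs: (N_rays, N_samples, 3), alphas: (N_rays, N_samples)."""
    # Transmittance: probability the ray reaches each sample unoccluded
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), 1 - alphas + 1e-10], dim=1),
        dim=1,
    )[:, :-1]
    weights = alphas * trans                        # (N_rays, N_samples)
    acc = weights.sum(dim=1)                        # accumulated opacity per ray
    rgb = (weights.unsqueeze(-1) * rgbs).sum(dim=1)
    if white_back:
        # Fill the transparent remainder of the ray with white instead of black
        rgb = rgb + (1 - acc.unsqueeze(-1))
    return rgb, acc

# A fully transparent ray renders white with white_back=True, black otherwise
rgbs = torch.zeros(1, 8, 3)
alphas = torch.zeros(1, 8)
rgb_w, _ = composite(rgbs, alphas, white_back=True)
rgb_b, _ = composite(rgbs, alphas, white_back=False)
```

This is why the flag must match the dataset: with a white-background dataset and white_back=False, the network has to explain the background with density instead of transparency.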

The data with the white background can be found at https://drive.google.com/drive/folders/128yBriW1IG_3NJ5Rp7APSTZsJqdJdfc1

Regarding Softplus, I'd like to know whether it stabilizes the initial training consistently.

wbjang commented 3 years ago

Hello! Thank you for your reply.

I've tried with my own dataset - renderings of ShapeNet chairs with a white background.

First, I've found that the training instability was due to the weights.detach() used for importance sampling:

```python
# from https://github.com/kwea123/nerf_pl/blob/master/models/rendering.py
z_vals_ = sample_pdf(z_vals_mid, weights_coarse[:, 1:-1],
                     N_importance, det=(perturb == 0)).detach()
```

Using weights_coarse[:, 1:-1].clone().detach() instead is more stable. That is why, even with the black background, the fine and coarse models were not trained together.
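What the detach accomplishes here can be sketched minimally (the tensor below is a stand-in for the coarse network's per-sample weights; sample_pdf itself is omitted):

```python
import torch

# Stand-in for the coarse weights, shape (N_rays, N_samples),
# still attached to the coarse network's computation graph
weights_coarse = torch.rand(4, 64, requires_grad=True)

# Detach before importance sampling so the fine network's loss
# cannot backpropagate into the coarse network through the z samples
w = weights_coarse[:, 1:-1].clone().detach()

assert not w.requires_grad
loss = (w * 2).sum()        # pretend the fine pass consumed w
# The graph was cut at detach(), so this loss has no backward path
assert loss.grad_fn is None
```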

Second, I've tried Softplus instead of ReLU on images with a white background (ShapeNet chairs) - in this case it performs similarly to ReLU, or sometimes worse. (I've seen some failure cases for the coarse model, though the fine model works fine.) However, I will try more datasets to see whether Softplus can improve training.

Cheers,

Wonbong

kwea123 commented 3 years ago

I do not understand why .clone().detach() is "more stable"; for me, detach() is enough, and it can be done either outside or inside. In any case, theoretically they should do exactly the same thing. It's somewhat bothersome to prove that they are also practically equal though... You said "more stable" - does my original code fail very often, and by changing to weights_coarse[:, 1:-1].clone().detach(), does it always succeed?

kwea123 commented 3 years ago

https://discuss.pytorch.org/t/tensor-clone-detach-vs-tensor-detach/78695 According to this post, detach() and clone().detach() should behave the same, and detach() uses less memory.
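A quick check of that claim: both variants cut the autograd graph identically; the only observable difference is that clone() gives the result its own storage, while detach() returns a view sharing memory with the original.

```python
import torch

x = torch.randn(3, requires_grad=True)

a = x.detach()            # view of x: no grad tracking, shared storage
b = x.clone().detach()    # independent copy: no grad tracking, own storage

assert not a.requires_grad and not b.requires_grad
assert a.data_ptr() == x.data_ptr()    # detach() shares x's memory
assert b.data_ptr() != x.data_ptr()    # clone() allocated a new buffer
assert torch.equal(a, b)               # same values either way
```

So if anything inside sample_pdf modified the weights in place, the clone would shield the coarse network's tensor; otherwise the two are interchangeable, with detach() alone being cheaper.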

YZsZY commented 1 year ago

Hello! Do you know why we need to set white_back=True? Thanks a lot!