himanshu-daga opened 3 years ago
I also have a similar problem. I would be interested to know the changes you made to the loss function to avoid the black images.
Hi @anilesec,
By default, we consider all pixels when taking the RGB loss. For starters, simply restrict the loss to pixels whose RGB values are > 0 in the ground-truth image:
MSE(predicted_RGB, target_RGB + (target_RGB == 0) * predicted_RGB)
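In PyTorch, that masked loss might look roughly like this (a minimal sketch; `masked_mse` and the tensor names are mine, not code from the repo):

```python
import torch
import torch.nn.functional as F

def masked_mse(pred_rgb, target_rgb):
    # Wherever the ground truth is exactly 0 (pure black background),
    # replace the target with the prediction itself, so those pixels
    # contribute zero loss and zero gradient.
    target = torch.where(target_rgb == 0, pred_rgb.detach(), target_rgb)
    return F.mse_loss(pred_rgb, target)
```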
Hi @himanshu-daga, thanks for the response. If I understand correctly, I think this would make the model biased, or force it to learn features only from the non-zero pixels. Did you check the MSE values when it was producing black images? In my case, the MSE value was NaN.
Yes, I did check them and they were all 0 (not NaN). And I agree that it might make the model a bit biased, but that's something to resolve later, I guess. First and foremost, we need to be able to see some results so that we know where we are and how far we need to go; all-black images don't help.
Hi, thanks for the questions and sorry for the very slow response.
> the model used to produce all-black results
That's a good observation, and it is why we had to use bounding-box sampling for much of the training. The learning rate also has an effect, as you observed.
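For context, the idea is to restrict pixel/ray sampling to the object's 2D bounding box so that most sampled rays carry signal instead of hitting empty background. A minimal sketch of one way to do this, assuming a per-image foreground mask (the function and names are illustrative, not necessarily what the repo does):

```python
import torch

def sample_pixels_in_bbox(mask, n_rays):
    # mask: (H, W) boolean foreground mask for one training image.
    # Restrict pixel sampling to the object's 2D bounding box so that
    # most sampled rays hit the object rather than empty background.
    ys, xs = torch.nonzero(mask, as_tuple=True)
    y0, y1 = int(ys.min()), int(ys.max())
    x0, x1 = int(xs.min()), int(xs.max())
    y = torch.randint(y0, y1 + 1, (n_rays,))
    x = torch.randint(x0, x1 + 1, (n_rays,))
    return torch.stack([y, x], dim=-1)  # (n_rays, 2) pixel coordinates
```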
FYI, this is a general problem for NeRF, which could be due to the ReLU applied to the sigma output (once sigma goes <= 0, that point gets zero gradient). More recently, people have started to use Softplus on sigma, which seems to remove the issue.
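A quick sketch of the difference (the tensor here is just a stand-in for the MLP's raw density output):

```python
import torch
import torch.nn.functional as F

raw_sigma = torch.randn(1024, requires_grad=True)  # stand-in for the MLP's density output

# ReLU: once raw_sigma <= 0 the gradient there is exactly 0, so the
# density at that point can never recover -- it stays "dead".
sigma_relu = F.relu(raw_sigma)

# Softplus is smooth and its gradient is nonzero everywhere, so a
# point whose density was pushed negative can still be pulled back.
sigma_softplus = F.softplus(raw_sigma)
```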
> After training for 50k iterations, the model is able to predict reasonable views from the 8 input poses, but it is very noisy when rendered from azimuths that are not part of the training data
We definitely see this kind of effect, where the views closest to the input view are much clearer. However, I am not really sure what you're doing here: training on 6 objects probably isn't enough for generalizing, and you shouldn't be using all views as input poses simultaneously (then it would be trivial for the network to learn to reconstruct them). I'm not sure if that's what you're doing, but if so, I would recommend using 1-3 views as input at a time and the rest as target views, as in the sketch below.
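A minimal sketch of what such a per-batch source/target split could look like (`split_views` and the shapes are my own assumptions):

```python
import torch

def split_views(images, poses, n_source=2):
    # images: (V, H, W, 3), poses: (V, 4, 4); names are illustrative.
    # Randomly choose a few views to condition the encoder on, and
    # supervise the rendering only on the remaining (target) views.
    perm = torch.randperm(images.shape[0])
    src, tgt = perm[:n_source], perm[n_source:]
    return (images[src], poses[src]), (images[tgt], poses[tgt])
```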
@sxyu What is this bounding box sampling?
Hi @sxyu,
I was really amazed by this work and decided to implement my own version, drawing inspiration from pixel-nerf and keeping NeRF as the base. After a lot of iterations and experiments, I've reached a level where the model is starting to show some reasonable results on training data. But the results only look good from the input poses that the model was trained on, and it doesn't learn the 3D geometry.
I'm training a model from scratch using 6 objects, with 8 views each: [0, 45, 90, 135, 180, 225, 270, 315] degrees azimuth and 0 elevation. After training for 50k iterations, the model is able to predict reasonable views from the 8 input poses, but it is very noisy when rendered from azimuths that are not part of the training data. And when I pass in a test image as input, the model simply produces a noisy combination of all 6 training objects, i.e. it is not generalizing at all. Can you kindly help me resolve these consistency and generalizability issues?
Thanks!
PS: Earlier I was stuck where the model used to produce all-black results. I resolved this by reducing the learning rate and by modifying the MSE so that it gives less weight to background (black) pixels in the ground-truth image. I hope this doesn't cause any issues with training and generalizing.
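For concreteness, such a down-weighted (rather than fully masked) MSE might look like this; `weighted_mse` and `bg_weight` are illustrative names, not the exact code used:

```python
import torch

def weighted_mse(pred_rgb, target_rgb, bg_weight=0.1):
    # Down-weight (rather than fully mask) pixels that are black in
    # the ground truth, so the background still gets some supervision.
    is_bg = (target_rgb == 0).all(dim=-1, keepdim=True).float()
    w = 1.0 - (1.0 - bg_weight) * is_bg  # 1 on the object, bg_weight on background
    return (w * (pred_rgb - target_rgb) ** 2).mean()
```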