shawnsya closed this issue 1 year ago.
Hi. Could you provide more details about the scene you used? We have run our code twice for each scene but did not observe any errors. Did you use your own custom data?
I also ran into this issue. It seems that the problem only occurs on mipnerf360; it works fine on the other models. I'm using data downloaded from http://storage.googleapis.com/gresearch/refraw360/360_v2.zip
@cococolorful Oh, we should fix this if the error really exists. Specifically, could you tell me the names of the scenes you ran? We first need to reproduce the bug in order to find where the problem originates.
@jeongyw12382 Thanks for your reply! In fact, for every scene in the "nerf_360_v2" dataset I hit the NaN problem at the very beginning of training. :sob: The same data runs fine not only on the other models in NeRF-Factory, but also on multinerf. The figure below is a screenshot from training the garden scene; the other scenes behave the same way. Since I only have a 3090 graphics card, I adjusted the batch_size to 1024.
I can provide my environment configuration if that would be useful.
I have also been trying to set up the environment directly with `conda env create --file nerf_factory.yml`, but it has not succeeded on my end due to network issues. :dizzy_face: I'll keep trying.
Finally, thank you for your excellent work! The clear code structure helps me understand the paper. Thank you for your contributions! :cupid:
Hi, I also ran into the "loss/psnr" NaN issue when training mipnerf360 on 360_v2. I'm using a 3070 Ti with 8GB of memory.
https://github.com/kakaobrain/NeRF-Factory/blob/main/src/model/mipnerf360/model.py#L126
raw_density = self.density_layer(x)[..., 0]
The raw_density values returned from the MLP are NaN:
"
MLP predict_density raw_density=tensor([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0',
grad_fn=
During training, nvidia-smi reports 5152MiB / 8192MiB of GPU memory in use and about 86% GPU utilization (GPU 0, 54C, 208W / 290W).
If I reduce the training batch size to 512, training runs successfully.
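For anyone else debugging this, here is a minimal sketch (not part of NeRF-Factory; the helper below is purely illustrative) of how one might catch the first layer that emits NaNs, such as the density layer reported above, using standard PyTorch forward hooks:

```python
import torch


def register_nan_check(model):
    """Attach forward hooks that raise as soon as any module emits NaN/Inf.

    Purely a debugging aid: it flags the first layer whose output is
    non-finite, which helps tell an exploding MLP apart from a bad
    loss/optimizer step.
    """
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(f"Non-finite output detected in module '{name}'")
        return hook

    handles = []
    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call h.remove() on each handle to detach the hooks later
```

Attaching these hooks to the MipNeRF360 model before training makes the traceback point at the first module whose output goes non-finite, which narrows down where the NaNs start.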
Hmm, in my opinion, if this issue is also observed in the original multinerf implementation, it might indicate that MipNeRF360 requires a large batch size to train successfully, or that MipNeRF360 is sensitive to hyperparameters such as the batch size. Since our implementation is a re-implementation based on the original, we cannot address this issue ourselves. According to the reply by @zongwave, adjusting the hyperparameters resolves it.
Let us know if you need more help with this issue. If so, please re-open it.
Hi! I adjusted the batch_size in all config files to 512 and it solved the loss=NaN problem.
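For reference, the workaround boils down to a single gin entry per config file (a sketch only; 512 is the value reported to work above, but other values may also be viable depending on GPU memory):

```
LitData.batch_size = 512
```

Compared to the default of 4096 this cuts the number of rays per optimization step by 8x, so training may need correspondingly more steps to reach the same quality.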
While training MipNeRF360 on the nerf_360_v2 dataset, the loss turned out to be NaN. Config as follows:
# 360-v2 Specific Arguments
run.dataset_name = "nerf_360_v2"
run.datadir = "data/nerf_360_v2"

LitData.batch_sampler = "all_images"

# MipNeRF Standard Specific Arguments
run.model_name = "mipnerf360"
run.max_steps = 1000000
run.log_every_n_steps = 100

LitData.load_radii = True
LitData.batch_size = 4096
LitData.chunk = 4096
LitData.use_pixel_centers = True
LitData.epoch_size = 250000

LitDataNeRF360V2.near = 0.1
LitDataNeRF360V2.far = 1e6

MipNeRF360.opaque_background = True

run.grad_max_norm = 0.001