nerfstudio-project / nerfstudio

A collaboration friendly studio for NeRFs
https://docs.nerf.studio
Apache License 2.0

Viewer crashes when using instant-ngp-bounded method #1835

Closed salykova closed 1 year ago

salykova commented 1 year ago

Describe the bug
I am using the latest version of nerfstudio (built from source). During training with instant-ngp-bounded, VRAM consumption is normal (around 4GB). However, if I open the viewer, more than 24GB of VRAM is needed; I then get a CUDA out of memory error and the viewer crashes, but training continues.

To Reproduce
Steps to reproduce the behavior:

  1. ns-train instant-ngp-bounded --data data/nerfstudio/poster/
  2. Open the viewer

Screenshots
[screenshot attached]

tancik commented 1 year ago

@liruilong940607 is this related to the recent nerfacc bump to 0.5.2?

salykova commented 1 year ago

Hi @tancik @liruilong940607

I also noticed that with the instant-ngp method (not instant-ngp-bounded), training requires around 5GB of VRAM, but when the viewer is opened, allocated VRAM suddenly jumps to 15-17GB. I would assume all nerfacc-based methods have this problem.

By the way, the instant-ngp-bounded method uses all images for training and the same images for evaluation, whereas nerfacto splits the images into train and eval sets. Should instant-ngp-bounded do the same?

liruilong940607 commented 1 year ago

Ah, I looked into it, and this OOM is because of the cone_angle=0.004 -> cone_angle=0.0 change in PR #1809 for instant-ngp-bounded. This leads to many more samples per ray for instant-ngp-bounded than before.
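
Roughly speaking, with cone_angle > 0 the marching step size grows with distance from the camera, while cone_angle = 0 keeps it constant, so far-away geometry gets far more samples. Here is a minimal toy sketch of that effect (not nerfstudio/nerfacc code; the near/far bounds and minimum step size are made-up illustrative values):

```python
# Toy estimate (not nerfstudio/nerfacc code) of samples per ray when the step
# size grows with distance from the camera: dt = max(t * cone_angle, dt_min).
def samples_per_ray(near: float, far: float, cone_angle: float, dt_min: float = 1e-3) -> int:
    t, n = near, 0
    while t < far:
        t += max(t * cone_angle, dt_min)
        n += 1
    return n

# Illustrative near/far and minimum step size only; real values depend on the scene.
print(samples_per_ray(0.05, 10.0, cone_angle=0.004))  # on the order of 1,000 samples
print(samples_per_ray(0.05, 10.0, cone_angle=0.0))    # on the order of 10,000 samples (uniform steps)
```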

The reason for changing this default was to match how the NGP paper trains the bounded nerf-synthetic data.

For the instant-ngp model, I reverted to the pre-#1809 behavior and it is the same (15GB with the viewer open).

But note that the ultimate reason for such high GPU consumption with NGP-based methods is that they densely sample along each ray (~1000 or more samples per ray) and gradually reduce the number of samples by pruning empty space during training. So in the early stage of training, rendering is slow and memory intensive because of the large number of samples, but as training progresses the situation gets much better.
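
As a rough, purely illustrative back-of-envelope (all numbers below are assumptions, not measurements from nerfstudio), peak activation memory scales with the number of rays per chunk times the number of samples per ray:

```python
# Rough back-of-envelope (all numbers are assumptions, not measurements):
# peak activation memory ~ rays_per_chunk * samples_per_ray * bytes_per_sample.
rays_per_chunk = 32768        # hypothetical viewer chunk size
samples_early = 1000          # dense sampling before the occupancy grid is pruned
samples_late = 50             # after training has carved out free space
bytes_per_sample = 4 * 64     # assume ~64 float32 intermediate values per sample

def gib(rays: int, samples: int) -> float:
    return rays * samples * bytes_per_sample / 2**30

print(f"early in training: ~{gib(rays_per_chunk, samples_early):.1f} GiB per chunk")
print(f"later in training: ~{gib(rays_per_chunk, samples_late):.1f} GiB per chunk")
```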

That being said, ideally we should have a control of dynamic number of rays for viewer, just like what we have for training (keep the number of samples roughly constant). But I'm not sure if it worth to implement that just to support NGP methods. A easy fix would be just to reduce viewer.num_rays_per_chunk.
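
For example, something like the following (the flag spelling is assumed from nerfstudio's auto-generated CLI and the value 4096 is just illustrative; check ns-train instant-ngp-bounded --help for the exact name):

  ns-train instant-ngp-bounded --data data/nerfstudio/poster/ --viewer.num-rays-per-chunk 4096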

BTW, if you want to try out the NGP method, I would recommend using the instant-ngp model instead of instant-ngp-bounded on real scenes like poster, even though they might be bounded (the bounding box you get for real scenes is usually neither accurate nor tight, so you don't want to rely on it too much). Also, instant-ngp by default sets the number of multi-resolution levels of the occupancy grid to 4, while instant-ngp-bounded sets it to 1, assuming we pay "uniform attention" to everywhere inside the scene bbox.
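
For intuition on those levels, here is a small sketch assuming the nerfacc-style convention that each additional occupancy-grid level doubles the side length of the covered box (the base AABB below is made up), which is why 4 levels let instant-ngp look beyond the base bounding box:

```python
import numpy as np

# Sketch of the multi-resolution occupancy-grid convention assumed here:
# each extra level doubles the side length of the base axis-aligned box.
base_aabb = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])  # [min_xyz, max_xyz], made up
center = (base_aabb[:3] + base_aabb[3:]) / 2
half = (base_aabb[3:] - base_aabb[:3]) / 2

def level_aabb(level: int) -> np.ndarray:
    """Box covered by grid level `level` (level 0 is the base box)."""
    scaled = half * (2 ** level)
    return np.concatenate([center - scaled, center + scaled])

for lvl in range(4):  # instant-ngp uses 4 levels; instant-ngp-bounded uses 1
    print(lvl, level_aabb(lvl))
```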

salykova commented 1 year ago

@liruilong940607

Thanks for the explanation! I can confirm that instant-ngp uses less memory after the occupancy grid is optimized. For example, after 5k steps memory consumption is 11GB instead of the 15GB at the beginning. But I observe strange behavior: if I open the viewer after 5k steps and don't move the camera, memory consumption is 11GB. As soon as I change the camera position (translate or rotate in the viewer), memory consumption jumps to 18GB. Do you know why that is?

liruilong940607 commented 1 year ago

As we discussed, the GPU memory that NGP-based methods consume is not a constant value. It depends not only on the training status but also on the viewpoint, because ultimately it is proportional to the number of samples you have to evaluate as rays travel through the scene. Changing the viewpoint changes the paths the rays traverse, which leads to a different number of samples depending on the occupancy of the scene.

For example, if you move the camera behind a wall, everything there is marked occupied (because that region was never observed during training, so it never got trained or pruned), and NGP will give you dense samples.
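
To make that concrete, here is a toy 1-D illustration (not nerfacc code; the cell layout and samples_per_cell value are arbitrary) of how occupancy-based skipping makes the per-ray sample count viewpoint dependent:

```python
import numpy as np

# Toy 1-D illustration (not nerfacc code): the marcher keeps samples only in
# cells marked occupied, so the per-ray sample count depends on what the ray hits.
def count_samples(occupancy: np.ndarray, samples_per_cell: int = 8) -> int:
    return int(occupancy.sum()) * samples_per_cell

trained_region = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=bool)  # mostly pruned
unobserved_region = np.ones(10, dtype=bool)  # never observed -> never pruned

print(count_samples(trained_region))     # 16 samples along this ray
print(count_samples(unobserved_region))  # 80 samples: the "behind the wall" case
```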

salykova commented 1 year ago

@liruilong940607

That was also my assumption, but I was a little confused by such a huge memory jump (from 11GB to 18GB) due only to the viewpoint.

@tancik

I will close the issue if there is nothing further to discuss.

liruilong940607 commented 1 year ago

I opened a PR to reduce the default viewer.num_rays_per_chunk, so GPU consumption at the beginning should be roughly 11GB.

salykova commented 1 year ago

@liruilong940607 Thanks!