Slow "traverse_grids" and "skip invisible points"

JokerYan commented 1 year ago

I have been running the example program and testing its speed. However, I found that "traverse_grids" and "skip invisible points" in occ_grid.py are much slower than I expected. They take around 10x the time of the model forward pass.

More specifically, the traverse_grids cuda function is slow, and the masking operation at the end of the skipping invisible points is slow.

It would be nice if they can be improved.

Thanks

liruilong940607 commented 1 year ago

Hi, traverse_grids should take around 30%-40% of the time in the pipeline. The reason you find the masking operation slow might be that you didn't turn on synchronization when profiling the code (see here for explanations).

In any case, I'm happy to take any PR or implementation ideas that can make things faster!! The current code is reasonably well optimized but there might still be space for improvement!

JokerYan commented 1 year ago

Thank you very much for the reply. I think the issue mentioned here is very much aligned with my observation. The masking without indices does take a lot of time because of the synchronization. Do you know any possible mitigations to this issue?

liruilong940607 commented 1 year ago

If you profile with CUDA_LAUNCH_BLOCKING=1, you will see that this masking operation is fairly cheap and does not take much of the entire pipeline's time.

The masking operation does require synchronization, and I think in this case synchronization is inevitable (another implementation is to use torch.where, but that also syncs between CPU and GPU).

The reason you are seeing “slowness” on this line is that PyTorch executes GPU code asynchronously with respect to the CPU and only syncs when necessary. So when the mask operation is executed, it waits for all the previous code to finish, including the network. That’s why you might see this line taking a long time. The actual computational burden, however, does not come from this line of code.
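To make the misattribution above concrete, here is a minimal toy sketch in plain Python. It does not use PyTorch or a GPU: `FakeDevice`, `launch`, and `profile` are hypothetical names invented for this illustration, and a worker thread with `time.sleep` stands in for asynchronous kernel execution. It only shows why a cheap operation that forces a sync can appear to take as long as everything queued before it.

```python
import queue
import threading
import time


class FakeDevice:
    """Toy stand-in for a GPU stream: work is queued and run by a worker thread."""

    def __init__(self):
        self._queue = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def _run(self):
        while True:
            seconds = self._queue.get()
            time.sleep(seconds)  # pretend to execute a kernel
            self._queue.task_done()

    def launch(self, seconds):
        """Enqueue a kernel and return immediately (asynchronous launch)."""
        self._queue.put(seconds)

    def synchronize(self):
        """Block the caller until every queued kernel has finished."""
        self._queue.join()


def profile(sync_before_mask):
    """Time a slow 'forward' followed by a cheap 'mask' that forces a sync."""
    device = FakeDevice()

    t0 = time.perf_counter()
    device.launch(0.20)  # expensive "network forward"
    if sync_before_mask:
        device.synchronize()  # charge the forward time to the forward step
    t1 = time.perf_counter()

    device.launch(0.01)  # cheap "mask" kernel
    device.synchronize()  # the mask result is needed on the host
    t2 = time.perf_counter()

    return {"forward": t1 - t0, "mask": t2 - t1}


naive = profile(sync_before_mask=False)
synced = profile(sync_before_mask=True)
print(f"no sync: forward={naive['forward']:.3f}s mask={naive['mask']:.3f}s")
print(f"synced : forward={synced['forward']:.3f}s mask={synced['mask']:.3f}s")
```

Without the extra sync, nearly all of the elapsed time lands on the "mask" step even though the "forward" step did the work. With a sync placed before the mask, the time is attributed to the step that actually spent it.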

liruilong940607 commented 1 year ago

The suggestion on our website is that you should avoid synchronization operations (e.g. masking) whenever possible, keeping things running asynchronously as much as you can.

Still, sometimes a sync is inevitable, namely when the output depends on all elements of the input.

JokerYan commented 1 year ago

Thank you very much for the detailed explanation. It does explain the weird timings I have observed.

Good work on the nerfacc repo. I have been trying to optimize the traverse grid and filter invisible part, but got nowhere. Just want to share some observations I had. I have been running nerfacc and a plenoxel based model on the same set of real world scenes, but the plenoxel based method has much faster sampling speed (around 10x) for the same number of rays during inference. Considering that the sampling strategy of nerfacc is also voxel based, it might be possible to incorporate a similar sampling kernel into nerfacc.

liruilong940607 commented 1 year ago

What underlying radiance field are you using when running nerfacc? Nerfacc’s sampling involves evaluating the radiance field, so the runtime depends heavily on that. I’m curious about the setting in which you compare the two.

Also, Plenoxels models the background as an MSI, which might be faster to query than a multi-resolution grid.

JokerYan commented 1 year ago

I am using the NGP version of nerfacc. I was timing traverse_grid and skip_invisible separately. If my understanding is correct, traverse_grid does not evaluate the radiance field, so its runtime should depend only on the sampling itself.

As for the multi-resolution grid, I am not sure whether the difference would be so significant.

liruilong940607 commented 1 year ago

And which is the “equivalent” function you are comparing to in Plenoxel’s codebase?

Also, some details would be very helpful, such as the parameters you use for the traverse_grid function and the parameters you use for the equivalent function in Plenoxel.

If you want, it would be best to dump those parameters and share them with me (including nerfacc’s occ grid and Plenoxel’s sparse grid, as well as all other arguments to the function).

I’m happy to look into it if you can provide a minimal code snippet to reproduce this comparison.

JokerYan commented 1 year ago

Sorry, I am busy with the NeurIPS deadline right now. I will see whether I can provide you with the test cases I used after the deadline.

However, I am not 100% sure whether my timer is capturing the actual runtime, due to various issues. It is also difficult to get exactly the same environment for the two codebases. I hope that I am not wasting your time.

JokerYan commented 1 year ago

I have uploaded the parameters used for sampling in NerfAcc and Plenoxels here.

For NerfAcc, I am calling traverse_grids with all the parameters provided in the pkl. For Plenoxel, the sampling and rendering are bundled together in the CUDA file, so it is difficult for me to dump the sampling parameters separately. Instead, I just call grid.volume_render_image(cam, use_kernel=True) using the grid and cam in the pkl. The sampling + rendering is still much faster than NerfAcc sampling alone. The parameters were dumped in the middle of an experiment on a real-world, multi-view, forward-facing scene from the MeetRoom dataset in this paper. It is quite difficult to do a controlled experiment that times the two fairly, but I think the big speed difference can still tell us something. Both are tested with grad turned on.

As you suggested previously, I call torch.cuda.synchronize() before and after the target function call. The times I get are as follows:

NerfAcc: 1.74 seconds per 1M rays sampled
Plenoxel: 0.0814 seconds per 1M rays sampled+rendered
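For reference, the timing procedure described above (synchronize, start the clock, call the function, synchronize, stop the clock) can be sketched as below. `time_synced` is a hypothetical helper, not part of NerfAcc or Plenoxels. The `sync` argument is an assumption of this sketch: on a CUDA setup one would pass `torch.cuda.synchronize`, while here it is left optional so the snippet runs without a GPU. The last lines only redo the arithmetic on the two figures reported above.

```python
import time


def time_synced(fn, *args, sync=None, repeats=3, **kwargs):
    """Return the best wall-clock time of fn(*args, **kwargs) over `repeats` runs.

    `sync` is an optional zero-argument callable invoked right before the clock
    starts and right before it stops (e.g. torch.cuda.synchronize on a GPU), so
    that queued asynchronous work is not charged to the wrong call.
    """
    best = float("inf")
    for _ in range(repeats):
        if sync is not None:
            sync()  # drain work queued before the measured call
        start = time.perf_counter()
        fn(*args, **kwargs)
        if sync is not None:
            sync()  # wait for the measured call's own work to finish
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical usage on a GPU machine (not executed here):
#   elapsed = time_synced(traverse_grids, *params, sync=torch.cuda.synchronize)

# Arithmetic on the figures reported in this thread (seconds per 1M rays).
nerfacc_s = 1.74      # NerfAcc: sampling only
plenoxel_s = 0.0814   # Plenoxel: sampling + rendering
ratio = nerfacc_s / plenoxel_s
nerfacc_rays_per_s = 1e6 / nerfacc_s
plenoxel_rays_per_s = 1e6 / plenoxel_s

print(f"ratio: {ratio:.1f}x")
print(f"NerfAcc : {nerfacc_rays_per_s:,.0f} rays/s")
print(f"Plenoxel: {plenoxel_rays_per_s:,.0f} rays/s")
```

Taking the best of a few repeats also reduces the influence of one-off warm-up costs. Note that the two reported figures do not measure the same work (sampling only versus sampling plus rendering), so the ratio is only a rough indication.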

Thank you for the hard work on NerfAcc, and please let me know if there is anything I can help.

nerfstudio-project / nerfacc

Slow "traverse_grids" and "skip invisible points" #212