nerfstudio-project / nerfacc

A General NeRF Acceleration Toolbox in PyTorch.
https://www.nerfacc.com/

CUDA error: an illegal memory access was encountered #125

Closed: thomasw21 closed this issue 1 year ago

thomasw21 commented 1 year ago

Hi! Thanks for the great library. I've been playing around with it for some time and it's been a blast!

Recently I've started encountering weird CUDA errors:

Traceback (most recent call last):
  File "/blablabla/my_project/train.py", line 1420, in <module>
    main()
  File "/blablabla/my_project/train.py", line 1145, in main
    images, opacities, density_origin, rays = render_images(
  File "/blablabla/my_project/train.py", line 665, in render_images
    ray_indices = unpack_info(packed_info=packed_info, n_samples=t_starts.shape[0])
  File "/blablabla/my_envs/text23d/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/thomaswang/code/text23d/nerfacc/nerfacc/pack.py", line 121, in unpack_info
    return ray_indices.long()
RuntimeError: CUDA error: an illegal memory access was encountered

I think the actual error comes from accessing values in packed_info (I tried printing that tensor and the same error happened). One of the things I do just before is use the nerfacc.ray_resampling method. I'm wondering if there's somehow a pointer that would potentially point somewhere it shouldn't. I tried reading https://github.com/KAIR-BAIR/nerfacc/blob/0e1e7cb7c6f8e7e77073bc88f1333fddc7474c18/nerfacc/cuda/csrc/cdf.cu but it's quite hard to read anything there without having a good understanding of what is being implemented exactly.
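A note on reading the traceback above: CUDA kernels are launched asynchronously, so the Python line that raises is often just the first call that synchronized with the GPU, not the kernel that actually faulted. A minimal sketch for getting a more trustworthy traceback, assuming the script is started from Python and a slower run is acceptable while debugging:

```python
import os

# CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous, so an illegal
# memory access is reported at the call that caused it.
# It must be set before CUDA is initialized; setting it before importing torch
# is the safe option.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Import torch (and nerfacc) only after this point, then run training as usual.
```

The same effect can be had from the shell by prefixing the launch command with `CUDA_LAUNCH_BLOCKING=1`. Calling `torch.cuda.synchronize()` right after a suspect call is a lighter-weight way to narrow things down.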

I don't have a script that reproduces this; it happens randomly. But after removing the resampling step, things seem to work fine.

I have commit 0e1e7cb7c6f8e7e77073bc88f1333fddc7474c18 installed.

thomasw21 commented 1 year ago

Hmm, actually, while debugging I found that some of the weights returned by nerfacc.render_weight_from_density are NaN.

liruilong940607 commented 1 year ago

Hi Thomas,

Yeah, if there are NaNs in the weights, then nerfacc.ray_resampling will fail. Good point: I should catch that and raise an error in nerfacc.resampling.
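Until such a check exists in the library, the same guard can live in user code. A minimal sketch, assuming the weights are an ordinary PyTorch tensor; `assert_finite` is a hypothetical helper, not part of nerfacc:

```python
import torch


def assert_finite(name: str, tensor: torch.Tensor) -> torch.Tensor:
    """Raise ValueError if `tensor` contains NaN or +/-inf; otherwise return it unchanged."""
    bad = ~torch.isfinite(tensor)
    if bad.any():  # on a CUDA tensor this forces a device-to-host sync
        n_bad = int(bad.sum())
        raise ValueError(
            f"{name} contains {n_bad} non-finite value(s) out of {tensor.numel()}"
        )
    return tensor


# Example with stand-in values; in a real pipeline this would wrap the weights
# right before they are handed to the resampling call.
weights = torch.tensor([0.2, 0.5, 0.3])
weights = assert_finite("weights", weights)
```

The check costs a GPU synchronization per call, so it is probably best kept behind a debug flag rather than left on for every training step.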

Did you check the density being passed to render_weight_from_density? Is it the case that there are no NaNs in the density but there are NaNs in the weights? In my experience, tiny-cuda-nn can sometimes produce unreliable gradients that mess up the network.
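One way to answer the density-versus-weights question is to recompute the weights for a single ray in plain PyTorch and compare. The sketch below uses the textbook NeRF quadrature, w_i = T_i * (1 - exp(-sigma_i * delta_i)) with T_i = exp(-sum over j < i of sigma_j * delta_j). It is not nerfacc's CUDA kernel, it ignores the packed multi-ray layout, and `reference_weights` is a hypothetical name:

```python
import torch


def reference_weights(
    t_starts: torch.Tensor, t_ends: torch.Tensor, sigmas: torch.Tensor
) -> torch.Tensor:
    """Standard NeRF quadrature weights for the samples of ONE ray.

    All inputs are 1-D tensors of equal length, ordered front to back.
    """
    tau = sigmas * (t_ends - t_starts)   # optical depth of each interval
    alphas = 1.0 - torch.exp(-tau)       # opacity of each interval
    # Exclusive cumulative sum: optical depth accumulated BEFORE each sample.
    tau_before = torch.cat([tau.new_zeros(1), torch.cumsum(tau, dim=0)[:-1]])
    trans = torch.exp(-tau_before)       # transmittance reaching each sample
    return trans * alphas


# Stand-in values for one ray with three samples.
t_starts = torch.tensor([0.0, 1.0, 2.0])
t_ends = torch.tensor([1.0, 2.0, 3.0])
sigmas = torch.tensor([0.5, 1.0, 2.0])
weights = reference_weights(t_starts, t_ends, sigmas)
print(weights, torch.isfinite(weights).all())
```

If the reference weights are finite for a ray whose library weights are not, the density is probably not the culprit; if both contain NaN, the problem most likely sits upstream in the density itself.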

thomasw21 commented 1 year ago

Hmm, so I just checked, and it seems you're right: tiny-cuda-nn is returning NaN even though none of the model weights or inputs are NaN. I'll close this issue, as it doesn't seem to be related to this repo.
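For anyone repeating this check, a minimal sketch of how the parameters, inputs, and outputs can be inspected separately. It assumes the network is exposed as a regular `torch.nn.Module`; `nonfinite_report` is a hypothetical helper, and `nn.Linear` is only a stand-in for the real model:

```python
import torch
from torch import nn


def nonfinite_report(model: nn.Module) -> list[str]:
    """Return the names of parameters whose values or gradients contain NaN/inf."""
    bad = []
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            bad.append(f"{name} (value)")
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(f"{name} (grad)")
    return bad


model = nn.Linear(3, 2)  # stand-in for the real network
x = torch.randn(4, 3)    # stand-in for the real inputs
out = model(x)

print("bad parameters:", nonfinite_report(model))
print("inputs finite:", bool(torch.isfinite(x).all()))
print("outputs finite:", bool(torch.isfinite(out).all()))
```

Finite parameters and inputs combined with non-finite outputs, as reported above, point at the forward pass of the network rather than at the code that consumes its output.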