graphdeco-inria / gaussian-splatting

Original reference implementation of "3D Gaussian Splatting for Real-Time Radiance Field Rendering"
https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

CUDA error: an illegal memory access was encountered #781

Closed: baoachun closed this issue 1 month ago

baoachun commented 4 months ago

I have identified that the identifyTileRanges function is causing the issue, but I'm not quite sure how to resolve it. Do you have any constructive suggestions? https://github.com/graphdeco-inria/diff-gaussian-rasterization/blob/59f5f77e3ddbac3ed9db93ec2cfe99ed6c5d121d/cuda_rasterizer/rasterizer_impl.cu#L116

Oddly, I didn't encounter any issues when I directly loaded the saved input parameters for rendering. dump_file.zip

pytorch 1.13.1
cuda 11.4
A100 SXM4 80G

After several days of debugging, I found that the concatenation of keys might have caused anomalies, leading to memory errors. By storing them separately instead of concatenating, this issue was resolved. However, another exception occurred in the FORWARD::render function, and the problem couldn't be reproduced with the parameters saved in debug mode.

PanagiotisP commented 4 months ago

Does it happen on the first iteration or later on? Can you check whether you have nans on any tensors fed to the rasterisation function?

baoachun commented 4 months ago

@PanagiotisP Yes, the issue tends to occur after iterating hundreds of times, but sometimes it may take thousands of iterations to appear. I have checked the input parameters and have not found any NaN values. However, I noticed that the error occurs because currtile or prevtile exceeds the length of ranges, resulting in a memory out-of-bounds issue. At this point, currtile or prevtile are very large strange numbers, such as prevtile=3210786815 and currtile=1072937470. Do you have any insights on this?
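For readers following along, the tile index discussed here is read back out of a 64-bit sort key. The snippet below is a minimal Python sketch of that idea, assuming the key layout is "tile ID in the upper 32 bits, float32 depth bits in the lower 32 bits"; the helper names and the grid size are hypothetical and only for illustration. It shows why a key holding arbitrary bits yields a tile index far outside the ranges array.

```python
import struct

def pack_key(tile_id: int, depth: float) -> int:
    """Pack a tile ID (upper 32 bits) and a float32 depth (lower 32 bits) into one key."""
    depth_bits = struct.unpack("<I", struct.pack("<f", depth))[0]
    return ((tile_id & 0xFFFFFFFF) << 32) | depth_bits

def unpack_key(key: int):
    """Recover (tile_id, depth) from a packed key."""
    tile_id = (key >> 32) & 0xFFFFFFFF
    depth = struct.unpack("<f", struct.pack("<I", key & 0xFFFFFFFF))[0]
    return tile_id, depth

def tile_in_range(key: int, num_tiles: int) -> bool:
    """True if the tile ID stored in the key is a valid index into the ranges array."""
    return (key >> 32) < num_tiles

NUM_TILES = 62 * 35  # hypothetical tile grid, not taken from the issue

valid_key = pack_key(tile_id=17, depth=2.5)
# A key whose upper bits are arbitrary; here they match the prevtile value reported above.
garbage_key = (3210786815 << 32) | 0x12345678

print(unpack_key(valid_key))
print(tile_in_range(valid_key, NUM_TILES), tile_in_range(garbage_key, NUM_TILES))
```

A bounds check like tile_in_range only detects the bad key; it does not explain where the bad bits came from.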

PanagiotisP commented 4 months ago

No, I'm sorry. I'm not very familiar with the tiling procedure, so NaN was my best bet. In your place, I think, I would also check if extremely big scale values appear for some reason (e.g. due to a regularisation term gone bad). But other than that, your guess is as good as mine

Devlee247 commented 4 months ago

@baoachun Did you solve this problem? I am also getting very large, strange values for currtile and prevtile.

baoachun commented 4 months ago

@Devlee247 Yes, I changed the concatenation of ID and depth so that they are stored separately, and that bug is fixed. However, I still encounter errors in the FORWARD::render function, so I suspect other issues there remain unresolved.

Devlee247 commented 4 months ago

@baoachun Thank you for sharing. I also fixed it by adding torch.cuda.empty_cache() in the forward function (GaussianSplatting PyTorch class).

baoachun commented 4 months ago

@Devlee247 Thank you for sharing. Unfortunately, this method does not solve my problem.

huahangc commented 3 months ago

@PanagiotisP How can I solve the problem where the attributes of the Gaussian primitives become NaN after training for thousands of iterations? I checked for NaN values and set them all to 0, but in the remaining iterations every loss becomes NaN.

PanagiotisP commented 3 months ago

I am not sure I can help with that, as NaN propagates instantly to everything. Usually, you want to ensure that you don't perform any obviously illegal operations, like dividing by zero, taking the log of a non-positive number, etc.
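As a small illustration of both points, here is a plain-Python sketch (not code from the repository). It shows that a NaN survives any later arithmetic, and that clamping with a small epsilon keeps the two operations named above well defined. The helper names and the EPS value are assumptions chosen for the example.

```python
import math

EPS = 1e-8  # hypothetical clamp value; pick one suited to your data

def safe_div(num: float, den: float, eps: float = EPS) -> float:
    """Divide while keeping the denominator's magnitude at least eps."""
    if abs(den) < eps:
        den = eps if den >= 0 else -eps
    return num / den

def safe_log(x: float, eps: float = EPS) -> float:
    """Take the log of x after clamping it to at least eps."""
    return math.log(max(x, eps))

nan = float("inf") - float("inf")  # inf - inf is NaN
poisoned = nan * 0.0 + 1.0         # multiplying by zero does not remove the NaN

print(math.isnan(poisoned))
print(safe_div(1.0, 0.0), safe_log(0.0))
```

Clamping changes the result near the singularity, so whether it is acceptable depends on the loss term in question.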

RaymondJiangkw commented 2 months ago

Perhaps check my_radius. Replacing the float my_radius = ceil(...) with int my_radius = max(ceil(...), 1) might help.

manurare commented 2 months ago

@baoachun I am having the same problem. I found that my input tensors contain no NaN values, but they do contain Inf values, and this is what is causing the error. Do you know whether your input tensors had Inf values?

EDIT:

I did some debugging on this. Inf values (the scales, in my case) produce a NaN radius. However, Gaussians with a NaN radius are still counted as rendered Gaussians (num_rendered), since preprocess has no check for NaNs. Next, uninitialized memory for num_rendered entries is allocated for the binningState object.

When duplicating keys, we iterate over all Gaussians with a positive radius and assign entries in the keys and values arrays, which are arrays of size num_rendered from binningState. Remember that the NaN Gaussians are counted as rendered, but because their radius is not positive, their keys and values are never assigned: those entries keep whatever the uninitialized memory held. This causes the illegal memory access in some cases. It is also why the error occurs randomly: uninitialized memory is unpredictable and can hold any value at any time.
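The mechanism described above can be mimicked with a toy simulation. This is plain Python, not the CUDA kernel. The function names, the GARBAGE sentinel, and the sample values are all hypothetical. The only point is that a NaN radius is counted when the buffer is sized but fails the `radius > 0` test when the buffer is filled.

```python
import math

GARBAGE = 0xDEADBEEF  # stand-in for whatever an uninitialized buffer happens to hold

def simulate_binning(radii, tiles_touched):
    """Size the key buffer from tiles_touched, then write keys only where radius > 0."""
    num_rendered = sum(tiles_touched)
    keys = [GARBAGE] * num_rendered
    offset = 0
    for idx, (radius, n_tiles) in enumerate(zip(radii, tiles_touched)):
        if radius > 0:  # False for NaN, so these slots are skipped
            for t in range(n_tiles):
                keys[offset + t] = idx  # the real code would write a tile/depth key here
        offset += n_tiles
    return keys

def drop_non_finite(radii, tiles_touched):
    """One possible guard: count zero tiles for any non-finite or non-positive radius."""
    return [n if math.isfinite(r) and r > 0 else 0
            for r, n in zip(radii, tiles_touched)]

radii = [3.0, float("nan"), 2.0]
tiles_touched = [2, 4, 1]

keys = simulate_binning(radii, tiles_touched)
unwritten = [i for i, k in enumerate(keys) if k == GARBAGE]
print(unwritten)  # slots that were sized for the NaN Gaussian but never written

guarded = simulate_binning(radii, drop_non_finite(radii, tiles_touched))
print(GARBAGE in guarded)
```

With the guard, the buffer is sized only for Gaussians that will actually write keys, so no slot is left unwritten in this toy model.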

torracxiaokeai commented 1 month ago

I am hitting the same error, but none of the methods in this issue solves my problem. How can I identify the cause of my error?

baoachun commented 1 month ago

@manurare It seems the uninitialized key/value memory you describe could be the reason. I also noticed large strange numbers, possibly because the memory was not explicitly initialized, but I didn't investigate further.

I have been using the gsplat package, which doesn't throw an error when NaN or inf values appear, but the rendering result is incorrect. In such cases, you need to debug and check manually.

For those encountering similar issues in the future, it might be helpful to print out the value of the scale attribute. I suspect that the problem might be caused by the scale attribute having inf values.
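A minimal sketch of such a check is shown below in plain Python, so that it runs without a GPU. The function name and the sample data are hypothetical. With PyTorch tensors, torch.isnan and torch.isinf provide the same elementwise tests.

```python
import math

def report_non_finite(named_values):
    """Return {name: (nan_count, inf_count)} for every entry with a non-finite value."""
    report = {}
    for name, values in named_values.items():
        nan_count = sum(1 for v in values if math.isnan(v))
        inf_count = sum(1 for v in values if math.isinf(v))
        if nan_count or inf_count:
            report[name] = (nan_count, inf_count)
    return report

# Hypothetical flattened parameters; in practice these would come from the model.
params = {
    "scales": [0.1, float("inf"), 0.3],
    "opacities": [0.5, 0.9, 0.2],
}

bad = report_non_finite(params)
print(bad)
```

Running a check like this right before each rasterization call makes it easier to find the first iteration at which a parameter goes bad.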

smart4654154 commented 2 weeks ago

The gsplat package has the same error. Do you know how to solve it? Thank you. https://github.com/nerfstudio-project/gsplat/issues/341