Open HatsuneMiku888 opened 1 year ago

Hi, I experienced
RuntimeError: an illegal memory access was encountered
when training 3D Gaussians on the T&T dataset. It seems to happen during backpropagation. Here is the input of the backward function. The error disappeared when I commented out https://github.com/graphdeco-inria/diff-gaussian-rasterization/blob/main/cuda_rasterizer/backward.cu#L503. I have no idea why this line would cause an illegal memory access.
Hi,
commenting that line as you did will significantly change the math of the gradient computation and should give you very bad results. We are currently at SIGGRAPH, but when we get back we will see what we can find from the .dump you shared.
Thanks for your reply! I know commenting out that line can't be a final solution; it was just to locate where things go wrong. What I mean is that backpropagation passes successfully on the same input without that line.
I have the same problem. There are 3 open issues about this invalid memory access now, and none of them has a working solution... could someone help? Thanks!
Hi @ray8828, if you have this issue, can you post your hardware setup and the .dump from when the crash occurred? Creating the dump file requires running with --debug.
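For readers unfamiliar with what --debug produces: roughly, the rasterizer's Python wrapper keeps a CPU copy of the inputs and saves them to disk when the CUDA call throws. A minimal sketch of that pattern, assuming the layout of the public diff-gaussian-rasterization package (treat the details as an approximation, not the exact implementation):

```python
import torch
from diff_gaussian_rasterization import _C  # compiled CUDA extension

def rasterize_with_debug_dump(args):
    # Keep a CPU copy of all inputs; if the CUDA call raises, save them so
    # the failing case can be shared and replayed offline.
    cpu_args = tuple(a.detach().cpu() if isinstance(a, torch.Tensor) else a
                     for a in args)
    try:
        return _C.rasterize_gaussians(*args)
    except Exception:
        torch.save(cpu_args, "snapshot_fw.dump")
        raise
```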
@HatsuneMiku888 I finally had the time to look at your output. It seems that you are using both Python-computed covariance matrices and colors (--convert_SHs_python and --convert_cov3D_python are active); is there any particular reason for this? We left those paths in for compatibility and experimentation, but they are not heavily tested.
@HatsuneMiku888 I found the line that causes the crash. Unfortunately, I have no explanation:
For some reason, a point ID that is way too high gets into the list of points to render. I don't know how I could debug this without extensive access to the machine it happens on. We could set this up, but it will take a while before I have time to do so. From the dump alone I have no idea how this could occur. Is it reproducible? Does it also happen when the two options I mentioned above are turned off?
Last but not least, also for @ray8828: another user has set up a Colab that seems to successfully run the code base on T&T. This could hopefully reduce issues with local project setups, so maybe this will work out for you: https://github.com/camenduru/gaussian-splatting-colab
I have met the same problem; after commenting out the line mentioned above (https://github.com/graphdeco-inria/diff-gaussian-rasterization/blob/main/cuda_rasterizer/backward.cu#L503), the code works well.
Hi,
please note that this is not a fix; it will completely break the math behind the approach. If you continue to have issues running it, please consider using the Colab linked on the main page.
The point ID 1073280485 is very close to 2^30; maybe there is some numeric overflow?
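A quick sanity check on that observation, in plain Python:

```python
bad_id = 1073280485            # the suspicious point ID from the crash
print(2 ** 30)                 # 1073741824
print(2 ** 30 - bad_id)        # 461339 -- just under 2^30
print(bad_id.bit_length())     # 30 -- the value needs exactly 30 bits
```

The value fits comfortably in a signed 32-bit integer, so a simple int32 wraparound would not by itself explain it; the proximity to 2^30 might instead hint at a corrupted bit pattern, though that is speculation.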
@HatsuneMiku888 how good is your Python? Could you force it to create the snapshot_fw.dump of the forward pass (even though it doesn't fail) for the frame where the backward fails, and forward it to us?
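One way to do that, assuming the wrapper structure sketched earlier, is to save the snapshot unconditionally for the frame of interest instead of only on failure. A sketch, where capture_at is a hypothetical parameter for the iteration just before the backward pass is known to fail:

```python
import torch
from diff_gaussian_rasterization import _C

def rasterize_and_force_dump(args, iteration, capture_at):
    # Save the forward inputs even though the forward pass succeeds, so the
    # snapshot that precedes the failing backward pass ends up on disk.
    if iteration == capture_at:
        cpu_args = tuple(a.detach().cpu() if isinstance(a, torch.Tensor) else a
                         for a in args)
        torch.save(cpu_args, "snapshot_fw.dump")
    return _C.rasterize_gaussians(*args)
```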
Thank you for your reply. I know that commenting out the line is not a fix; I am trying to locate the bug. The error occurs at different iterations when I use different data.
Sure, I will attempt to reproduce this error on the machine where it occurred.
By the way, I now have a new problem: I hit the same illegal memory access error during forward training on another dataset. But the error miraculously disappeared when I executed _C.rasterize_gaussians with snapshot_fw.dump as parameters in a separate script.
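For reference, a standalone replay along these lines (a sketch; it assumes the dump holds the CPU tuple of rasterizer inputs saved by the debug path):

```python
import torch
from diff_gaussian_rasterization import _C

args = torch.load("snapshot_fw.dump")  # CPU tuple saved by the debug path
args = tuple(a.cuda() if isinstance(a, torch.Tensor) else a for a in args)
out = _C.rasterize_gaussians(*args)
torch.cuda.synchronize()  # force any pending asynchronous CUDA error to surface
print("forward replay finished without error")
```

Note that a clean replay does not prove the kernel is correct: CUDA reports errors asynchronously, so an illegal access during training can be charged to a later, unrelated call, and a fresh process has a different memory layout, which can hide an out-of-bounds access that only faults when it crosses an allocation boundary.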
Hello, I have the same error. I would like to know how to debug the CUDA code in gaussian-splatting; I only know how to debug the Python files.
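As a generic starting point for CUDA-side debugging (not specific to this repo): make kernel errors synchronous so the stack trace points at the launch that actually faulted, and let autograd name the failing backward op.

```python
# Set before torch initializes CUDA so every kernel launch is synchronized
# and the error is raised at the launch that actually faulted.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
torch.autograd.set_detect_anomaly(True)  # reports which backward op failed

# Outside Python, NVIDIA's compute-sanitizer can pinpoint the faulting
# kernel and address, e.g.:  compute-sanitizer python train.py -s <data>
```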
Do you know the result? Thank you.