Closed ibrahimrazu closed 1 year ago
Hi, can you run with cuda-memcheck?
On Wed, Jul 12, 2023 at 8:44 PM Md Ibrahim Khalil @.***> wrote:
Hi, Thanks a lot for sharing your great work. I've built the environment properly and while running the tanks and temples dataset, getting following error:
```
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 204, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations)
  File "train.py", line 81, in training
    loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
  File "/media/sdc/merf_research/gaussian_mixture/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
    window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
Any idea how to resolve this? My environment matches your repo exactly.
@grgkopanas hi, you mean running some .cu script with cuda-memcheck? Sorry, I did not understand.
You can run from the command line: `cuda-memcheck python script.py [cli args]`
thanks! After running cuda-memcheck I got this:

```
Traceback (most recent call last):
  File "train.py", line 204, in <module>
```
Based on the logs, it seems like the problem is with the rasterizer.
Is that all the logs? What about the 505 errors? :D
A glimpse of that :)
```
========= Invalid __global__ write of size 8
=========     at 0x000006d0 in duplicateWithKeys(int, float2 const*, float const*, unsigned int const*, unsigned long*, unsigned int*, int*, dim3)
=========     by thread (160,0,0) in block (211,0,0)
=========     Address 0x7f63d1d08820 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2e9441]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.11.0 [0x1433c]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.11.0 (cudaLaunchKernel + 0x1d8) [0x69c38]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/_C.cpython-37m-x86_64-linux-gnu.so [0x3e40c]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/_C.cpython-37m-x86_64-linux-gnu.so (_Z62__device_stub__Z17duplicateWithKeysiPK6float2PKfPKjPmPjPi4dim3iPK6float2PKfPKjPmPjPiR4dim3 + 0x1e9) [0x3bf69]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/_C.cpython-37m-x86_64-linux-gnu.so (_Z17duplicateWithKeysiPK6float2PKfPKjPmPjPi4dim3 + 0x49) [0x3bfcc]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/_C.cpython-37m-x86_64-linux-gnu.so (_ZN14CudaRasterizer10Rasterizer7forwardESt8functionIFPcmEES4_S4_iiiPKfiiS6_S6_S6_S6_S6_fS6_S6_S6_S6_S6_ffbPfPi + 0x519) [0x3b4eb]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/_C.cpython-37m-x86_64-linux-gnu.so (_Z22RasterizeGaussiansCUDARKN2at6TensorES2_S2_S2_S2_S2_fS2_S2_S2_ffiiS2_iS2_b + 0x94d) [0x5e597]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/_C.cpython-37m-x86_64-linux-gnu.so [0x5d245]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/_C.cpython-37m-x86_64-linux-gnu.so [0x57b52]
=========     Host Frame:python (PyCFunction_Call + 0xa0) [0xd5990]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x558e) [0xb882e]
=========     Host Frame:python (_PyFunction_FastCallDict + 0x116) [0xcce46]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/torch/lib/libtorch_python.so (_Z17THPFunction_applyP7_objectS0_ + 0x5d6) [0x696c96]
=========     Host Frame:python (_PyMethodDef_RawFastCallKeywords + 0x1fb) [0xbb4ab]
=========     Host Frame:python [0xbaf40]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x469a) [0xb793a]
=========     Host Frame:python (_PyFunction_FastCallKeywords + 0x106) [0xc61f6]
=========     Host Frame:python [0xbae2f]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x971) [0xb3c11]
=========     Host Frame:python (_PyEval_EvalCodeWithName + 0x201) [0xb2041]
=========     Host Frame:python (_PyFunction_FastCallDict + 0x2d6) [0xcd006]
=========     Host Frame:python [0xd5ec0]
=========     Host Frame:python (PyObject_Call + 0x51) [0xd35a1]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ea8) [0xb5148]
```
Can you share the dataset with us?
Sorry, I just realized you mentioned Tanks and Temples; which scene are you trying?
It's the same T&T dataset, downloaded from here: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/datasets/input/tandt_db.zip
@grgkopanas thanks! I was trying truck.
It's my environment:
```
Python version: 3.7.13 (default, Oct 18 2022, 18:57:03) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.15.2.el7.x86_64-x86_64-with-debian-bookworm-sid
Is CUDA available: True
CUDA runtime version: 11.7.99
GPU models and configuration:
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
GPU 2: Tesla V100-PCIE-32GB
GPU 3: Tesla V100-PCIE-32GB
GPU 4: Tesla V100-PCIE-32GB
GPU 5: Tesla V100-PCIE-32GB
GPU 6: Tesla V100-PCIE-32GB
GPU 7: Tesla V100-PCIE-32GB

Nvidia driver version: 515.48.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.5
[pip3] torch==1.12.1
[pip3] torchaudio==0.12.1
[pip3] torchvision==0.13.1
[conda] blas                1.0       mkl
[conda] cudatoolkit         11.6.2    hfc3e2af_12                 conda-forge
[conda] ffmpeg              4.3       hf484d3e_0                  pytorch
[conda] libblas             3.9.0     12_linux64_mkl              conda-forge
[conda] libcblas            3.9.0     12_linux64_mkl              conda-forge
[conda] liblapack           3.9.0     12_linux64_mkl              conda-forge
[conda] mkl                 2021.4.0  h8d4b97c_729                conda-forge
[conda] mkl-service         2.4.0     py37h402132d_0              conda-forge
[conda] mkl_fft             1.3.1     py37h3e078e5_1              conda-forge
[conda] mkl_random          1.2.2     py37h219a48f_0              conda-forge
[conda] numpy               1.21.5    py37h6c91a56_3
[conda] numpy-base          1.21.5    py37ha15fc14_3
[conda] pytorch             1.12.1    py3.7_cuda11.6_cudnn8.3.2_0 pytorch
[conda] pytorch-mutex       1.0       cuda                        pytorch
[conda] torchaudio          0.12.1    py37_cu116                  pytorch
[conda] torchvision         0.13.1    py37_cu116                  pytorch
```
Oooh boy, that's a lot of GPUs. Do you have a way to isolate your node to a single GPU? We don't utilize multiple GPUs anyway.
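For instance (a minimal sketch; `CUDA_VISIBLE_DEVICES` must be set before PyTorch creates a CUDA context, so this goes at the very top of the script, or into the shell environment):

```python
# Minimal sketch: pin the process to a single GPU by restricting visibility.
# CUDA_VISIBLE_DEVICES must be set before any CUDA context is created,
# i.e. before torch first touches the GPU (or export it in the shell).
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # only physical GPU 0 stays visible

# import torch  # after this, torch.cuda.device_count() would report 1
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

The shell equivalent would be `CUDA_VISIBLE_DEVICES=0 python train.py -s <dataset>`.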
Best, George
I'm confining the training to GPU 0 only. It barely reaches 22 GB before throwing the CUDA error here:
```
  File "train.py", line 81, in training
    loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
  File "/media/sdc/merf_research/gaussian_mixture/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
    window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
```
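For clarity, the line that fails is the repo's weighted L1/D-SSIM blend. A scalar sketch of just the weighting (the numeric inputs are illustrative stand-ins; `lambda_dssim` defaults to 0.2 in the repo):

```python
# Scalar sketch of the train.py loss: a blend of L1 and (1 - SSIM),
# weighted by lambda_dssim. Ll1 and ssim_val are stand-in numbers here,
# not real image losses.
def combined_loss(Ll1, ssim_val, lambda_dssim=0.2):
    return (1.0 - lambda_dssim) * Ll1 + lambda_dssim * (1.0 - ssim_val)

# With Ll1 = 0.5 and SSIM = 0.8: 0.8 * 0.5 + 0.2 * 0.2 = 0.44
print(combined_loss(0.5, 0.8))
```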
I am not coming up with any bright ideas right now; I will discuss this with @Snosixtyboo tomorrow and maybe we can come up with something.
If you can pipe the logs to a file and send them, it might be useful.
Thanks! Meanwhile I'll try rebuilding the environment with slightly updated PyTorch and CUDA.
Best, Ibrahim
Hi, thanks for raising this! Could you tell us the exact flags you use to run this? It seems the code fails immediately, in the very first iteration. I also don't understand how Tanks and Temples can require 22 GB of VRAM; for me it never goes beyond 10.
Could you maybe
Hi,
I've added experimental state dumping functionality. You should get it via
```
pip uninstall diff-gaussian-rasterization
cd <gaussian-splatting>/submodules/diff-gaussian-rasterization
git pull
git checkout debug
python setup.py install
```
Then, if the training crashes at some point, it should write a BIG state file (>1GB). If this can then be uploaded somewhere and you think it's worth it, we will gladly take a look at it. If this is too bulky, we could think about other ways of finding the bug, but they will be more time-consuming...
Hello, I am also having the same problems. After various tweaks, this came up.
```
(gaussian_splatting) pppp@pppp:~/gaussian-splatting$ CUDA_LAUNCH_BLOCKING=1 python train.py -s data
Optimizing
Output folder: ./output/e7d171c6-1 [13/07 21:53:19]
Tensorboard not available: not logging progress [13/07 21:53:19]
Found transforms_train.json file, assuming Blender data set! [13/07 21:53:19]
Reading Training Transforms [13/07 21:53:19]
Reading Test Transforms [13/07 21:53:21]
Loading Training Cameras [13/07 21:53:21]
Loading Test Cameras [13/07 21:53:23]
Number of points at initialisation : 100000 [13/07 21:53:23]
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 208, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations)
  File "train.py", line 83, in training
    loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
  File "/home/pppp/-----/gaussian-splatting/utils/loss_utils.py", line 41, in ssim
    return _ssim(img1, img2, window, window_size, channel, size_average)
  File "//home/pppp/-----/gaussian-splatting/utils/loss_utils.py", line 45, in _ssim
    mu2 = F.conv2d(img2, window, padding=window_size // 2, groups=channel)
RuntimeError: CUDA error: an illegal instruction was encountered
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
```
Hey,
that looks like a different error. Are you running this on the NeRF Synthetic Blender dataset? Otherwise, we don't have support for arbitrary transforms-based data sets yet (see the console output "assuming Blender data set!"). Also, what were those tweaks?
I applied it to the DTU dataset with COLMAP undistortion applied. I would be delighted if you can at least get this data to work. I will attach the dataset and error here. Thank you a lot.
```
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ CUDA_LAUNCH_BLOCKING=1 python train.py -s data2
Optimizing
Output folder: ./output/4a22b6af-0 [14/07 02:22:33]
Tensorboard not available: not logging progress [14/07 02:22:33]
Reading camera 20/20 [14/07 02:22:33]
Loading Training Cameras [14/07 02:22:33]
Loading Test Cameras [14/07 02:22:36]
Number of points at initialisation : 5305 [14/07 02:22:36]
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 208, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations)
  File "train.py", line 77, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, background)
  File "/home/peter/Desktop/Research/gaussian-splatting/gaussian_renderer/__init__.py", line 98, in render
    "visibility_filter" : radii > 0,
RuntimeError: CUDA error: an illegal memory access was encountered
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
```
I also get the same error when I use your tandt_db.zip dataset.
Hi,
both the DTU dataset and TandT work for us locally; we cannot reproduce your error. What were those tweaks that you mentioned? Did you make tweaks to the code?
Also, [here](https://github.com/graphdeco-inria/gaussian-splatting/issues/23#issuecomment-1634728295) we describe how you can get a debug version of the rasterizer. This should create a crash dump that you can forward to us; otherwise, I don't know how we could help, since we can't reproduce your error.
Best, Bernhard
Hi,
I re-pulled all the source code (no more tweaks) and built the debug version that you recommended. I still have this error:
```
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ python train.py -s data4
Optimizing
Output folder: ./output/f910c547-5 [15/07 11:00:04]
Tensorboard not available: not logging progress [15/07 11:00:04]
Reading camera 11/11 [15/07 11:00:04]
Loading Training Cameras [15/07 11:00:04]
Loading Test Cameras [15/07 11:00:06]
Number of points at initialisation : 1696 [15/07 11:00:06]
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 213, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
  File "train.py", line 87, in training
    loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
  File "/home/peter/Desktop/Research/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
    window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
```
Then I added CUDA_......, and I still got this error:
```
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ CUDA_LAUNCH_BLOCKING=1 python train.py -s data4
Optimizing
Output folder: ./output/a5bd0541-a [15/07 11:01:22]
Tensorboard not available: not logging progress [15/07 11:01:22]
Reading camera 11/11 [15/07 11:01:22]
Loading Training Cameras [15/07 11:01:22]
Loading Test Cameras [15/07 11:01:24]
Number of points at initialisation : 1696 [15/07 11:01:24]
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 213, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
  File "train.py", line 81, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, background)
  File "/home/peter/Desktop/Research/gaussian-splatting/gaussian_renderer/__init__.py", line 98, in render
    "visibility_filter" : radii > 0,
RuntimeError: CUDA error: an illegal memory access was encountered
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
```
I commented out the `--Mapper.ba_global_function_tolerance=0.000001` line in the convert.py COLMAP code so that I can use an old version of COLMAP. I do NOT think that's the problem, since the tank and truck dataset provided in tandt_db.zip does not work either.
```
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ CUDA_LAUNCH_BLOCKING=1 python train.py -s data3/tandt/truck
Optimizing
Output folder: ./output/6242231d-a [15/07 11:06:44]
Tensorboard not available: not logging progress [15/07 11:06:44]
Reading camera 251/251 [15/07 11:06:45]
Loading Training Cameras [15/07 11:06:45]
Loading Test Cameras [15/07 11:06:50]
Number of points at initialisation : 136029 [15/07 11:06:50]
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 213, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
  File "train.py", line 81, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, background)
  File "/home/peter/Desktop/Research/gaussian-splatting/gaussian_renderer/__init__.py", line 98, in render
    "visibility_filter" : radii > 0,
RuntimeError: CUDA error: an illegal memory access was encountered
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
```
Some PC info:
```
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ nvidia-smi
Sat Jul 15 11:09:30 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:01:00.0  On |                  N/A |
| 30%   38C    P8              33W / 260W |    524MiB / 11264MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1186      G   /usr/lib/xorg/Xorg                         288MiB  |
|    0   N/A  N/A      1417      G   /usr/bin/gnome-shell                       125MiB  |
|    0   N/A  N/A      3343      G   ...9683677,17313697408044379519,262144     107MiB  |
+---------------------------------------------------------------------------------------+

(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
```
In addition, I have also tried reducing the number of images in case of memory issues, but it still gives errors.
I also tried the earlier suggestion with cuda-memcheck; it reported no errors:
```
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ cuda-memcheck python train.py -s data3/tandt/truck
========= CUDA-MEMCHECK
========= This tool is deprecated and will be removed in a future release of the CUDA toolkit
========= Please use the compute-sanitizer tool as a drop-in replacement
Optimizing
Output folder: ./output/49d3913d-c [15/07 11:13:35]
Tensorboard not available: not logging progress [15/07 11:13:35]
Reading camera 251/251 [15/07 11:13:36]
Loading Training Cameras [15/07 11:13:36]
Loading Test Cameras [15/07 11:13:41]
Number of points at initialisation : 136029 [15/07 11:13:41]
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 213, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
  File "train.py", line 87, in training
    loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
  File "/home/peter/Desktop/Research/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
    window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
```
Hi,
I repull all the source code again (No more tweaks) and did the debug version that you recommended. I still have this error
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ python train.py -s data4 Optimizing Output folder: ./output/f910c547-5 [15/07 11:00:04] Tensorboard not available: not logging progress [15/07 11:00:04] Reading camera 11/11 [15/07 11:00:04] Loading Training Cameras [15/07 11:00:04] Loading Test Cameras [15/07 11:00:06] Number of points at initialisation : 1696 [15/07 11:00:06] Training progress: 0%| | 0/30000 [00:00<?, ?it/s]Traceback (most recent call last): File "train.py", line 213, in <module> training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint) File "train.py", line 87, in training loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image)) File "/home/peter/Desktop/Research/gaussian-splatting/utils/loss_utils.py", line 38, in ssim window = window.cuda(img1.get_device()) RuntimeError: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Training progress: 0%| | 0/30000 [00:00<?, ?it/s] (gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$
Then I added CUDA_LAUNCH_BLOCKING=1 and still got this error:
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ CUDA_LAUNCH_BLOCKING=1 python train.py -s data4 Optimizing Output folder: ./output/a5bd0541-a [15/07 11:01:22] Tensorboard not available: not logging progress [15/07 11:01:22] Reading camera 11/11 [15/07 11:01:22] Loading Training Cameras [15/07 11:01:22] Loading Test Cameras [15/07 11:01:24] Number of points at initialisation : 1696 [15/07 11:01:24] Training progress: 0%| | 0/30000 [00:00<?, ?it/s]Traceback (most recent call last): File "train.py", line 213, in <module> training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint) File "train.py", line 81, in training render_pkg = render(viewpoint_cam, gaussians, pipe, background) File "/home/peter/Desktop/Research/gaussian-splatting/gaussian_renderer/__init__.py", line 98, in render "visibility_filter" : radii > 0, RuntimeError: CUDA error: an illegal memory access was encountered Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
Hi,
thanks for trying again! If you installed the debug version of the rasterizer, there should be a "Something went wrong" message and a crash_fw.dump or crash_bw.dump file that the rasterizer creates in the gaussian_splatting directory. Do you have this? If yes, can you upload it so we can take a look? If not, can you tell us how you went about installing the debug rasterizer?
In general, something basic seems to be wrong: some allocation or memory is in the wrong place, since it crashes immediately. And since you tried with our datasets too, we can rule out the data. It must be something very basic about the setup...
Best, Bernhard
I basically followed your environment.yaml instructions for the initial installation.
For the debug version of the rasterizer, I followed the instructions in (https://github.com/graphdeco-inria/gaussian-splatting/issues/23#issuecomment-1634728295). There is no crash_fw.dump or crash_bw.dump file created in the gaussian_splatting directory.
Hi,
I would be very interested in finding out what's going on here. Would you maybe be available for a Skype session? This should be less tedious than writing here.
Thank you for your help. I have tried the code on another PC and everything works. Unfortunately, it still doesn't work on this particular desktop (even after changing to CUDA toolkit 10.8). So this makes me wonder whether it's related to this GPU device.
Hi @fasogbon @ibrahimrazu I finally managed to put together the debug version of the rasterizer; I hope this will help. To use it, please do:
git pull
git submodule update
pip uninstall diff-gaussian-rasterization (confirm with "y")
pip install submodules/diff-gaussian-rasterization
and then run what failed before with --debug. This is slow, so if it takes a while for the error to appear, you can also use --debug_from <iteration> to start debugging only at a certain point. If everything goes well, you should get an error message and a snapshot_fw or snapshot_bw file in the gaussian_splatting directory. If you could forward this file to us, we could take a look to see if we find something wrong!
Best, Bernhard
Hi,
I also ran into this error when running python train.py -s data/lego/ --debug. The complete output is shown below.
My error occurs directly in gaussian_keys_unsorted[off] = key; in duplicateWithKeys(). It is caused by incorrect retrieval of the variable int num_rendered; in CHECK_CUDA(cudaMemcpy(&num_rendered, geomState.point_offsets + P - 1, sizeof(int), cudaMemcpyDeviceToHost), debug);, which in turn is caused by cub::DeviceScan::InclusiveSum(geomState.scanning_space, geomState.scan_size, geomState.tiles_touched, geomState.point_offsets, P).
Strangely, InclusiveSum() only processed 81408 out of 100000 points. When P is decreased to 50000, only ~40000 items are processed. This can be reproduced on two Ubuntu systems, with CUDA (nvcc) versions ranging across 11.6/11.7/11.8 and 12.1, so the CUDA version does not seem to be to blame.
Although this case can be worked around by wrapping InclusiveSum in another function with a larger num_items and more space, the subsequent cub::DeviceRadixSort::SortPairs() call also seems to cause an error under cuda-gdb (the sorting problem cannot be solved by this trick), which in turn causes another illegal memory access in FORWARD::render().
Does anyone know how to solve this problem? Should I make some change to the include and library paths of CUDA?
Optimizing Output folder: ./output/dc1f9e24-7 [10/09 15:23:06] Tensorboard not available: not logging progress [10/09 15:23:06] Found transforms_train.json file, assuming Blender data set! [10/09 15:23:06] Reading Training Transforms [10/09 15:23:06] Reading Test Transforms [10/09 15:23:10] Loading Training Cameras [10/09 15:23:16] Loading Test Cameras [10/09 15:23:18] Number of points at initialisation : 100000 [10/09 15:23:18] Training progress: 0%| | 0/30000 [00:00<?, ?it/s] [CUDA ERROR] in cuda_rasterizer/rasterizer_impl.cu Line 334: an illegal memory access was encountered An error occured in forward. Please forward snapshot_fw.dump for debugging. [10/09 15:23:19] Traceback (most recent call last): File "train.py", line 218, in <module> training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from) File "train.py", line 83, in training render_pkg = render(viewpoint_cam, gaussians, pipe, background) File "/home/jerry/Documents/Research/gaussian-splatting/gaussian_renderer/__init__.py", line 93, in render cov3D_precomp = cov3D_precomp) File "/home/jerry/.conda/envs/gaussian_splatting/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/home/jerry/.conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 226, in forward raster_settings, File "/home/jerry/.conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 41, in rasterize_gaussians raster_settings, File "/home/jerry/.conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 97, in forward raise ex File "/home/jerry/.conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 86, in forward num_rendered, color, radii, geomBuffer, 
binningBuffer, imgBuffer = _C.rasterize_gaussians(*args) RuntimeError: an illegal memory access was encountered Training progress: 0%|
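For context, the inclusive scan that cub::DeviceScan::InclusiveSum computes over tiles_touched can be sketched in NumPy (the array values below are made up for illustration; the real code runs on the GPU over per-Gaussian tile counts):

```python
import numpy as np

# Hypothetical per-Gaussian tile counts (real values come from preprocessing)
tiles_touched = np.array([2, 0, 3, 1, 4])

# Inclusive prefix sum: point_offsets[i] = sum(tiles_touched[:i+1])
point_offsets = np.cumsum(tiles_touched)  # [2, 2, 5, 6, 10]

# The last element is the total number of (key, gaussian) pairs to allocate,
# i.e. what the rasterizer copies back into num_rendered.
num_rendered = int(point_offsets[-1])  # 10
```

If the device scan silently stops early, as described above, num_rendered comes out too small, the key buffers are under-allocated, and writes like gaussian_keys_unsorted[off] = key go out of bounds, which matches the illegal memory access.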
@Jerry18231174 have you managed to solve the error?
Yes, I replaced "pip install submodules/diff-gaussian-rasterization" with ninja just-in-time compiling, and everything just seems to work fine.
Hi, I am not familiar with ninja; could you explain in a bit more detail?
Thanks! After running cuda-memcheck I got this:
Traceback (most recent call last): File "train.py", line 204, in training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations) File "train.py", line 75, in training render_pkg = render(viewpoint_cam, gaussians, pipe, background) File "/media/sdc/merf_research/gaussian_mixture/gaussian-splatting/gaussian_renderer/init.py", line 98, in render "visibility_filter" : radii > 0, RuntimeError: CUDA error: unspecified launch failure CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Training progress: 0%| | 0/30000 [00:01<?, ?it/s] ========= ERROR SUMMARY: 505 errors
Based on the logs, it seems like the problem is with the rasterizer.
Hi! Could you please explain in detail how you solved this problem? I have run into the same issue and am struggling to figure out how to resolve it.
Hi, Thanks a lot for sharing your great work. I've built the environment properly and while running the tanks and temples dataset, getting following error:
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]Traceback (most recent call last): File "train.py", line 204, in <module>
training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations)
File "train.py", line 81, in training
loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
File "/media/sdc/merf_research/gaussian_mixture/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Any idea how to resolve? My environment matches exactly with your repo
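As an aside on the line the traceback points at: window = window.cuda(img1.get_device()) assumes img1 is already a CUDA tensor. A more device-agnostic pattern in PyTorch is Tensor.to(other.device). This is only a minimal sketch of that pattern (the match_device helper is hypothetical, and it does not fix the underlying rasterizer crash, which is where the illegal access actually originates):

```python
import torch

def match_device(window: torch.Tensor, img: torch.Tensor) -> torch.Tensor:
    # .to(img.device) is a no-op if both tensors already share a device,
    # and unlike Tensor.get_device() it does not assume img is on a GPU.
    return window.to(img.device)

window = torch.ones(11, 11)
img = torch.zeros(3, 256, 256)  # a CPU tensor in this sketch
window = match_device(window, img)
```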