Closed ibrahimrazu closed 1 year ago
Hi, can you run with cuda-memcheck?
On Wed, Jul 12, 2023 at 8:44 PM Md Ibrahim Khalil @.***> wrote:
Hi, Thanks a lot for sharing your great work. I've built the environment properly and while running the tanks and temples dataset, getting following error:
```
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 204, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations)
  File "train.py", line 81, in training
    loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
  File "/media/sdc/merf_research/gaussian_mixture/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
    window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
Any idea how to resolve this? My environment matches your repo exactly.
@grgkopanas hi, you mean running some .cu script with cuda-memcheck? Sorry, I did not understand.
You can run from the command line: `cuda-memcheck python script.py [cli args]`
thanks! After running cuda-memcheck I got this:

```
Traceback (most recent call last):
  File "train.py", line 204, in <module>
```
Based on the logs, it seems like the problem is with the rasterizer.
Is that all the logs? What about the 505 errors? :D
A glimpse of that :)
```
========= Invalid __global__ write of size 8
=========     at 0x000006d0 in duplicateWithKeys(int, float2 const*, float const*, unsigned int const*, unsigned long*, unsigned int*, int*, dim3)
=========     by thread (160,0,0) in block (211,0,0)
=========     Address 0x7f63d1d08820 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2e9441]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.11.0 [0x1433c]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/torch/lib/../../../../libcudart.so.11.0 (cudaLaunchKernel + 0x1d8) [0x69c38]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/_C.cpython-37m-x86_64-linux-gnu.so [0x3e40c]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/_C.cpython-37m-x86_64-linux-gnu.so (_Z62__device_stub__Z17duplicateWithKeysiPK6float2PKfPKjPmPjPi4dim3iPK6float2PKfPKjPmPjPiR4dim3 + 0x1e9) [0x3bf69]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/_C.cpython-37m-x86_64-linux-gnu.so (_Z17duplicateWithKeysiPK6float2PKfPKjPmPjPi4dim3 + 0x49) [0x3bfcc]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/_C.cpython-37m-x86_64-linux-gnu.so (_ZN14CudaRasterizer10Rasterizer7forwardESt8functionIFPcmEES4_S4_iiiPKfiiS6_S6_S6_S6_S6_fS6_S6_S6_S6_S6_ffbPfPi + 0x519) [0x3b4eb]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/_C.cpython-37m-x86_64-linux-gnu.so (_Z22RasterizeGaussiansCUDARKN2at6TensorES2_S2_S2_S2_S2_fS2_S2_S2_ffiiS2_iS2_b + 0x94d) [0x5e597]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/_C.cpython-37m-x86_64-linux-gnu.so [0x5d245]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/_C.cpython-37m-x86_64-linux-gnu.so [0x57b52]
=========     Host Frame:python (PyCFunction_Call + 0xa0) [0xd5990]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x558e) [0xb882e]
=========     Host Frame:python (_PyFunction_FastCallDict + 0x116) [0xcce46]
=========     Host Frame:/opt/conda/envs/gaussian_splatting/lib/python3.7/site-packages/torch/lib/libtorch_python.so (_Z17THPFunction_applyP7_objectS0_ + 0x5d6) [0x696c96]
=========     Host Frame:python (_PyMethodDef_RawFastCallKeywords + 0x1fb) [0xbb4ab]
=========     Host Frame:python [0xbaf40]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x469a) [0xb793a]
=========     Host Frame:python (_PyFunction_FastCallKeywords + 0x106) [0xc61f6]
=========     Host Frame:python [0xbae2f]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x971) [0xb3c11]
=========     Host Frame:python (_PyEval_EvalCodeWithName + 0x201) [0xb2041]
=========     Host Frame:python (_PyFunction_FastCallDict + 0x2d6) [0xcd006]
=========     Host Frame:python [0xd5ec0]
=========     Host Frame:python (PyObject_Call + 0x51) [0xd35a1]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x1ea8) [0xb5148]
```
Can you share the dataset with us?
Sorry, I just realized you mentioned Tanks and Temples; which scene are you trying?
It's the same T&T dataset, downloaded from here: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/datasets/input/tandt_db.zip
@grgkopanas thanks! I was trying truck.
It's my environment:
```
Python version: 3.7.13 (default, Oct 18 2022, 18:57:03) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.15.2.el7.x86_64-x86_64-with-debian-bookworm-sid
Is CUDA available: True
CUDA runtime version: 11.7.99
GPU models and configuration:
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
GPU 2: Tesla V100-PCIE-32GB
GPU 3: Tesla V100-PCIE-32GB
GPU 4: Tesla V100-PCIE-32GB
GPU 5: Tesla V100-PCIE-32GB
GPU 6: Tesla V100-PCIE-32GB
GPU 7: Tesla V100-PCIE-32GB

Nvidia driver version: 515.48.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.5
[pip3] torch==1.12.1
[pip3] torchaudio==0.12.1
[pip3] torchvision==0.13.1
[conda] blas                1.0       mkl
[conda] cudatoolkit         11.6.2    hfc3e2af_12                 conda-forge
[conda] ffmpeg              4.3       hf484d3e_0                  pytorch
[conda] libblas             3.9.0     12_linux64_mkl              conda-forge
[conda] libcblas            3.9.0     12_linux64_mkl              conda-forge
[conda] liblapack           3.9.0     12_linux64_mkl              conda-forge
[conda] mkl                 2021.4.0  h8d4b97c_729                conda-forge
[conda] mkl-service         2.4.0     py37h402132d_0              conda-forge
[conda] mkl_fft             1.3.1     py37h3e078e5_1              conda-forge
[conda] mkl_random          1.2.2     py37h219a48f_0              conda-forge
[conda] numpy               1.21.5    py37h6c91a56_3
[conda] numpy-base          1.21.5    py37ha15fc14_3
[conda] pytorch             1.12.1    py3.7_cuda11.6_cudnn8.3.2_0 pytorch
[conda] pytorch-mutex       1.0       cuda                        pytorch
[conda] torchaudio          0.12.1    py37_cu116                  pytorch
[conda] torchvision         0.13.1    py37_cu116                  pytorch
```
Oooh boy, that's a lot of GPUs. Do you have a way to isolate your node to a single GPU? We don't utilize multiple GPUs anyway.
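For instance (a minimal sketch; `CUDA_VISIBLE_DEVICES` must be set before PyTorch creates a CUDA context, so this goes at the very top of the script, or into the shell environment):

```python
# Minimal sketch: pin the process to a single GPU by restricting visibility.
# CUDA_VISIBLE_DEVICES must be set before any CUDA context is created,
# i.e. before torch first touches the GPU (or export it in the shell).
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # only physical GPU 0 stays visible

# import torch  # after this, torch.cuda.device_count() would report 1
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

The shell equivalent would be `CUDA_VISIBLE_DEVICES=0 python train.py -s <dataset>`.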
Best, George
I'm confining the training to GPU 0 only. It barely reaches 22 GB before throwing the CUDA error here:
```
  File "train.py", line 81, in training
    loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
  File "/media/sdc/merf_research/gaussian_mixture/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
    window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
```
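For clarity, the line that fails is the repo's weighted L1/D-SSIM blend. A scalar sketch of just the weighting (the numeric inputs are illustrative stand-ins; `lambda_dssim` defaults to 0.2 in the repo):

```python
# Scalar sketch of the train.py loss: a blend of L1 and (1 - SSIM),
# weighted by lambda_dssim. Ll1 and ssim_val are stand-in numbers here,
# not real image losses.
def combined_loss(Ll1, ssim_val, lambda_dssim=0.2):
    return (1.0 - lambda_dssim) * Ll1 + lambda_dssim * (1.0 - ssim_val)

# With Ll1 = 0.5 and SSIM = 0.8: 0.8 * 0.5 + 0.2 * 0.2 = 0.44
print(combined_loss(0.5, 0.8))
```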
I am not coming up with any bright ideas right now; I will discuss this with @Snosixtyboo tomorrow and maybe we can come up with something.
If you can pipe the logs to a file and send them, it might be useful.
Thanks! Meanwhile I'll try rebuilding the environment with slightly updated PyTorch and CUDA.
Best, Ibrahim
Hi, thanks for raising this! Could you tell us the exact flags you use to run this? It seems the code fails immediately, in the very first iteration. I also don't understand how Tanks and Temples can require 22 GB of VRAM; for me it never goes beyond 10.
Could you maybe
Hi,
I've added experimental state dumping functionality. You should get it via
```
pip uninstall diff-gaussian-rasterization
cd <gaussian-splatting>/submodules/diff-gaussian-rasterization
git pull
git checkout debug
python setup.py install
```
Then, if the training crashes at some point, it should write a BIG state file (>1GB). If this can then be uploaded somewhere and you think it's worth it, we will gladly take a look at it. If this is too bulky, we could think about other ways of finding the bug, but they will be more time-consuming...
Hello, I am also having the same problems. After various tweaks, this came up.
```
(gaussian_splatting) pppp@pppp:~/gaussian-splatting$ CUDA_LAUNCH_BLOCKING=1 python train.py -s data
Optimizing
Output folder: ./output/e7d171c6-1 [13/07 21:53:19]
Tensorboard not available: not logging progress [13/07 21:53:19]
Found transforms_train.json file, assuming Blender data set! [13/07 21:53:19]
Reading Training Transforms [13/07 21:53:19]
Reading Test Transforms [13/07 21:53:21]
Loading Training Cameras [13/07 21:53:21]
Loading Test Cameras [13/07 21:53:23]
Number of points at initialisation : 100000 [13/07 21:53:23]
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 208, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations)
  File "train.py", line 83, in training
    loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
  File "/home/pppp/-----/gaussian-splatting/utils/loss_utils.py", line 41, in ssim
    return _ssim(img1, img2, window, window_size, channel, size_average)
  File "//home/pppp/-----/gaussian-splatting/utils/loss_utils.py", line 45, in _ssim
    mu2 = F.conv2d(img2, window, padding=window_size // 2, groups=channel)
RuntimeError: CUDA error: an illegal instruction was encountered
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
```
Hey,
that looks like a different error. Are you running this on the NeRF Synthetic Blender dataset? Otherwise, we don't have support for arbitrary transforms-based data sets yet (see the console output "assuming Blender data set!"). Also, what were those tweaks?
I applied it to the DTU dataset with COLMAP undistortion applied. I would be delighted if you can at least get this data to work. I will attach the dataset and error here. Thank you a lot.
```
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ CUDA_LAUNCH_BLOCKING=1 python train.py -s data2
Optimizing
Output folder: ./output/4a22b6af-0 [14/07 02:22:33]
Tensorboard not available: not logging progress [14/07 02:22:33]
Reading camera 20/20 [14/07 02:22:33]
Loading Training Cameras [14/07 02:22:33]
Loading Test Cameras [14/07 02:22:36]
Number of points at initialisation : 5305 [14/07 02:22:36]
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 208, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations)
  File "train.py", line 77, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, background)
  File "/home/peter/Desktop/Research/gaussian-splatting/gaussian_renderer/__init__.py", line 98, in render
    "visibility_filter" : radii > 0,
RuntimeError: CUDA error: an illegal memory access was encountered
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
```
I also get the same error when I use your tandt_db.zip dataset.
Hi,
both the DTU dataset and TandT work for us locally; we cannot reproduce your error. What were those tweaks that you mentioned? Did you make tweaks to the code?
Also, [here](https://github.com/graphdeco-inria/gaussian-splatting/issues/23#issuecomment-1634728295) we describe how you can get a debug version of the rasterizer. This should create a crash dump that you can forward to us; otherwise, I don't know how we could help, since we can't reproduce your error.
Best, Bernhard
Hi,
I re-pulled all the source code (no more tweaks) and built the debug version that you recommended. I still have this error:
```
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ python train.py -s data4
Optimizing
Output folder: ./output/f910c547-5 [15/07 11:00:04]
Tensorboard not available: not logging progress [15/07 11:00:04]
Reading camera 11/11 [15/07 11:00:04]
Loading Training Cameras [15/07 11:00:04]
Loading Test Cameras [15/07 11:00:06]
Number of points at initialisation : 1696 [15/07 11:00:06]
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 213, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
  File "train.py", line 87, in training
    loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
  File "/home/peter/Desktop/Research/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
    window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
```
Then I added CUDA_......, and I still got this error:
```
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ CUDA_LAUNCH_BLOCKING=1 python train.py -s data4
Optimizing
Output folder: ./output/a5bd0541-a [15/07 11:01:22]
Tensorboard not available: not logging progress [15/07 11:01:22]
Reading camera 11/11 [15/07 11:01:22]
Loading Training Cameras [15/07 11:01:22]
Loading Test Cameras [15/07 11:01:24]
Number of points at initialisation : 1696 [15/07 11:01:24]
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 213, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
  File "train.py", line 81, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, background)
  File "/home/peter/Desktop/Research/gaussian-splatting/gaussian_renderer/__init__.py", line 98, in render
    "visibility_filter" : radii > 0,
RuntimeError: CUDA error: an illegal memory access was encountered
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
```
I commented out the `--Mapper.ba_global_function_tolerance=0.000001` line in the convert.py COLMAP code so that I can use an old version of COLMAP. I do NOT think that's the problem, since the tank and truck dataset provided in tandt_db.zip does not work either.
```
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ CUDA_LAUNCH_BLOCKING=1 python train.py -s data3/tandt/truck
Optimizing
Output folder: ./output/6242231d-a [15/07 11:06:44]
Tensorboard not available: not logging progress [15/07 11:06:44]
Reading camera 251/251 [15/07 11:06:45]
Loading Training Cameras [15/07 11:06:45]
Loading Test Cameras [15/07 11:06:50]
Number of points at initialisation : 136029 [15/07 11:06:50]
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 213, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
  File "train.py", line 81, in training
    render_pkg = render(viewpoint_cam, gaussians, pipe, background)
  File "/home/peter/Desktop/Research/gaussian-splatting/gaussian_renderer/__init__.py", line 98, in render
    "visibility_filter" : radii > 0,
RuntimeError: CUDA error: an illegal memory access was encountered
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
```
Some PC info:
```
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ nvidia-smi
Sat Jul 15 11:09:30 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:01:00.0  On |                  N/A |
| 30%   38C    P8              33W / 260W |    524MiB / 11264MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1186      G   /usr/lib/xorg/Xorg                         288MiB  |
|    0   N/A  N/A      1417      G   /usr/bin/gnome-shell                       125MiB  |
|    0   N/A  N/A      3343      G   ...9683677,17313697408044379519,262144     107MiB  |
+---------------------------------------------------------------------------------------+

(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
```
In addition, I have also tried reducing the number of images in case of memory issues, but it still gives errors.
I also tried the earlier suggestion with cuda-memcheck; it reported no errors:
```
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ cuda-memcheck python train.py -s data3/tandt/truck
========= CUDA-MEMCHECK
========= This tool is deprecated and will be removed in a future release of the CUDA toolkit
========= Please use the compute-sanitizer tool as a drop-in replacement
Optimizing
Output folder: ./output/49d3913d-c [15/07 11:13:35]
Tensorboard not available: not logging progress [15/07 11:13:35]
Reading camera 251/251 [15/07 11:13:36]
Loading Training Cameras [15/07 11:13:36]
Loading Test Cameras [15/07 11:13:41]
Number of points at initialisation : 136029 [15/07 11:13:41]
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 213, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint)
  File "train.py", line 87, in training
    loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
  File "/home/peter/Desktop/Research/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
    window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
```
Hi,
I repull all the source code again (No more tweaks) and did the debug version that you recommended. I still have this error
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ python train.py -s data4 Optimizing Output folder: ./output/f910c547-5 [15/07 11:00:04] Tensorboard not available: not logging progress [15/07 11:00:04] Reading camera 11/11 [15/07 11:00:04] Loading Training Cameras [15/07 11:00:04] Loading Test Cameras [15/07 11:00:06] Number of points at initialisation : 1696 [15/07 11:00:06] Training progress: 0%| | 0/30000 [00:00<?, ?it/s]Traceback (most recent call last): File "train.py", line 213, in <module> training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint) File "train.py", line 87, in training loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image)) File "/home/peter/Desktop/Research/gaussian-splatting/utils/loss_utils.py", line 38, in ssim window = window.cuda(img1.get_device()) RuntimeError: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Training progress: 0%| | 0/30000 [00:00<?, ?it/s] (gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$
Then I added CUDA_LAUNCH_BLOCKING=1 and still got this error:
(gaussian_splatting) peter@peter:~/Desktop/Research/gaussian-splatting$ CUDA_LAUNCH_BLOCKING=1 python train.py -s data4 Optimizing Output folder: ./output/a5bd0541-a [15/07 11:01:22] Tensorboard not available: not logging progress [15/07 11:01:22] Reading camera 11/11 [15/07 11:01:22] Loading Training Cameras [15/07 11:01:22] Loading Test Cameras [15/07 11:01:24] Number of points at initialisation : 1696 [15/07 11:01:24] Training progress: 0%| | 0/30000 [00:00<?, ?it/s]Traceback (most recent call last): File "train.py", line 213, in <module> training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint) File "train.py", line 81, in training render_pkg = render(viewpoint_cam, gaussians, pipe, background) File "/home/peter/Desktop/Research/gaussian-splatting/gaussian_renderer/__init__.py", line 98, in render "visibility_filter" : radii > 0, RuntimeError: CUDA error: an illegal memory access was encountered Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
Hi,
thanks for trying again! If you installed the debug version of the rasterizer, there should be a "Something went wrong" message and a crash_fw.dump or crash_bw.dump file that the rasterizer creates in the gaussian_splatting directory. Do you have this? If yes, can you upload it so we can take a look? If not, can you tell us how you went about installing the debug rasterizer?
In general, something basic seems to be wrong: some allocation or memory is in the wrong place, since it crashes immediately. And since you tried with our datasets too, we can rule out the data. It must be something very basic about the setup...
Best, Bernhard
I basically followed your environment.yaml instructions for the initial installation.
For the debug version of the rasterizer, I followed the instructions in (https://github.com/graphdeco-inria/gaussian-splatting/issues/23#issuecomment-1634728295). There is no crash_fw.dump or crash_bw.dump file created in the gaussian_splatting directory.
Hi,
I would be very interested in finding out what's going on here. Would you maybe be available for a Skype session? This should be less tedious than writing here.
Thank you for your help. I have tried the code on another PC and everything works. Unfortunately, it still doesn't work on this particular desktop (even after changing to CUDA toolkit 10.8). So this makes me wonder whether it's related to this GPU device.
Hi @fasogbon @ibrahimrazu I finally managed to put together the debug version of the rasterizer; I hope this will help. To use it, please do:
git pull
git submodule update
pip uninstall diff-gaussian-rasterization (confirm with "y")
pip install submodules/diff-gaussian-rasterization
and then run what failed before with --debug. This is slow, so if it takes a while for the error to appear, you can also use --debug_from <iteration> to start debugging only at a certain point. If everything goes well, you should get an error message and a snapshot_fw or snapshot_bw file in the gaussian_splatting directory. If you could forward this file to us, we could take a look to see if we find something wrong!
Best, Bernhard
Hi,
I also ran into this error when running python train.py -s data/lego/ --debug. The complete output is shown below.
My error occurs directly in gaussian_keys_unsorted[off] = key; in duplicateWithKeys(). It is caused by incorrect retrieval of the variable int num_rendered; in CHECK_CUDA(cudaMemcpy(&num_rendered, geomState.point_offsets + P - 1, sizeof(int), cudaMemcpyDeviceToHost), debug);, which in turn is caused by cub::DeviceScan::InclusiveSum(geomState.scanning_space, geomState.scan_size, geomState.tiles_touched, geomState.point_offsets, P).
Strangely, InclusiveSum() only processed 81408 out of 100000 points. When P is decreased to 50000, only ~40000 items are processed. This can be reproduced on two Ubuntu systems, with CUDA (nvcc) versions ranging across 11.6/11.7/11.8 and 12.1, so the CUDA version does not seem to be to blame.
Although this case can be worked around by wrapping InclusiveSum in another function with a larger num_items and more space, the subsequent cub::DeviceRadixSort::SortPairs() call also seems to cause an error under cuda-gdb (the sorting problem cannot be solved by this trick), which in turn causes another illegal memory access in FORWARD::render().
Does anyone know how to solve this problem? Should I make some change to the include and library paths of CUDA?
Optimizing Output folder: ./output/dc1f9e24-7 [10/09 15:23:06] Tensorboard not available: not logging progress [10/09 15:23:06] Found transforms_train.json file, assuming Blender data set! [10/09 15:23:06] Reading Training Transforms [10/09 15:23:06] Reading Test Transforms [10/09 15:23:10] Loading Training Cameras [10/09 15:23:16] Loading Test Cameras [10/09 15:23:18] Number of points at initialisation : 100000 [10/09 15:23:18] Training progress: 0%| | 0/30000 [00:00<?, ?it/s] [CUDA ERROR] in cuda_rasterizer/rasterizer_impl.cu Line 334: an illegal memory access was encountered An error occured in forward. Please forward snapshot_fw.dump for debugging. [10/09 15:23:19] Traceback (most recent call last): File "train.py", line 218, in <module> training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from) File "train.py", line 83, in training render_pkg = render(viewpoint_cam, gaussians, pipe, background) File "/home/jerry/Documents/Research/gaussian-splatting/gaussian_renderer/__init__.py", line 93, in render cov3D_precomp = cov3D_precomp) File "/home/jerry/.conda/envs/gaussian_splatting/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl return forward_call(*input, **kwargs) File "/home/jerry/.conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 226, in forward raster_settings, File "/home/jerry/.conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 41, in rasterize_gaussians raster_settings, File "/home/jerry/.conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 97, in forward raise ex File "/home/jerry/.conda/envs/gaussian_splatting/lib/python3.7/site-packages/diff_gaussian_rasterization/__init__.py", line 86, in forward num_rendered, color, radii, geomBuffer, 
binningBuffer, imgBuffer = _C.rasterize_gaussians(*args) RuntimeError: an illegal memory access was encountered Training progress: 0%|
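For context, the inclusive scan that cub::DeviceScan::InclusiveSum computes over tiles_touched can be sketched in NumPy (the array values below are made up for illustration; the real code runs on the GPU over per-Gaussian tile counts):

```python
import numpy as np

# Hypothetical per-Gaussian tile counts (real values come from preprocessing)
tiles_touched = np.array([2, 0, 3, 1, 4])

# Inclusive prefix sum: point_offsets[i] = sum(tiles_touched[:i+1])
point_offsets = np.cumsum(tiles_touched)  # [2, 2, 5, 6, 10]

# The last element is the total number of (key, gaussian) pairs to allocate,
# i.e. what the rasterizer copies back into num_rendered.
num_rendered = int(point_offsets[-1])  # 10
```

If the device scan silently stops early, as described above, num_rendered comes out too small, the key buffers are under-allocated, and writes like gaussian_keys_unsorted[off] = key go out of bounds, which matches the illegal memory access.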
@Jerry18231174 have you managed to solve the error?
Yes, I replaced "pip install submodules/diff-gaussian-rasterization" with ninja just-in-time compiling, and everything just seems to work fine.
Hi, I am not familiar with ninja; could you explain in a bit more detail?
Thanks! After running cuda-memcheck I got this:
Traceback (most recent call last): File "train.py", line 204, in training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations) File "train.py", line 75, in training render_pkg = render(viewpoint_cam, gaussians, pipe, background) File "/media/sdc/merf_research/gaussian_mixture/gaussian-splatting/gaussian_renderer/init.py", line 98, in render "visibility_filter" : radii > 0, RuntimeError: CUDA error: unspecified launch failure CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Training progress: 0%| | 0/30000 [00:01<?, ?it/s] ========= ERROR SUMMARY: 505 errors
Based on the logs, it seems like the problem is with the rasterizer.
Hi! Could you please explain in detail how you solved this problem? I have run into the same issue and am struggling to figure out how to resolve it.
Hi, Thanks a lot for sharing your great work. I've built the environment properly and while running the tanks and temples dataset, getting following error:
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]Traceback (most recent call last): File "train.py", line 204, in <module>
training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations)
File "train.py", line 81, in training
loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
File "/media/sdc/merf_research/gaussian_mixture/gaussian-splatting/utils/loss_utils.py", line 38, in ssim
window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Any idea how to resolve? My environment matches exactly with your repo
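As an aside on the line the traceback points at: window = window.cuda(img1.get_device()) assumes img1 is already a CUDA tensor. A more device-agnostic pattern in PyTorch is Tensor.to(other.device). This is only a minimal sketch of that pattern (the match_device helper is hypothetical, and it does not fix the underlying rasterizer crash, which is where the illegal access actually originates):

```python
import torch

def match_device(window: torch.Tensor, img: torch.Tensor) -> torch.Tensor:
    # .to(img.device) is a no-op if both tensors already share a device,
    # and unlike Tensor.get_device() it does not assume img is on a GPU.
    return window.to(img.device)

window = torch.ones(11, 11)
img = torch.zeros(3, 256, 256)  # a CPU tensor in this sketch
window = match_device(window, img)
```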