maturk / dn-splatter

DN-Splatter: Depth and Normal Priors for Gaussian Splatting and Meshing
https://maturk.github.io/dn-splatter/
Apache License 2.0
472 stars 29 forks

CUDA illegal memory access when training with web viewer running #81

Open Haven-Lau opened 2 weeks ago

Haven-Lau commented 2 weeks ago

Hi, first of all I just wanted to thank you for this amazing project! I've wanted to leverage a depth camera as a prior for training Gaussian splats for a while, and I can't believe it took me this long to stumble upon this project.

I'm currently facing this issue when training with nerfstudio viewer on:

Traceback (most recent call last):
  File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\haven\miniconda3\envs\nerfstudio\Scripts\ns-train.exe\__main__.py", line 7, in <module>
    sys.exit(entrypoint())
  File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\scripts\train.py", line 262, in entrypoint
    main(
  File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\scripts\train.py", line 247, in main
    launch(
  File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\scripts\train.py", line 189, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\scripts\train.py", line 100, in train_loop
    trainer.train()
  File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\engine\trainer.py", line 261, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
  File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\utils\profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\engine\trainer.py", line 496, in train_iteration
    _, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
  File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\utils\profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "C:\Users\haven\miniconda3\envs\nerfstudio\lib\site-packages\nerfstudio\pipelines\base_pipeline.py", line 302, in get_train_loss_dict
    metrics_dict = self.model.get_metrics_dict(model_outputs, batch)
  File "C:\Users\haven\code\nerfstudio\dn-splatter\dn_splatter\dn_model.py", line 750, in get_metrics_dict
    "rgb_mse": float(rgb_mse),
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
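
As the error message hints, kernel launches are asynchronous by default, so the traceback above may not point at the real failing kernel. One way to get an accurate trace (a command-line fragment, not specific to dn-splatter) is to set the environment variable before re-running training:

```shell
# Make CUDA kernel launches synchronous so the Python traceback points at the
# kernel that actually faulted (bash syntax; on Windows cmd use:
#   set CUDA_LAUNCH_BLOCKING=1 )
export CUDA_LAUNCH_BLOCKING=1
```

Training will run noticeably slower with this set, so it is only worth enabling while reproducing the crash.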

My data was captured with an Azure Kinect sensor using SAI. At first I used the included process_sai.py to preprocess the recorded data, but the transforms.json output contained camera intrinsics that nerfstudio's undistort function didn't like (k4 was non-zero), so I copied the camera intrinsic values from sai-cli process instead (which gave k1 = k2 = p1 = p2 = 0; I'm not sure the distortion values matter much on the Kinect), and training now starts properly.
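
For anyone hitting the same non-zero k4 problem, a workaround in the same spirit is to zero out the distortion coefficients in the generated file before training. A minimal sketch, assuming the coefficients live at the top level of a nerfstudio-style transforms.json (check where your preprocessing script actually writes them):

```python
import json

def zero_distortion(path: str = "transforms.json") -> None:
    """Zero out radial/tangential distortion coefficients in a nerfstudio
    transforms.json so the undistortion step becomes a no-op.

    Assumption: the coefficients are stored as top-level keys, as the
    SAI/sai-cli exporters emit them; adapt if yours are per-frame."""
    with open(path) as f:
        meta = json.load(f)
    for key in ("k1", "k2", "k3", "k4", "p1", "p2"):
        if key in meta:
            meta[key] = 0.0
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)
```

This keeps the focal lengths and principal point untouched, which are the values that actually matter for projection; whether discarding distortion is acceptable depends on how well-calibrated the Kinect stream already is.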

When I run ns-train with the nerfstudio web viewer open, it throws a CUDA illegal memory access error at around 5000-7000 steps; without the web viewer it runs without complaint. I've run ns-train multiple times with and without the web viewer, and it only fails when the viewer is running. Has anyone seen similar behavior?

System info:

>>> torch.__version__
'2.1.2+cu118'

>conda list nerfstudio
# Name                    Version                   Build  Channel
nerfstudio                1.1.3                    pypi_0    pypi

>nvidia-smi
Sat Oct 19 21:03:41 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.94                 Driver Version: 560.94         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:06:00.0  On |                  N/A |
|  0%   62C    P2            212W /  350W |    3135MiB /  24576MiB |     76%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Thanks again

maturk commented 2 weeks ago

@Haven-Lau, I have seen a similar phenomenon in my early experiments. What I found was that the viewer can crash training if you move the viewer camera so that it no longer sees the Gaussian scene properly (like turning 180 degrees and looking at nothing). I wonder if this is the same issue for you.

Haven-Lau commented 2 weeks ago

@maturk Thanks for the quick reply!

Yes, it does sound similar. It crashes regardless of where I'm looking (eventually, as long as the viewer is running), but it is definitely more likely to crash when I pan around very quickly or stare at nothing. I wonder if it's a race condition between ns-train dn-splatter and the viewer. However, today I had my first crash without the viewer running. Is there a way to save checkpoints throughout the training process instead of only at 100%?
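
For reference on the checkpoint question: nerfstudio's trainer exposes checkpoint-frequency options on the command line. Assuming the v1.1.x flag names (verify with `ns-train dn-splatter --help`), something like the following should save intermediate checkpoints rather than only the final one:

```shell
# Assumed nerfstudio v1.1.x trainer flags (a sketch; check --help on your install):
# write a checkpoint every 1000 steps and keep all of them instead of only the latest
ns-train dn-splatter --data <your-data> --steps-per-save 1000 --save-only-latest-checkpoint False
```

With intermediate checkpoints on disk, a run that crashes mid-training can be resumed or at least inspected at the last saved step.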

maturk commented 2 weeks ago

Just to make sure, you are using nerfstudio v1.1.3 and gsplat v1.0.0?

Haven-Lau commented 2 weeks ago

Correct

# Name                    Version                   Build  Channel
nerfstudio                1.1.3                    pypi_0    pypi
gsplat                    1.0.0                    pypi_0    pypi

Haven-Lau commented 2 weeks ago

I'm running Windows; hopefully that's not the cause. I can try to spin up an Ubuntu environment at some point anyway, since I couldn't get the download scripts running on Windows (the CLI commands differ between OSes, I think).

maturk commented 2 weeks ago

Have you tried any other dataset to see if it occurs there? I am wondering if there are issues with the optimization (densification/culling) due to the depth supervision. Pictures of the scene at or near the crash would help me debug as well. Maybe try turning off the depth loss and see if the crash still happens in that scenario.

Haven-Lau commented 2 weeks ago

For my own scene I turned on --pipeline.model.use-normal-loss True --pipeline.model.use-normal-tv-loss True, and that caused it to crash at 51% (15400 steps); without the normal losses it trains to 100% without crashing. I'm not loading any normal maps.

This is what it looked like ~1000 steps before the crash (this time, with the viewer on, it crashed at 12xxx steps instead of 15xxx): [screenshots attached: rgb8, depth8, normal8]

This is one of the training inputs: [attachments: depth_00009, frame_00009]
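
One thing worth checking on custom Kinect captures is the depth maps themselves: frames with large invalid regions can destabilize a depth loss. A small sketch for sanity-checking them, assuming Azure-Kinect-style 16-bit PNGs storing depth in millimeters with 0 meaning "no reading":

```python
import numpy as np

def depth_stats(depth: np.ndarray) -> dict:
    """Basic statistics of a depth map so obviously broken frames
    (all zeros, tiny valid coverage) stand out. Assumes uint16
    millimeter depth where 0 marks an invalid pixel."""
    valid = depth > 0
    return {
        "valid_fraction": float(valid.mean()),
        "min_mm": int(depth[valid].min()) if valid.any() else 0,
        "max_mm": int(depth[valid].max()) if valid.any() else 0,
    }

def check_depth_png(path: str) -> dict:
    # Lazy import so the stats helper above works without Pillow installed.
    from PIL import Image
    return depth_stats(np.asarray(Image.open(path)))
```

Frames reporting a very low `valid_fraction` (or an implausible max range) are candidates to exclude, or at least to look at when a crash correlates with a particular part of the trajectory.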

I'll try training again with one of the mushroom dataset and report back

Haven-Lau commented 2 weeks ago

I tried the MuSHRoom honka kinect short raw dataset, processed the raw camera and depth MKVs using process_sai.py, then ran:

> ns-train dn-splatter --data data\honka_processed 
--pipeline.model.use-depth-loss True 
--pipeline.model.depth-lambda 0.2 
--pipeline.model.use-normal-loss True 
--pipeline.model.use-normal-tv-loss True 
--pipeline.model.normal-supervision depth 
normal-nerfstudio --load-normals False

This time I was able to use normal-loss and normal-tv-loss without crashing.

However, this time I saw a degradation issue similar to https://github.com/maturk/dn-splatter/issues/68: towards the end of training, a bunch of big splats were introduced and some surfaces now have holes. [screenshots attached: rgb1, rgb2] Could you see if you can reproduce this with the MuSHRoom dataset using the same steps?

Eventually I want to figure out how to process my own raw Kinect data using the same steps as demonstrated in the MuSHRoom paper, since its output seems to be very good. There seems to be quite a gap between processing raw Kinect data with the process_sai tool and using the preprocessed Kinect data provided by the MuSHRoom dataset. Or do you think it is my ns-train params?

XuqianRen commented 1 week ago

Hi @Haven-Lau, may I ask which camera poses you used for the MuSHRoom dataset? The mushroom dataparser in dn-splatter also supports Kinect sequences; the command can look like:

ns-train dn-splatter --data mushroom_sequence 
--pipeline.model.use-depth-loss True 
--pipeline.model.depth-lambda 0.2 
--pipeline.model.use-normal-loss True 
--pipeline.model.use-normal-tv-loss True 
--pipeline.model.normal-supervision depth 
mushroom --load-normals False --mode kinect