oppo-us-research / SpacetimeGaussians

[CVPR 2024] Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis
https://oppo-us-research.github.io/SpacetimeGaussians-website/

Training errors with n3d dataset #52

Closed liyw420 closed 4 months ago

liyw420 commented 4 months ago

Hello! Thanks for your great work, which has inspired me a lot! I ran into some errors when trying to train your model on the Neural 3D Dataset. The first one is "terminate called after throwing an instance of 'thrust::system::system_error' what(): CUDA free failed: cudaErrorIllegalAddress: an illegal memory access was encountered"

python train.py --quiet --eval --config configs/n3d_full/cut_roasted_beef.json --model_path model/n3d/cut_roasted_beef_full --source_path data/n3d/cut_roasted_beef/colmap_0 
Optimizing model/n3d/cut_roasted_beef_full
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  CUDA free failed: cudaErrorIllegalAddress: an illegal memory access was encountered

And the second one is "RuntimeError: numel: integer multiplication overflow"

python train.py --quiet --eval --config configs/n3d_full/cut_roasted_beef.json --model_path model/n3d/cut_roasted_beef_full --source_path data/n3d/cut_roasted_beef/colmap_0 
Optimizing model/n3d/cut_roasted_beef_full
Training progress:   0%|                                                  | 0/30000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 397, in <module>
    train(lp_extract, op_extract, pp_extract, args.save_iterations, args.debug_from, densify=args.densify, duration=10, rgbfunction=args.rgbfunction, rdpip=args.rdpip)
  File "train.py", line 134, in train
    render_pkg = render(viewpoint_cam, gaussians, pipe, background,  override_color=None,  basicfunction=rbfbasefunction, GRsetting=GRsetting, GRzer=GRzer)
  File "/home/vincent/Research/code/SpacetimeGaussians/thirdparty/gaussian_splatting/renderer/__init__.py", line 101, in train_ours_full
    cov3D_precomp = cov3D_precomp)
  File "/home/vincent/ProgramFiles/miniconda3/envs/stgs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/vincent/ProgramFiles/miniconda3/envs/stgs/lib/python3.7/site-packages/diff_gaussian_rasterization_ch9/__init__.py", line 195, in forward
    raster_settings, 
  File "/home/vincent/ProgramFiles/miniconda3/envs/stgs/lib/python3.7/site-packages/diff_gaussian_rasterization_ch9/__init__.py", line 37, in rasterize_gaussians
    raster_settings,
  File "/home/vincent/ProgramFiles/miniconda3/envs/stgs/lib/python3.7/site-packages/diff_gaussian_rasterization_ch9/__init__.py", line 79, in forward
    num_rendered, color, radii, geomBuffer, binningBuffer, imgBuffer, depth = _C.rasterize_gaussians(*args)
RuntimeError: numel: integer multiplication overflow
Training progress:   0%|                                                  | 0/30000 [00:00<?, ?it/s]

Could you give me some suggestions on how to solve these problems?

liyw420 commented 4 months ago

Another error occurs in train.py, line 147, where the code is "depthdict[viewpoint_cam.image_name] = torch.amax(depth[slectemask]).item()"

RuntimeError
amax(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
  File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 147, in train
    depthdict[viewpoint_cam.image_name] = torch.amax(depth[slectemask]).item()
  File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 397, in <module>
    train(lp_extract, op_extract, pp_extract, args.save_iterations, args.debug_from, densify=args.densify, duration=1, rgbfunction=args.rgbfunction, rdpip=args.rdpip)
RuntimeError: amax(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.

This is because the value of "depth[slectemask]" is "tensor([], device='cuda:0')"

liyw420 commented 4 months ago

The first error happens only some of the time. When it does, it occurs in thirdparty/gaussian_splatting/scene/__init__.py, line 152, where the code is "self.gaussians.create_from_pcd(scene_info.point_cloud, self.cameras_extent)". I don't know why.

lizhan17 commented 4 months ago

How many points do you have? There should be a log .txt file in your model save folder.

liyw420 commented 4 months ago

Hi! For the cut_roasted_beef dataset with the Full Model, here is the points record:

iteration at 0
start training pointsnumber 179253
iteration at 0
start training pointsnumber 179253
iteration at 0
start training pointsnumber 205167
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 6812
iteration at 0
start training pointsnumber 6812
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169

lizhan17 commented 4 months ago

Did you change the duration value? The number of points should be fixed: at least 100,000 for 50 frames on the n3d dataset.
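
A minimal sketch for checking the point counts in that log, assuming its lines keep the "start training pointsnumber N" format shown above. The helper names are hypothetical, and the threshold is simply the 100,000 figure quoted in this comment, not a value read from the repository.

```python
import re

# Threshold quoted in the thread for 50 frames on n3d (an assumption here,
# not a constant taken from the SpacetimeGaussians code).
MIN_POINTS = 100_000


def read_point_counts(log_text):
    """Return every count that follows 'start training pointsnumber'."""
    pattern = r"start training pointsnumber\s+(\d+)"
    return [int(n) for n in re.findall(pattern, log_text)]


def runs_below_threshold(counts, minimum=MIN_POINTS):
    """Return the counts that fall under the suggested minimum."""
    return [c for c in counts if c < minimum]


# A few lines copied from the log in this thread; in practice the text
# would be read from the .txt file in the model folder.
sample_log = """iteration at 0
start training pointsnumber 179253
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 6812
"""

counts = read_point_counts(sample_log)
low = runs_below_threshold(counts)
print("all runs:", counts)
print("runs under the suggested minimum:", low)
```

Counts that vary between runs, as they do in the log above, suggest the initial point cloud was regenerated with different settings (for example a different duration).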

liyw420 commented 4 months ago

Yes, I changed the duration value to 10 to reduce GPU memory usage. With a duration of 50 frames on the N3V dataset, the memory of a GeForce RTX 4090 is not enough. Do you have other suggestions? Thank you very much!

lizhan17 commented 4 months ago

Stay tuned for a memory friendly version.

liyw420 commented 4 months ago

Thank you for your suggestions!

lizhan17 commented 4 months ago

I just pushed lazy loading (stores images in CPU memory) and an option to store GT images as int8: --data_device cpu --gtisint8 1. You can choose the settings based on your device.
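
A rough back-of-the-envelope sketch of why 8-bit storage helps. The image size, frame count, and camera count below are illustrative assumptions (half of an assumed 2704x2028 source, 50 frames, 20 cameras), and the comparison assumes the alternative is 32-bit float storage. None of these numbers are read from the repository.

```python
def gt_image_bytes(width, height, frames, cameras, bytes_per_channel, channels=3):
    """Bytes needed to hold every ground-truth image in memory at once."""
    return width * height * channels * bytes_per_channel * frames * cameras


GIB = 1024 ** 3

# Illustrative assumptions only, not values from the repository.
width, height = 1352, 1014
frames, cameras = 50, 20

as_float32 = gt_image_bytes(width, height, frames, cameras, bytes_per_channel=4)
as_int8 = gt_image_bytes(width, height, frames, cameras, bytes_per_channel=1)

print(f"float32 storage: {as_float32 / GIB:.1f} GiB")
print(f"8-bit storage:   {as_int8 / GIB:.1f} GiB")
```

Under these assumptions the 8-bit copy needs a quarter of the memory of the float32 copy, whichever device holds it.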

lizhan17 commented 4 months ago

Please reopen if you still meet Training errors.

liyw420 commented 3 months ago

Hello! Sorry to bother you again. I tried your lazy loading version, but CPU memory is still not enough when loading the train/test cameras in thirdparty/gaussian_splatting/scene/__init__.py, line 99:

self.train_cameras[resolution_scale] = cameraList_from_camInfosv2(scene_info.train_cameras, resolution_scale, args)

So I tried another solution: during data preprocessing, I resized the images to 1/2 of their original size and then extracted the sparse points. This greatly reduces the GPU/CPU memory demands, but other issues appear. When running thirdparty/gaussian_splatting/scene/oursfull.py, line 192:

dist2 = torch.clamp_min(distCUDA2(torch.from_numpy(np.asarray(pcd.points)).float().cuda()), 0.0000001)

Most of the time an error occurs:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  CUDA free failed: cudaErrorIllegalAddress: an illegal memory access was encountered

Sometimes it does not. In that case, another error occurs in train.py, line 147, where the code is "depthdict[viewpoint_cam.image_name] = torch.amax(depth[slectemask]).item()"

amax(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
  File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 147, in train
    depthdict[viewpoint_cam.image_name] = torch.amax(depth[slectemask]).item()
  File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 404, in <module>
    train(lp_extract, op_extract, pp_extract, args.save_iterations, args.debug_from, densify=args.densify, duration=args.duration, rgbfunction=args.rgbfunction, rdpip=args.rdpip)
RuntimeError: amax(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.

If I set

validdepthdict[viewpoint_cam.image_name] = 15 #torch.median(depth[slectemask]).item()   
depthdict[viewpoint_cam.image_name] = 15 #torch.amax(depth[slectemask]).item() 

Then an error will occur in thirdparty/gaussian_splatting/renderer/__init__.py, line 101, where the code is

    rendered_image, radii, depth = rasterizer(
        means3D = means3D,
        means2D = means2D,
        shs = shs,
        colors_precomp = colors_precomp,
        opacities = opacity,
        scales = scales,
        rotations = rotations,
        cov3D_precomp = cov3D_precomp)

The error is

numel: integer multiplication overflow
  File "/home/vincent/Research/code/SpacetimeGaussians/thirdparty/gaussian_splatting/renderer/__init__.py", line 101, in train_ours_full
    cov3D_precomp = cov3D_precomp)
  File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 134, in train
    render_pkg = render(viewpoint_cam, gaussians, pipe, background,  override_color=None,  basicfunction=rbfbasefunction, GRsetting=GRsetting, GRzer=GRzer)
  File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 404, in <module>
    train(lp_extract, op_extract, pp_extract, args.save_iterations, args.debug_from, densify=args.densify, duration=args.duration, rgbfunction=args.rgbfunction, rdpip=args.rdpip)
RuntimeError: numel: integer multiplication overflow

Could you give some advice on solving these problems? Thank you for your time and effort!

lizhan17 commented 3 months ago

It seems to me that there is a numeric overflow issue. I searched the 3DGS repo for "numel: integer multiplication overflow", and it seems to occur with larger-resolution images.

1) Reduce the resolution by changing "resolution": "2" to "resolution": "4".

2) Or try another machine to see whether the problem persists. (Also tell me your GPU model; maybe memory is not the issue, and you just need to update to the latest NVIDIA driver and reinstall the environment.)

3) You can use COLMAP to visualize input.ply: download the COLMAP GUI (COLMAP-3.8-windows-cuda), run COLMAP.bat, then open "File->import model from..." and choose input.ply. (The colors will be black; this is just to check whether the points have a good shape or are too dense and take up all the CUDA buffer.)

liyw420 commented 3 months ago

Thank you for your patient and detailed suggestions!

1) I tried this method, but the errors occur again. When running thirdparty/gaussian_splatting/scene/oursfull.py, line 192:

dist2 = torch.clamp_min(distCUDA2(torch.from_numpy(np.asarray(pcd.points)).float().cuda()), 0.0000001)

Most of the time an error occurs:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  CUDA free failed: cudaErrorIllegalAddress: an illegal memory access was encountered

Sometimes it does not. In that case, another error occurs in train.py, line 147, where the code is "depthdict[viewpoint_cam.image_name] = torch.amax(depth[slectemask]).item()"

amax(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
  File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 147, in train
    depthdict[viewpoint_cam.image_name] = torch.amax(depth[slectemask]).item()
  File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 404, in <module>
    train(lp_extract, op_extract, pp_extract, args.save_iterations, args.debug_from, densify=args.densify, duration=args.duration, rgbfunction=args.rgbfunction, rdpip=args.rdpip)
RuntimeError: amax(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.

If I set

validdepthdict[viewpoint_cam.image_name] = 15 #torch.median(depth[slectemask]).item()   
depthdict[viewpoint_cam.image_name] = 15 #torch.amax(depth[slectemask]).item() 

Then an error will occur in thirdparty/gaussian_splatting/scene/oursfull.py, line 1182, during training: [image]

2) My GPU is an NVIDIA GeForce RTX 4090, and the NVIDIA driver version is 535.183.01. [image] I have actually deployed several other 3DGS models on my machine (original 3DGS, deformable 3DGS, 4DGS, etc.), and all of them work. Maybe I should try another machine...

3) Here is the point cloud from input.ply. [image]

Anyway, thank you again for your comments and suggestions!

lizhan17 commented 3 months ago

  1. What is your CUDA toolkit version? You can get it with "nvcc --version".
  2. The error in thirdparty/gaussian_splatting/scene/oursfull.py, line 1182, during training: that part of the code can be skipped by adding "emsstart": 30000 to the config .json file.
  3. The point cloud looks good to me. Try training your model in a terminal without VS Code windows open; I sometimes hit this kind of error. You can still debug with VS Code, but don't train for many steps in it.
  4. Try adding CUDA_VISIBLE_DEVICES=0 before the training command.

liyw420 commented 3 months ago

Sorry for the late reply!

  1. My CUDA toolkit version is 11.8. [image]
  2. Could you show me an example of how to skip this part of the code by adding "emsstart: 30000"? Thanks!
  3. Training the model in the terminal runs into the same problems.
  4. Adding CUDA_VISIBLE_DEVICES=0 before the training command hits the same problem as below:
    terminate called after throwing an instance of 'thrust::system::system_error'
    what():  CUDA free failed: cudaErrorIllegalAddress: an illegal memory access was encountered

    If not, a new error will occur:

    Optimizing model/n3d/cut_roasted_beef_full
    Training progress:   6%|█▎                    | 1720/30000 [00:09<02:51, 165.25it/s, Loss=0.5335779]Traceback (most recent call last):
    File "train.py", line 399, in <module>
    train(lp_extract, op_extract, pp_extract, args.save_iterations, args.debug_from, densify=args.densify, duration=40, rgbfunction=args.rgbfunction, rdpip=args.rdpip)
    File "train.py", line 197, in train
    loss = getloss(opt, Ll1, ssim, image, gt_image, gaussians, radii)
    File "/home/vincent/Research/code/SpacetimeGaussians/helper_train.py", line 122, in getloss
    loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))   
    File "/home/vincent/Research/code/SpacetimeGaussians/thirdparty/gaussian_splatting/utils/loss_utils.py", line 41, in ssim
    window = window.cuda(img1.get_device())
    RuntimeError: CUDA error: an illegal memory access was encountered
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Training progress:   6%|█▎                    | 1730/30000 [00:09<02:34, 182.52it/s, Loss=0.5335779]

    Thank you!
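
The PyTorch message above suggests CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so the stack trace points at the failing call. A minimal sketch of launching the same training command with both environment variables set, assuming it is run from the repository root; the helper name is hypothetical, and the launch itself is left commented out.

```python
import os
import shlex


def build_training_run(command, gpu_index=0, blocking=True):
    """Return (argv, env) for running `command` on a single visible GPU."""
    env = dict(os.environ)  # copy, so the current process is not modified
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    if blocking:
        # Synchronous kernel launches: slower, but errors surface at the
        # call that caused them instead of at a later API call.
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    return shlex.split(command), env


argv, env = build_training_run(
    "python train.py --quiet --eval "
    "--config configs/n3d_full/cut_roasted_beef.json "
    "--model_path model/n3d/cut_roasted_beef_full "
    "--source_path data/n3d/cut_roasted_beef/colmap_0"
)
print(argv)

# To actually start training (needs the repository and a GPU):
# import subprocess
# subprocess.run(argv, env=env, check=True)
```

This is equivalent to prefixing the shell command with CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=1. Training will run noticeably slower with blocking launches, so it is meant for locating the fault rather than for full runs.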

lizhan17 commented 3 months ago

Open the .json file and change it from

{ "resolution": 2, "model": "ours_full", "scaling_lr": 0.0015, "preprocesspoints": 0, "test_iteration": 25000, "duration": 50, "rdpip": "train_ours_full", "rgbfunction": "sandwich" }

to

{ "resolution": 2, "model": "ours_full", "scaling_lr": 0.0015, "preprocesspoints": 0, "test_iteration": 25000, "duration": 50, "rdpip": "train_ours_full", "rgbfunction": "sandwich", "emsstart": 30000 }
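
The same change can be made programmatically, which also guarantees the result is valid JSON (the key and value must be written as "emsstart": 30000, with the colon outside the quotes). This is a minimal sketch with a hypothetical helper name; it works on the config text shown here rather than on a file path, since the exact file to edit depends on the scene.

```python
import json


def add_config_key(config_text, key, value):
    """Parse a JSON config, set one key, and return the updated JSON text."""
    config = json.loads(config_text)
    config[key] = value
    return json.dumps(config, indent=2)


# Config contents as quoted in this thread.
original = """
{ "resolution": 2, "model": "ours_full", "scaling_lr": 0.0015,
  "preprocesspoints": 0, "test_iteration": 25000, "duration": 50,
  "rdpip": "train_ours_full", "rgbfunction": "sandwich" }
"""

updated = add_config_key(original, "emsstart", 30000)
print(updated)
```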

If other 3DGS code works, I suggest you create a copy of their environment (reinstall their environment under a new env name), then install our feature splatting code in the copied environment.

liyw420 commented 3 months ago

Ok I will try. Thank you!