Closed: liyw420 closed this issue 4 months ago
Another error happens in train.py, line 147, where the code is "depthdict[viewpoint_cam.image_name] = torch.amax(depth[slectemask]).item()"
RuntimeError
amax(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 147, in train
depthdict[viewpoint_cam.image_name] = torch.amax(depth[slectemask]).item()
File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 397, in <module>
train(lp_extract, op_extract, pp_extract, args.save_iterations, args.debug_from, densify=args.densify, duration=1, rgbfunction=args.rgbfunction, rdpip=args.rdpip)
RuntimeError: amax(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
This is because the value of "depth[slectemask]" is "tensor([], device='cuda:0')", an empty tensor, i.e. the mask selects no pixels.
The first error happens only sometimes. When it does, it occurs in thirdparty/gaussian_splatting/scene/__init__.py, line 152, where the code is "self.gaussians.create_from_pcd(scene_info.point_cloud, self.cameras_extent)". I don't know why.
How many points do you have? There should be a log .txt file in your model save folder.
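One way to avoid the crash, as a minimal sketch and not the repository's actual fix, is to check whether the mask selects anything before reducing. The helper name `masked_depth_stats` and its `fallback` parameter are hypothetical; `depth` and the mask stand in for `depth` and `slectemask` in train.py.

```python
import torch


def masked_depth_stats(depth, mask, fallback=None):
    """Return (max, median) of depth[mask], or `fallback` if the mask selects nothing."""
    selected = depth[mask]
    if selected.numel() == 0:
        # torch.amax raises on an empty tensor when no `dim` is given
        return fallback
    return torch.amax(selected).item(), torch.median(selected).item()


depth = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(masked_depth_stats(depth, depth > 1.5))   # mask selects three values
print(masked_depth_stats(depth, depth > 10.0))  # mask selects nothing
```

Such a guard only prevents the exception. An empty mask probably means the rendered depth itself is wrong, so the underlying cause still needs to be found.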
Hi! For the cut_roasted_beef dataset with the Full Model, here is the points record:
iteration at 0
start training pointsnumber 179253
iteration at 0
start training pointsnumber 179253
iteration at 0
start training pointsnumber 205167
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 6812
iteration at 0
start training pointsnumber 6812
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 68169
Did you change the duration value? The number of points should be fixed: at least 100 000 for 50 frames on the n3d dataset.
Yes, I changed the duration value to 10 to reduce GPU memory usage. With a duration of 50 frames on the N3V dataset, the memory of a GeForce RTX 4090 is not enough. Do you have other suggestions? Thank you very much!
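To compare a log against that figure quickly, a small sketch like the following could be used. It assumes the log lines have the "start training pointsnumber N" form shown above; the function names are hypothetical, and the 100 000 threshold is the value quoted for 50 frames, so it is only a rough guide for other durations.

```python
import re

# Matches the log lines shown in this thread, e.g. "start training pointsnumber 179253"
POINTS_PATTERN = re.compile(r"start training pointsnumber\s+(\d+)")


def read_point_counts(log_text):
    """Extract every recorded starting point count from the log text."""
    return [int(n) for n in POINTS_PATTERN.findall(log_text)]


def runs_below(counts, minimum=100_000):
    """Return the counts that fall below the suggested minimum."""
    return [c for c in counts if c < minimum]


sample_log = """iteration at 0
start training pointsnumber 179253
iteration at 0
start training pointsnumber 68169
iteration at 0
start training pointsnumber 6812
"""
counts = read_point_counts(sample_log)
print(counts, runs_below(counts))
```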
Stay tuned for a memory friendly version.
Thank you for your suggestions!
I just pushed lazy loading (images are stored in CPU memory) and an option to store the GT image as int8:
--data_device cpu --gtisint8 1
You can choose the settings based on your device.
Please reopen if you still encounter training errors.
Hello! Sorry to bother you again~ I retried your lazy loading version, but the CPU memory is still not enough when loading the train/test cameras in thirdparty/gaussian_splatting/scene/__init__.py, line 99:
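A rough estimate of how much memory the ground-truth images need can help when choosing between these settings. The sketch below is back-of-the-envelope arithmetic only: it assumes images would otherwise be held as 32-bit floats, it does not describe how the flags are implemented, and the camera count and resolution are illustrative placeholders.

```python
def gt_image_bytes(n_cameras, n_frames, height, width, bytes_per_value, channels=3):
    """Bytes needed to hold every ground-truth image in memory at once."""
    return n_cameras * n_frames * height * width * channels * bytes_per_value


# Illustrative numbers only; substitute your own camera count and resolution.
shape = dict(n_cameras=20, n_frames=50, height=1014, width=1352)
as_float32 = gt_image_bytes(**shape, bytes_per_value=4)
as_8bit = gt_image_bytes(**shape, bytes_per_value=1)
print(f"float32: {as_float32 / 2**30:.2f} GiB, 8-bit: {as_8bit / 2**30:.2f} GiB")
```

Under these assumptions 8-bit storage needs a quarter of the float32 footprint, and halving both image dimensions divides either figure by four again.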
self.train_cameras[resolution_scale] = cameraList_from_camInfosv2(scene_info.train_cameras, resolution_scale, args)
So I tried another solution: for data preprocessing, I resized the images to 1/2 of their original size and extracted the sparse points. This greatly reduces the GPU/CPU memory demands, but other issues appear.
1. When running thirdparty/gaussian_splatting/scene/oursfull.py, line 192:
dist2 = torch.clamp_min(distCUDA2(torch.from_numpy(np.asarray(pcd.points)).float().cuda()), 0.0000001)
Most of the time an error occurs:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): CUDA free failed: cudaErrorIllegalAddress: an illegal memory access was encountered
Sometimes not. If not, another error will occur in train.py, line 147, the code is "depthdict[viewpoint_cam.image_name] = torch.amax(depth[slectemask]).item()"
amax(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 147, in train
depthdict[viewpoint_cam.image_name] = torch.amax(depth[slectemask]).item()
File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 404, in <module>
train(lp_extract, op_extract, pp_extract, args.save_iterations, args.debug_from, densify=args.densify, duration=args.duration, rgbfunction=args.rgbfunction, rdpip=args.rdpip)
RuntimeError: amax(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
If I set
validdepthdict[viewpoint_cam.image_name] = 15 #torch.median(depth[slectemask]).item()
depthdict[viewpoint_cam.image_name] = 15 #torch.amax(depth[slectemask]).item()
Then an error will occur in thirdparty/gaussian_splatting/renderer/__init__.py, line 101, where the code is
rendered_image, radii, depth = rasterizer(
means3D = means3D,
means2D = means2D,
shs = shs,
colors_precomp = colors_precomp,
opacities = opacity,
scales = scales,
rotations = rotations,
cov3D_precomp = cov3D_precomp)
The error is
numel: integer multiplication overflow
File "/home/vincent/Research/code/SpacetimeGaussians/thirdparty/gaussian_splatting/renderer/__init__.py", line 101, in train_ours_full
cov3D_precomp = cov3D_precomp)
File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 134, in train
render_pkg = render(viewpoint_cam, gaussians, pipe, background, override_color=None, basicfunction=rbfbasefunction, GRsetting=GRsetting, GRzer=GRzer)
File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 404, in <module>
train(lp_extract, op_extract, pp_extract, args.save_iterations, args.debug_from, densify=args.densify, duration=args.duration, rgbfunction=args.rgbfunction, rdpip=args.rdpip)
RuntimeError: numel: integer multiplication overflow
Could you give some advice on solving these problems? Thank you for your time and effort!
It seems to me that there is a number overflow issue. I searched the 3DGS repo for "numel: integer multiplication overflow", and it seems to occur with larger-resolution images.
1) Reduce the resolution by changing "resolution": "2" -> "resolution": "4".
2) Or try another machine to see whether the problem still exists. (Also tell me your GPU model; maybe memory is not the issue, in which case just update to the latest NVIDIA driver and reinstall the environment.)
3) You can use COLMAP to visualize input.ply: download the COLMAP GUI release (COLMAP-3.8-windows-cuda), then click COLMAP.bat. After that, open "File->import model from..." and choose input.ply. (The colors will be black; this is just to check whether the points have a good shape or are too dense and take up all the CUDA buffer.)
Thank you for your patient and detailed suggestions!
1. I tried this method, but the errors happen again. When running thirdparty/gaussian_splatting/scene/oursfull.py, line 192:
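In addition to the visual check, the point array can be inspected numerically. This is a minimal sketch, assuming the points are already available as an (N, 3) array such as `np.asarray(pcd.points)`; the function name `summarize_points` is hypothetical and loading the .ply file is left out.

```python
import numpy as np


def summarize_points(points):
    """Report point count, rows containing NaN/inf, and the bounding box of the finite points."""
    points = np.asarray(points, dtype=np.float64)
    finite = np.isfinite(points).all(axis=1)  # True for rows without NaN/inf
    good = points[finite]
    return {
        "count": int(points.shape[0]),
        "non_finite": int((~finite).sum()),
        "bbox_min": good.min(axis=0).tolist() if good.size else None,
        "bbox_max": good.max(axis=0).tolist() if good.size else None,
    }


pts = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [np.nan, 0.0, 0.0]])
print(summarize_points(pts))
```

A very large point count, any non-finite rows, or an implausibly large bounding box would be worth looking into before training.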
dist2 = torch.clamp_min(distCUDA2(torch.from_numpy(np.asarray(pcd.points)).float().cuda()), 0.0000001)
Most of the time an error occurs:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): CUDA free failed: cudaErrorIllegalAddress: an illegal memory access was encountered
Sometimes not. If not, another error will occur in train.py, line 147, the code is "depthdict[viewpoint_cam.image_name] = torch.amax(depth[slectemask]).item()"
amax(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 147, in train
depthdict[viewpoint_cam.image_name] = torch.amax(depth[slectemask]).item()
File "/home/vincent/Research/code/SpacetimeGaussians/train.py", line 404, in <module>
train(lp_extract, op_extract, pp_extract, args.save_iterations, args.debug_from, densify=args.densify, duration=args.duration, rgbfunction=args.rgbfunction, rdpip=args.rdpip)
RuntimeError: amax(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
If I set
validdepthdict[viewpoint_cam.image_name] = 15 #torch.median(depth[slectemask]).item()
depthdict[viewpoint_cam.image_name] = 15 #torch.amax(depth[slectemask]).item()
Then an error will occur in thirdparty/gaussian_splatting/scene/oursfull.py, line 1182 during the training process:
2. My GPU is an NVIDIA GeForce RTX 4090 and the NVIDIA driver version is 535.183.01. Actually I have deployed some 3DGS models on my machine (original 3DGS, deformable 3DGS, 4DGS, etc.), and all of them work. Maybe I should try another machine...
3. Here is the point cloud from input.ply.
Anyway, thank you again for your comments and suggestions!
Sorry for my late reply!
terminate called after throwing an instance of 'thrust::system::system_error'
what(): CUDA free failed: cudaErrorIllegalAddress: an illegal memory access was encountered
If not, a new error will occur:
Optimizing model/n3d/cut_roasted_beef_full
Training progress: 6%|█▎ | 1720/30000 [00:09<02:51, 165.25it/s, Loss=0.5335779]Traceback (most recent call last):
File "train.py", line 399, in <module>
train(lp_extract, op_extract, pp_extract, args.save_iterations, args.debug_from, densify=args.densify, duration=40, rgbfunction=args.rgbfunction, rdpip=args.rdpip)
File "train.py", line 197, in train
loss = getloss(opt, Ll1, ssim, image, gt_image, gaussians, radii)
File "/home/vincent/Research/code/SpacetimeGaussians/helper_train.py", line 122, in getloss
loss = (1.0 - opt.lambda_dssim) * Ll1 + opt.lambda_dssim * (1.0 - ssim(image, gt_image))
File "/home/vincent/Research/code/SpacetimeGaussians/thirdparty/gaussian_splatting/utils/loss_utils.py", line 41, in ssim
window = window.cuda(img1.get_device())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Training progress: 6%|█▎ | 1730/30000 [00:09<02:34, 182.52it/s, Loss=0.5335779]
Thank you!
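The traceback above suggests passing CUDA_LAUNCH_BLOCKING=1 because CUDA errors are reported asynchronously, so the line it blames may not be the one that failed. One way to apply it from Python, as a sketch, is to set the variable at the very top of the entry script, before CUDA is initialized; setting it before importing torch is the safe choice.

```python
import os

# Set this before CUDA is initialized so kernel launches run synchronously
# and the reported stack trace points at the failing call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch and start training only after the variable is set
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Synchronous launches slow training down, so this is meant for debugging runs only.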
Open the .json file, which currently reads:
{ "resolution": 2, "model": "ours_full", "scaling_lr": 0.0015, "preprocesspoints": 0, "test_iteration": 25000, "duration": 50, "rdpip": "train_ours_full", "rgbfunction": "sandwich" }
and change it to:
{ "resolution": 2, "model": "ours_full", "scaling_lr": 0.0015, "preprocesspoints": 0, "test_iteration": 25000, "duration": 50, "rdpip": "train_ours_full", "rgbfunction": "sandwich", "emsstart": 30000 }
If other 3DGS code works, I suggest you create a copy of its environment (reinstall that environment under a new env name), then install our feature splatting code in the copied environment.
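The same change can be made programmatically, which avoids quoting mistakes: as valid JSON the new entry needs the key and value written separately, in the form "emsstart": 30000. The sketch below is only an illustration; the helper name `set_config_value` is hypothetical and the config string is an abbreviated stand-in for the real file.

```python
import json


def set_config_value(config_text, key, value):
    """Parse a JSON config, set one key, and return the updated JSON text."""
    config = json.loads(config_text)
    config[key] = value
    return json.dumps(config, indent=4)


# Abbreviated stand-in for the real config file contents.
original = '{"resolution": 2, "model": "ours_full", "duration": 50}'
updated = set_config_value(original, "emsstart", 30000)
print(updated)
```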
Ok I will try. Thank you!
Hello! Thanks for your great work, which inspires me a lot! I got some errors when trying to train your model on the Neural 3D Dataset. The first one is "terminate called after throwing an instance of 'thrust::system::system_error' what(): CUDA free failed: cudaErrorIllegalAddress: an illegal memory access was encountered"
And the second one is "RuntimeError: numel: integer multiplication overflow"
Could you give me some suggestions on how to solve these problems?