graphdeco-inria / hierarchical-3d-gaussians

Official implementation of the SIGGRAPH 2024 paper "A Hierarchical 3D Gaussian Representation for Real-Time Rendering of Very Large Datasets"

No grads! error #9

Open paujar opened 1 month ago

paujar commented 1 month ago

Hello,

I really appreciate the work you have been doing. This is my first time working with splats / NeRFs, and it is the first time I have come across really good instructions where everything works out of the box, almost :-)

I ran all the steps as instructed and everything went fine until training chunk 2_2, where it fails. I have no idea what to do; if you have any thoughts on what might be wrong, please help:

python scripts/full_train.py --project_dir /opt/photogrammetry/eno_dataset

creating output dir: /opt/photogrammetry/eno_dataset/output
Optimizing /opt/photogrammetry/eno_dataset/output/scaffold
Output folder: /opt/photogrammetry/eno_dataset/output/scaffold [25/07 19:09:40]
Converting point3d.bin to .ply, will happen only the first time you open the scene. [25/07 19:09:41]
Reading camera 1298/1298 [25/07 19:09:42]
0 test images [25/07 19:09:42]
1298 train images [25/07 19:09:42]
Making Training Dataset [25/07 19:09:42]
Making Test Dataset [25/07 19:09:42]
Number of points at initialisation :  576036 [25/07 19:09:43]
Training progress:   0%|                                                                                                                                                                                                                                 | 0/30000 [00:00<?, ?it/s][ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:09:43]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:09:43]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:09:43]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:09:43]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:09:43]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:09:43]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:09:43]
 [25/07 19:09:43]
Training progress: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30000/30000 [15:56<00:00, 30.77it/s, Loss=0.0058064, Size=576036, Peak memory=780005376]
[ITER 30000] Saving Gaussians [25/07 19:25:39]
Training progress: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30000/30000 [15:57<00:00, 31.33it/s, Loss=0.0058064, Size=576036, Peak memory=780005376]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:25:41]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:25:41]
 [25/07 19:25:41]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:25:41]
 [25/07 19:25:41]

Training complete. [25/07 19:25:41]
Training chunk 2_2
Optimizing /opt/photogrammetry/eno_dataset/output/trained_chunks/2_2
Output folder: /opt/photogrammetry/eno_dataset/output/trained_chunks/2_2 [25/07 19:25:42]
Converting point3d.bin to .ply, will happen only the first time you open the scene. [25/07 19:25:42]
Reading camera 107/107 [25/07 19:25:42]
0 test images [25/07 19:25:42]
107 train images [25/07 19:25:42]
Making Training Dataset [25/07 19:25:42]
Making Test Dataset [25/07 19:25:42]
Number of points at initialisation :  318530 [25/07 19:25:42]
Training progress:   0%|                                                                                                                                                                                                                                 | 0/30000 [00:00<?, ?it/s][ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:25:43]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:25:43]
 [25/07 19:25:43]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:25:43]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:25:43]
 [25/07 19:25:43]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:25:43]
[ INFO ] Encountered quite large input images (>1.6K pixels width), rescaling to 1.6K.
 If this is not desired, please explicitly specify '--resolution/-r' as 1 [25/07 19:25:43]
No grads! [25/07 19:25:43]
No grads! [25/07 19:27:25]
Traceback (most recent call last):
  File "/opt/photogrammetry/hierarchical-3d-gaussians/train_single.py", line 239, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "/opt/photogrammetry/hierarchical-3d-gaussians/train_single.py", line 128, in training
    ema_loss_for_log = 0.4 * photo_loss.item() + 0.6 * ema_loss_for_log
                             ^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Training progress:   0%|                                                                                                                                                                                                                                 | 0/30000 [01:42<?, ?it/s]
Error executing train_single: Command 'python -u train_single.py --save_iterations -1 -i ../../rectified/images -d ../../rectified/depths --scaffold_file /opt/photogrammetry/eno_dataset/output/scaffold/point_cloud/iteration_30000 --skybox_locked -s /opt/photogrammetry/eno_dataset/camera_calibration/chunks/2_2 --model_path /opt/photogrammetry/eno_dataset/output/trained_chunks/2_2 --bounds_file /opt/photogrammetry/eno_dataset/camera_calibration/chunks/2_2' returned non-zero exit status 1.
White-Mask-230 commented 1 month ago

Same error as in https://github.com/graphdeco-inria/hierarchical-3d-gaussians/issues/6. We are looking into the problem. Anything you can contribute to the investigation is welcome.
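
One way to gather more information is the step the error message itself suggests: rerunning the failing command with CUDA_LAUNCH_BLOCKING=1, so that the traceback is more likely to point at the kernel launch that actually failed. The following is only a minimal sketch of a wrapper for that. The helper names `blocking_env` and `rerun_synchronously` and the script name `rerun_blocking.py` are made up for illustration and are not part of the repository. It does not fix the error, it only makes the report more precise.

```python
import os
import subprocess
import sys


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 is the setting named in the PyTorch error message.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_synchronously(command, cwd=None):
    """Run `command` (a list of arguments) with CUDA_LAUNCH_BLOCKING=1.

    Output goes straight to the terminal; the process exit code is returned.
    """
    completed = subprocess.run(command, cwd=cwd, env=blocking_env())
    return completed.returncode


if __name__ == "__main__":
    # Hypothetical usage: pass the failing command after the script name, e.g.
    #   python rerun_blocking.py python -u train_single.py <same arguments as in the log>
    if len(sys.argv) > 1:
        sys.exit(rerun_synchronously(sys.argv[1:]))
```

Run it from the repository directory with the exact train_single.py command printed in the "Error executing train_single" line, then post the new traceback here. Training will be slower in this mode because each kernel launch is waited on.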