Question regarding loss convergence and rendering results

HarryPeverell commented 2 days ago

Hi,

First, I would like to sincerely thank you for the incredible work you’ve done on this project.

I am currently working with a dataset whose training scenes are relatively dark, so the shadows and the background are difficult to distinguish. In this scenario the loss struggles to converge. Could you suggest any optimization techniques or learning rate adjustments that might help?

Additionally, I’ve been using real_render.sh to render the model after training, but the resulting images are all completely black. I compared the initial point cloud with the final trained point cloud, and there are noticeable differences (such as the model learning some of the object's geometric features). However, despite these differences, the rendered images remain black. Here’s the command I’m using:

python render.py -m $model_root/$date/$subtask/T62_OLAT --iteration 100000 --skip_train --valid --use_nerual_phasefunc

Could you offer any insights into why this might be happening? Any advice on troubleshooting this issue would be greatly appreciated.
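One way to narrow this down is to check whether the rendered frames are exactly zero or just very dark, since a dark training set can produce renders that only look black in an image viewer. The following is a minimal sketch, assuming the render has already been loaded as a NumPy array with values in [0, 1]. The helper names `summarize_image` and `stretch_for_viewing` are hypothetical and are not part of the repository.

```python
import numpy as np


def summarize_image(img, dark_threshold=1e-3):
    """Return basic statistics for an image whose values lie in [0, 1].

    Hypothetical helper for illustration; `dark_threshold` is an assumed cutoff.
    """
    img = np.asarray(img, dtype=np.float64)
    return {
        "min": float(img.min()),
        "max": float(img.max()),
        "mean": float(img.mean()),
        # Fraction of values at or below the threshold.
        "dark_fraction": float((img <= dark_threshold).mean()),
        # True only if no value exceeds the threshold.
        "all_black": bool(img.max() <= dark_threshold),
    }


def stretch_for_viewing(img, eps=1e-8):
    """Rescale so the brightest value maps to 1.0 (for inspection only)."""
    img = np.asarray(img, dtype=np.float64)
    peak = img.max()
    return img / peak if peak > eps else img


if __name__ == "__main__":
    black = np.zeros((4, 4, 3))      # a truly black frame
    dim = np.full((4, 4, 3), 0.01)   # a very dark but non-zero frame
    print(summarize_image(black))
    print(summarize_image(dim))
```

If `all_black` is true for every output frame, the problem is more likely in the render path (for example the loaded iteration, camera poses, or light setup) than in how the images are displayed. If the frames are dim but non-zero, viewing the stretched version may already reveal structure.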

Thank you once again for your time and support!

Best regards,
Harry

RupertPaoZ commented 2 days ago

We haven't encountered the situation you mentioned. Perhaps you could provide more information.

HarryPeverell commented 2 days ago

Hi,

I wanted to provide additional context regarding the training process. Here are the key logs and details from the training run:

Training Progress:

The run completed all 100,000 iterations; the full log follows:

Reading Training Transforms [01/12 00:18:18]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 218/218 [00:00<00:00, 341.68it/s]
Reading Test Transforms [01/12 00:18:19]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55/55 [00:00<00:00, 349.27it/s]
Loading Training Cameras [01/12 00:18:19]
Loading Test Cameras [01/12 00:18:20]
Number of points at initialisation :  100000 [01/12 00:18:20]
Training progress:   0%|                                                                                                                                | 0/100000 [00:00<?, ?it/s]/home/harry/miniconda3/envs/gs3/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Training progress:   2%|█▉                                                                                                 | 2000/100000 [01:42<1:03:14, 25.83it/s, Loss=0.1641454]
[ITER 2000] Evaluating test: L1 0.04547547376291318 PSNR 18.863650478016247 [01/12 00:20:04]

[ITER 2000] Evaluating train: L1 0.05333410538733006 PSNR 17.97425537109375 [01/12 00:20:04]
Training progress:   7%|██████▉                                                                                            | 7000/100000 [06:01<1:22:42, 18.74it/s, Loss=0.2030713]
[ITER 7000] Evaluating test: L1 0.045453572103923014 PSNR 18.85882528478449 [01/12 00:24:23]

[ITER 7000] Evaluating train: L1 0.05330754481256009 PSNR 17.970086669921876 [01/12 00:24:23]

[ITER 7000] Saving Gaussians [01/12 00:24:23]

[ITER 7000] Saving Checkpoint [01/12 00:24:24]
Training progress:  10%|█████████▊                                                                                        | 10000/100000 [08:45<1:18:42, 19.06it/s, Loss=0.2051660]
[ITER 10000] Evaluating test: L1 0.045453132181005045 PSNR 18.85900629216974 [01/12 00:27:07]

[ITER 10000] Evaluating train: L1 0.05330717749893665 PSNR 17.970232200622558 [01/12 00:27:07]

[ITER 10000] Saving Gaussians [01/12 00:27:07]

[ITER 10000] Saving Checkpoint [01/12 00:27:08]
Training progress:  15%|██████████████▋                                                                                   | 15000/100000 [13:13<1:14:16, 19.08it/s, Loss=0.1965908]
[ITER 15000] Evaluating test: L1 0.04545373577963222 PSNR 18.85876476981423 [01/12 00:31:35]

[ITER 15000] Evaluating train: L1 0.05330771952867508 PSNR 17.970035362243653 [01/12 00:31:36]

[ITER 15000] Saving Gaussians [01/12 00:31:36]

[ITER 15000] Saving Checkpoint [01/12 00:31:36]
Training progress:  20%|███████████████████▌                                                                              | 20000/100000 [17:52<1:13:09, 18.23it/s, Loss=0.2041495]
[ITER 20000] Evaluating test: L1 0.04544457827102054 PSNR 18.862969849326394 [01/12 00:36:14]

[ITER 20000] Evaluating train: L1 0.05329833216965199 PSNR 17.974169921875 [01/12 00:36:15]

[ITER 20000] Saving Gaussians [01/12 00:36:15]

[ITER 20000] Saving Checkpoint [01/12 00:36:15]
Training progress:  22%|█████████████████████▌                                                                            | 22000/100000 [19:44<1:09:00, 18.84it/s, Loss=0.1860033]set ansio param requires_grad:  True [01/12 00:38:04]
Training progress:  25%|████████████████████████▌                                                                         | 25000/100000 [22:35<1:14:57, 16.68it/s, Loss=0.1805277]
[ITER 25000] Evaluating test: L1 0.04545206376774744 PSNR 18.859650161049583 [01/12 00:40:57]

[ITER 25000] Evaluating train: L1 0.0533048078417778 PSNR 17.97158432006836 [01/12 00:40:57]
Training progress:  30%|█████████████████████████████▍                                                                    | 30000/100000 [27:17<1:04:22, 18.12it/s, Loss=0.1971461]
[ITER 30000] Evaluating test: L1 0.04545263096012852 PSNR 18.85929683338512 [01/12 00:45:39]

[ITER 30000] Evaluating train: L1 0.05330632068216801 PSNR 17.97079429626465 [01/12 00:45:39]

[ITER 30000] Saving Gaussians [01/12 00:45:39]

[ITER 30000] Saving Checkpoint [01/12 00:45:40]
Training progress:  40%|████████████████████████████████████████                                                            | 40000/100000 [36:31<52:32, 19.03it/s, Loss=0.1940327]
[ITER 40000] Evaluating test: L1 0.04544841460883617 PSNR 18.86095400723544 [01/12 00:54:54]

[ITER 40000] Evaluating train: L1 0.05329939126968384 PSNR 17.973218536376955 [01/12 00:54:54]

[ITER 40000] Saving Gaussians [01/12 00:54:54]

[ITER 40000] Saving Checkpoint [01/12 00:54:54]
Training progress:  50%|██████████████████████████████████████████████████                                                  | 50000/100000 [45:21<42:52, 19.44it/s, Loss=0.2157763]
[ITER 50000] Evaluating test: L1 0.04545306045223366 PSNR 18.859062056107955 [01/12 01:03:43]

[ITER 50000] Evaluating train: L1 0.05330672413110733 PSNR 17.970347595214843 [01/12 01:03:44]

[ITER 50000] Saving Gaussians [01/12 01:03:44]

[ITER 50000] Saving Checkpoint [01/12 01:03:44]
Training progress:  60%|████████████████████████████████████████████████████████████                                        | 60000/100000 [53:55<43:23, 15.36it/s, Loss=0.2125396]
[ITER 60000] Evaluating test: L1 0.04545225162397731 PSNR 18.859415609186346 [01/12 01:12:17]

[ITER 60000] Evaluating train: L1 0.05330640301108361 PSNR 17.970785331726074 [01/12 01:12:17]

[ITER 60000] Saving Gaussians [01/12 01:12:17]

[ITER 60000] Saving Checkpoint [01/12 01:12:18]
Training progress:  70%|████████████████████████████████████████████████████████████████████▌                             | 70000/100000 [1:01:09<20:08, 24.82it/s, Loss=0.2033624]
[ITER 70000] Evaluating test: L1 0.045451567084951836 PSNR 18.859666338833893 [01/12 01:19:31]

[ITER 70000] Evaluating train: L1 0.05330493971705437 PSNR 17.97126617431641 [01/12 01:19:31]

[ITER 70000] Saving Gaussians [01/12 01:19:31]

[ITER 70000] Saving Checkpoint [01/12 01:19:31]
Training progress:  80%|██████████████████████████████████████████████████████████████████████████████▍                   | 80000/100000 [1:07:11<11:44, 28.37it/s, Loss=0.2182817]
[ITER 80000] Evaluating test: L1 0.045453571426597505 PSNR 18.858839815313164 [01/12 01:25:32]

[ITER 80000] Evaluating train: L1 0.05330711416900158 PSNR 17.970286560058593 [01/12 01:25:32]

[ITER 80000] Saving Gaussians [01/12 01:25:32]

[ITER 80000] Saving Checkpoint [01/12 01:25:33]
Training progress:  90%|████████████████████████████████████████████████████████████████████████████████████████▏         | 90000/100000 [1:12:05<04:09, 40.13it/s, Loss=0.2074079]
[ITER 90000] Evaluating test: L1 0.045453494719483636 PSNR 18.858882106434216 [01/12 01:30:26]

[ITER 90000] Evaluating train: L1 0.053307444974780085 PSNR 17.970314979553223 [01/12 01:30:26]

[ITER 90000] Saving Gaussians [01/12 01:30:26]

[ITER 90000] Saving Checkpoint [01/12 01:30:26]
Training progress: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 100000/100000 [1:16:35<00:00, 21.76it/s, Loss=0.1573329]

[ITER 100000] Evaluating test: L1 0.046480271050875835 PSNR 19.12953198172829 [01/12 01:34:56]

[ITER 100000] Evaluating train: L1 0.054898458346724514 PSNR 18.26572151184082 [01/12 01:34:56]

[ITER 100000] Saving Gaussians [01/12 01:34:56]

[ITER 100000] Saving Checkpoint [01/12 01:34:57]
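
The evaluation lines in the log above can be summarized to quantify how flat the metrics are. The sketch below is an illustrative helper, not part of the repository: it assumes the `[ITER N] Evaluating <split>: L1 <x> PSNR <y>` line format seen in this log and uses three test lines copied from it. The function names `parse_eval_lines` and `psnr_spread` are hypothetical.

```python
import re

# Matches evaluation lines in the format shown in the log above.
EVAL_RE = re.compile(r"\[ITER (\d+)\] Evaluating (test|train): L1 (\S+) PSNR (\S+)")


def parse_eval_lines(log_text):
    """Return a list of (iteration, split, l1, psnr) tuples found in the text."""
    return [
        (int(m.group(1)), m.group(2), float(m.group(3)), float(m.group(4)))
        for m in EVAL_RE.finditer(log_text)
    ]


def psnr_spread(rows, split="test"):
    """Return max minus min PSNR for one split, or None if it has no rows."""
    values = [psnr for _, s, _, psnr in rows if s == split]
    return max(values) - min(values) if values else None


# Three test-set lines copied verbatim from the log.
sample = """
[ITER 2000] Evaluating test: L1 0.04547547376291318 PSNR 18.863650478016247 [01/12 00:20:04]
[ITER 50000] Evaluating test: L1 0.04545306045223366 PSNR 18.859062056107955 [01/12 01:03:43]
[ITER 90000] Evaluating test: L1 0.045453494719483636 PSNR 18.858882106434216 [01/12 01:30:26]
"""

if __name__ == "__main__":
    rows = parse_eval_lines(sample)
    print(f"test PSNR spread: {psnr_spread(rows):.4f} dB")
```

For these three lines the spread is under 0.01 dB across 88,000 iterations, which suggests the rendered output barely changed after iteration 2000.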

Point Cloud Visualization

Below is a visualization of the point cloud at the beginning and end of the training process:

Left: Randomly initialized point cloud
Right: Trained point cloud after optimization

Apologies for not being able to share the original training dataset due to confidentiality requirements from the lab. However, I have recreated the scene and the general appearance of the data in Blender for illustration purposes.

I suspect that the following two factors might be causing issues:

Shadow and Background Color Similarity: As you can observe, the shadows and the background color are quite similar. This could cause problems for shadow splatting, since the system might struggle to distinguish shadow from background, making it hard for the training loss to converge. I wonder if this is contributing to the difficulties I’m encountering during training.

Light Source and Camera Positioning: The light source and camera positions in the dataset were estimated from the positioning information reported by the capture device. They were not refined with a sparse reconstruction method like COLMAP, so the poses may be inaccurate, which could affect the scene's geometry and alignment.

These are the two areas where I believe the issues might originate. I would greatly appreciate any feedback or debugging ideas regarding these points.

Thank you again for your help!

Best regards,
Harry

RupertPaoZ commented 2 days ago

Perhaps you could verify if the initialized point cloud is positioned correctly (i.e., within the camera's view). Minor pose errors can be alleviated through pose optimization, so they should not be the cause of the issue.
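
The visibility check suggested above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's own code: it assumes a pinhole camera with a world-to-camera rotation `R_w2c` and translation `t_w2c`, a camera that looks along +z (OpenCV/COLMAP style), and intrinsics `fx`, `fy`, `cx`, `cy` in pixels. The function name `fraction_in_view` is hypothetical.

```python
import numpy as np


def fraction_in_view(points_world, R_w2c, t_w2c, fx, fy, cx, cy,
                     width, height, near=1e-6):
    """Fraction of 3D points that project inside the image.

    Assumes a pinhole camera looking along +z; adapt the sign convention
    if the poses come from a -z-forward system.
    """
    pts = np.asarray(points_world, dtype=np.float64)
    # Transform world points into the camera frame: X_cam = R @ X_world + t.
    cam = pts @ np.asarray(R_w2c, dtype=np.float64).T + np.asarray(t_w2c, dtype=np.float64)
    z = cam[:, 2]
    in_front = z > near
    # Avoid dividing by non-positive depth; those points are rejected anyway.
    z_safe = np.where(in_front, z, 1.0)
    u = fx * cam[:, 0] / z_safe + cx
    v = fy * cam[:, 1] / z_safe + cy
    inside = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return float(inside.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Unit cube of random points centered 5 units in front of the camera.
    pts = rng.uniform(-0.5, 0.5, size=(1000, 3)) + np.array([0.0, 0.0, 5.0])
    frac = fraction_in_view(pts, np.eye(3), np.zeros(3),
                            fx=500, fy=500, cx=256, cy=256,
                            width=512, height=512)
    print(f"fraction of points in view: {frac:.2f}")
```

If the fraction is close to zero for most training cameras, the initialization volume or the pose convention is a likely suspect. Blender cameras, for instance, look along -z, so poses exported from Blender may need an axis flip before a check like this applies.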

gsrelight / gs-relight

Question regarding loss convergence and rendering results #7
