saliteta / SA-GS-CODE


Questions about the Code Replication #2

Open Liao-Jun opened 1 week ago

Liao-Jun commented 1 week ago

I used the GauUsceneV2 dataset. In the Mask Extraction step, 33 images failed to generate masks. The images are as follows:

DJI_20231218125804_0090_Zenmuse-L1-mission.JPG
DJI_20231218125141_0038_Zenmuse-L1-mission.JPG
DJI_20231218124856_0018_Zenmuse-L1-mission.JPG
DJI_20231218164839_0049_Zenmuse-L1-mission.JPG
DJI_20231218125815_0095_Zenmuse-L1-mission.JPG
DJI_20231217164423_0097_Zenmuse-L1-mission.JPG
DJI_20231218124831_0007_Zenmuse-L1-mission.JPG
DJI_20231218131349_0043_Zenmuse-L1-mission.JPG
DJI_20231218165248_0107_Zenmuse-L1-mission.JPG
DJI_20231218125807_0091_Zenmuse-L1-mission.JPG
DJI_20231218164834_0047_Zenmuse-L1-mission.JPG
DJI_20231218125810_0092_Zenmuse-L1-mission.JPG
DJI_20231218124922_0030_Zenmuse-L1-mission.JPG
DJI_20231218124919_0029_Zenmuse-L1-mission.JPG
DJI_20231218125147_0040_Zenmuse-L1-mission.JPG
DJI_20231218131404_0050_Zenmuse-L1-mission.JPG
DJI_20231218164846_0052_Zenmuse-L1-mission.JPG
DJI_20231218164843_0051_Zenmuse-L1-mission.JPG
DJI_20231218124909_0024_Zenmuse-L1-mission.JPG
DJI_20231218124914_0027_Zenmuse-L1-mission.JPG
DJI_20231218124828_0006_Zenmuse-L1-mission.JPG
DJI_20231218124910_0025_Zenmuse-L1-mission.JPG
DJI_20231218131339_0039_Zenmuse-L1-mission.JPG
DJI_20231218164525_0016_Zenmuse-L1-mission.JPG
DJI_20231218125813_0094_Zenmuse-L1-mission.JPG
DJI_20231218131027_0008_Zenmuse-L1-mission.JPG
DJI_20231218125149_0041_Zenmuse-L1-mission.JPG
DJI_20231218124836_0009_Zenmuse-L1-mission.JPG
DJI_20231218132046_0005_Zenmuse-L1-mission.JPG
DJI_20231218164841_0050_Zenmuse-L1-mission.JPG
DJI_20231218125811_0093_Zenmuse-L1-mission.JPG
DJI_20231218130141_0122_Zenmuse-L1-mission.JPG
DJI_20231218164832_0046_Zenmuse-L1-mission.JPG

Could you please share more details about the code replication? Thanks.

saliteta commented 1 week ago

If you mean that there is no mask for the ground, buildings, or vegetation in the above images, that is okay. Our algorithm does not guarantee that it will segment something from each image. What if there is no ground, vegetation, or building in a picture? Of course, there would be no mask there.

We have encountered this problem before, which is why we created a very robust training procedure. It will only constrain the points that have a mask. If there is no corresponding mask, the constraint is based only on the image.
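
In other words, the loss is conditional on mask availability. Below is a minimal sketch of that conditional structure, for illustration only: the function and variable names are not from train.py, and the real SA-GS shape constraint acts on the Gaussian points rather than on pixels as done here.

```python
import torch
import torch.nn.functional as F

def training_loss(rendered, gt_image, mask=None, lambda_shape=0.1):
    """Illustrative only: the photometric term always applies; the
    mask-driven term is added only for views that actually have a mask."""
    loss = F.l1_loss(rendered, gt_image)          # image-based constraint, always on
    if mask is not None:                          # views without a mask skip this branch
        region = mask.bool()
        if region.any():
            # Stand-in for the real shape constraint (which acts on Gaussian points):
            loss = loss + lambda_shape * F.l1_loss(rendered[:, region], gt_image[:, region])
    return loss

# Toy usage: one masked view, one unmasked view
rendered = torch.rand(3, 4, 4, requires_grad=True)
gt = torch.rand(3, 4, 4)
mask = torch.zeros(4, 4)
mask[1:3, 1:3] = 1
print(training_loss(rendered, gt, mask))       # photometric + masked term
print(training_loss(rendered, gt, mask=None))  # photometric term only
```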

Liao-Jun commented 3 days ago

Thank you for your reply. I reproduced it again and found that in the third step, the following error occurred after 3900 iterations:

Training progress: 13%|█▎ | 3900/30000 [04:34<10:39, 40.79it/s, Loss=0.6774297]
Training progress: 13%|█▎ | 3900/30000 [04:34<10:39, 40.79it/s, Loss=nan]
Training progress: 13%|█▎ | 3910/30000 [04:34<10:47, 40.31it/s, Loss=nan]
Training progress: 13%|█▎ | 3910/30000 [04:34<10:47, 40.31it/s, Loss=nan]
Training progress: 13%|█▎ | 3920/30000 [04:34<10:16, 42.31it/s, Loss=nan]
shape loss out of bound [29/06 16:05:11]
Traceback (most recent call last):
  File "train.py", line 277, in <module>
    args.mask_path
  File "train.py", line 131, in training
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/SA-GS/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ubuntu/anaconda3/envs/SA-GS/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
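
For context on that RuntimeError: PyTorch raises it whenever a backward pass reaches a part of the graph whose saved tensors were already freed by an earlier backward(), for example when a tensor from a previous iteration is reused without .detach(). The snippet below is only a generic minimal reproduction of the message, not a claim about what train.py does internally.

```python
import torch

w = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

carried = torch.zeros(1)                     # accidentally kept across iterations
try:
    for step in range(2):
        loss = (w - 1.0) ** 2
        carried = carried + loss             # bug: keeps the previous step's graph alive
        total = loss + 0.1 * carried
        opt.zero_grad()
        total.backward()                     # step 2 re-enters step 1's freed graph
        opt.step()
except RuntimeError as e:
    print(e)  # "Trying to backward through the graph a second time ..."

# The fix in this toy case is to accumulate a detached value instead:
# carried = carried + loss.detach()
```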

For the 33 images that were not masked in the first step, I manually created masks with all values set to -1.
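
Side note on how such a placeholder could be produced: the sketch below writes a signed .npy array, since -1 does not fit in a standard 8-bit PNG. The exact mask format and location SA-GS expects are assumptions here; the file name is taken from the list above.

```python
import numpy as np
from PIL import Image

# Hypothetical placeholder-mask generation; adapt to the mask format SA-GS actually reads.
img = Image.open("DJI_20231218124828_0006_Zenmuse-L1-mission.JPG")
w, h = img.size

placeholder = np.full((h, w), -1, dtype=np.int16)   # -1 everywhere = "no class"
np.save("DJI_20231218124828_0006_Zenmuse-L1-mission.npy", placeholder)
```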

saliteta commented 3 days ago

Did you use multiple GPUs to train one scene?

saliteta commented 3 days ago

I mean multiple GPUs to train one scene?

Liao-Jun commented 3 days ago

No, no, only one.

Liao-Jun commented 3 days ago

RTX4090

saliteta commented 3 days ago

Which scene? Did you try to run one scene multiple times? Does it always stop at iteration 3900?

As far as I remember, I only encountered this once, when I was using LFLS. A simple retry should get you on your way.

saliteta commented 3 days ago

If it does not work, please let me know; we will test and solve that problem. One of my co-workers is trying to reproduce the error.

Liao-Jun commented 3 days ago

Training progress: 10%|█ | 3090/30000 [04:23<13:37, 32.93it/s, Loss=0.5972668]
Training progress: 10%|█ | 3100/30000 [04:23<13:30, 33.17it/s, Loss=0.5972668]
shape loss out of bound [29/06 16:46:07]
Traceback (most recent call last):
  File "train.py", line 277, in <module>
    args.mask_path
  File "train.py", line 131, in training
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/SA-GS/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ubuntu/anaconda3/envs/SA-GS/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: Function _RasterizeGaussiansBackward returned an invalid gradient at index 2 - got [0, 0, 3] but expected shape compatible with [0, 16, 3]

I re-executed and the first error occurred again.

Liao-Jun commented 3 days ago

The scene is the lower campus from the GauUsceneV2 dataset.

Liao-Jun commented 3 days ago

This error also occurred when I reproduced the pipeline last week. At that time, I simply deleted the 33 pictures that had the problem in the first step.

victor-cilay commented 1 day ago

Hi! Thanks for following our work. We tested in a different conda environment and reproduced the same error.

However, with PyTorch 2.3.0 and CUDA 11.8, the training process runs flawlessly. We have released our entire virtual environment, so please try it out with the latest version of our code!
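
If it helps anyone hitting the same issue: a quick way to confirm the local environment matches is to print the versions from inside Python. The expected values in the comments below are just the ones mentioned above.

```python
import torch

print(torch.__version__)           # expected: 2.3.0
print(torch.version.cuda)          # expected: 11.8
print(torch.cuda.is_available())   # should be True on the RTX 4090
```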