lkeab / gaussian-grouping

[ECCV'2024] Gaussian Grouping for open-world Anything reconstruction, segmentation and editing.
https://arxiv.org/abs/2312.00732
Apache License 2.0

Training and fine-tuning are too slow #12

Open · ThePassedWind opened this issue 10 months ago

ThePassedWind commented 10 months ago

Training the bear model (original loss + 2D ID loss + 3D reg loss) takes at least 10 hours, and fine-tuning the inpainted model takes at least 4 hours. Device: a single RTX 3090 GPU.

ymq2017 commented 10 months ago

Hi, this training time seems a bit strange.

I have verified that our inpainting code on the bear dataset at full resolution takes about 40 minutes. I use one A6000 GPU, which has a speed similar to a 3090. Here is the log: inpaint_bear.log. In our paper we use an A100 and fewer iterations, so the fine-tuning time is even shorter.

Could you check your environment, e.g. the torch and CUDA versions, or the CPU usage?
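A quick way to check those versions is a short snippet like the one below (a minimal sketch; the exact values you should see depend on your install):

```python
import torch

# Report the PyTorch build, the CUDA toolkit it was compiled against,
# and the GPU that training will actually run on.
print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```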

nviolante25 commented 10 months ago

Hi, thanks for open-sourcing your project!

To add to this issue, I trained Gaussian Grouping on the LERF figurines dataset and it took 1h20 (the same scene on original 3DGS takes about 20 min). I'm using CUDA 12.1 with PyTorch 2.1.2. Also, rendering the 299 images after training with the 30k checkpoint takes about 15 min, while the original 3DGS takes 4 min.

ymq2017 commented 10 months ago

> Hi, thanks for open-sourcing your project!
>
> To add to this issue, I trained Gaussian Grouping on the LERF figurines dataset and it took 1h20 (the same scene on original 3DGS takes about 20 min). I'm using CUDA 12.1 with PyTorch 2.1.2. Also, rendering the 299 images after training with the 30k checkpoint takes about 15 min, while the original 3DGS takes 4 min.

Hi, thanks for your information! We use CUDA 11.3 with PyTorch 1.12.1. There are also some ways to reduce the training and rendering time.

For training, you can increase the interval at which the 3D reg loss is applied to reduce the time.
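The idea is to evaluate the (relatively expensive) 3D regularization only every few iterations. A minimal sketch, where the names `reg3d_loss_fn`, `reg3d_interval`, and `lambda_3d` are illustrative rather than the repository's exact API:

```python
import torch

def training_step_loss(iteration: int,
                       base_loss: torch.Tensor,
                       reg3d_loss_fn,
                       lambda_3d: float = 2.0,
                       reg3d_interval: int = 5) -> torch.Tensor:
    """Add the 3D regularization term only every `reg3d_interval` iterations.

    `reg3d_loss_fn` stands in for the KNN-based 3D consistency loss over the
    Gaussians' identity encodings; the names and defaults here are assumptions.
    """
    loss = base_loss
    if iteration % reg3d_interval == 0:
        # The KNN neighbor search over all Gaussians is the costly part, so
        # evaluating it less often is where the speed-up comes from.
        loss = loss + lambda_3d * reg3d_loss_fn()
    return loss
```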

For rendering, you can turn off some of the visualization code, such as the feature PCA, to reduce the time. The original Gaussian Splatting renderer has neither mask-prediction visualization nor feature-PCA visualization, which is why its total rendering time is much shorter.
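For example, the per-frame PCA projection of the identity features can be gated behind a flag. A hypothetical sketch (the `render_fn` output keys are assumptions, not the repository's actual interface):

```python
import torch

def render_views(views, render_fn, visualize_pca: bool = False):
    """Render views, optionally skipping the feature-PCA visualization."""
    outputs = []
    for view in views:
        result = render_fn(view)           # assumed to return {"rgb": ..., "id_features": ...}
        frame = {"rgb": result["rgb"]}
        if visualize_pca:
            feats = result["id_features"]  # (C, H, W) identity-feature map
            flat = feats.flatten(1).T      # (H*W, C)
            # Project onto the top-3 principal components just for display.
            _, _, v = torch.pca_lowrank(flat, q=3)
            frame["pca"] = (flat @ v).T.reshape(3, *feats.shape[1:])
        outputs.append(frame)
    return outputs
```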

ThePassedWind commented 10 months ago

> > To add to this issue, I trained Gaussian Grouping on the LERF figurines dataset and it took 1h20 (the same scene on original 3DGS takes about 20 min). I'm using CUDA 12.1 with PyTorch 2.1.2. Also, rendering the 299 images after training with the 30k checkpoint takes about 15 min, while the original 3DGS takes 4 min.
>
> Hi, thanks for your information! We use CUDA 11.3 with PyTorch 1.12.1. There are also some ways to reduce the training and rendering time.
>
> For training, you can increase the interval at which the 3D reg loss is applied to reduce the time.
>
> For rendering, you can turn off some of the visualization code, such as the feature PCA, to reduce the time. The original Gaussian Splatting renderer has neither mask-prediction visualization nor feature-PCA visualization, which is why its total rendering time is much shorter.

Thanks! I will try again!

ThePassedWind commented 10 months ago

Maybe it's a problem with my GPU being too old. I also have another question about the training and fine-tuning stages: why is more and more GPU memory allocated as training goes on? Similarly, epoch N is much faster than epoch M (M > N).

ymq2017 commented 10 months ago

> Maybe it's a problem with my GPU being too old. I also have another question about the training and fine-tuning stages: why is more and more GPU memory allocated as training goes on? Similarly, epoch N is much faster than epoch M (M > N).

Yes, this is normal. Gaussian Splatting uses adaptive density control, so the number of points grows during training. Over 30K iterations the point count usually increases by 1-2 orders of magnitude.
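If you want to watch this growth directly, something like the snippet below works (a minimal sketch; the `gaussians.get_xyz` attribute follows the common 3DGS convention and is an assumption here):

```python
import torch

def log_growth(iteration: int, gaussians, interval: int = 1000) -> None:
    """Print the Gaussian count and allocated GPU memory every `interval` iterations.

    Densification (cloning/splitting) minus pruning is what makes both numbers
    grow over the course of training.
    """
    if iteration % interval != 0:
        return
    num_points = gaussians.get_xyz.shape[0]       # assumed (N, 3) tensor of Gaussian centers
    mem_gib = torch.cuda.memory_allocated() / 1024 ** 3
    print(f"[iter {iteration:6d}] points: {num_points:,}  allocated: {mem_gib:.2f} GiB")
```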

Neal2020GitHub commented 10 months ago

> Hi, thanks for open-sourcing your project!
>
> To add to this issue, I trained Gaussian Grouping on the LERF figurines dataset and it took 1h20 (the same scene on original 3DGS takes about 20 min). I'm using CUDA 12.1 with PyTorch 2.1.2. Also, rendering the 299 images after training with the 30k checkpoint takes about 15 min, while the original 3DGS takes 4 min.

Hi @nviolante25, may I ask how you solved the OOM issue? I use a single 4090 and it raises an out-of-memory error. Thank you!

nviolante25 commented 10 months ago

Hi @Neal2020GitHub, I haven't had OOM issues; it worked out of the box in my case.

ThePassedWind commented 10 months ago

> > To add to this issue, I trained Gaussian Grouping on the LERF figurines dataset and it took 1h20 (the same scene on original 3DGS takes about 20 min). I'm using CUDA 12.1 with PyTorch 2.1.2. Also, rendering the 299 images after training with the 30k checkpoint takes about 15 min, while the original 3DGS takes 4 min.
>
> Hi @nviolante25, may I ask how you solved the OOM issue? I use a single 4090 and it raises an out-of-memory error. Thank you!

Hi @Neal2020GitHub, I faced the same problem and may have some ideas about it. Could we discuss it? I'm an MSc student at PolyU (planning to apply for a PhD this year). You can add me on WeChat: hp1499931489.

haoyuhsu commented 9 months ago

The 3090 might not be fully compatible with CUDA 11.3. I've also tested on a 3090 and it works like a charm, so I would suggest installing a more recent CUDA version (11.7 or 11.8).

MisEty commented 9 months ago

> > To add to this issue, I trained Gaussian Grouping on the LERF figurines dataset and it took 1h20 (the same scene on original 3DGS takes about 20 min). I'm using CUDA 12.1 with PyTorch 2.1.2. Also, rendering the 299 images after training with the 30k checkpoint takes about 15 min, while the original 3DGS takes 4 min.
>
> Hi @nviolante25, may I ask how you solved the OOM issue? I use a single 4090 and it raises an out-of-memory error. Thank you!

I managed to finish training on a single 4090 with the following tricks (see the sketch after this list):

  1. Raw 3DGS loads all images onto the GPU when the dataloader is initialized; I load the images into host (CPU) memory instead. You can change `data_device` to `cpu` in `scene/cameras.py`.
  2. Call `torch.cuda.empty_cache()` in each iteration to release cached CUDA memory.
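A minimal sketch of both tricks; the `CameraSketch` class below only illustrates the idea and is not the repository's actual `scene/cameras.py` class:

```python
import torch

# Trick 1: keep the full-resolution image tensor on the CPU and move only the
# frame currently being optimized onto the GPU (data_device="cpu").
class CameraSketch:
    def __init__(self, image: torch.Tensor, data_device: str = "cpu"):
        self.original_image = image.clamp(0.0, 1.0).to(data_device)

    def image_on_gpu(self) -> torch.Tensor:
        # Transfer one frame per iteration instead of holding the whole dataset on the GPU.
        return self.original_image.to("cuda", non_blocking=True)

# Trick 2: release cached blocks at the end of every iteration. This does not
# shrink the memory PyTorch actually needs, but it hands unused cached memory
# back to the driver, which can help when fragmentation triggers OOM.
def end_of_iteration_cleanup() -> None:
    torch.cuda.empty_cache()
```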