gsgen3d / gsgen

[CVPR 2024] Text-to-3D using Gaussian Splatting
https://arxiv.org/abs/2309.16585
MIT License

W.R.T. training speed #16

Open KelestZ opened 10 months ago

KelestZ commented 10 months ago

Hi, thanks for the great effort on gsgen. I'm wondering how the speed of the reimplemented GS compares with the original GS implementation. I've noticed that the concurrent works are quite fast to train.

For example, DreamGaussian takes about 2 minutes, since it uses 500 training iterations for GS in the 1st stage; scaled to 15k steps, that would be roughly 0.2 h. GaussianDreamer takes about 20 minutes to train. In contrast, gsgen takes 2 h to train for 15k steps. So I'm wondering what causes the speed difference. For example, what percentage of time in gsgen is spent on rendering, on optimization, etc.?
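
One rough way to get such a breakdown is to wrap the renderer and the guidance step in CUDA-event timers. A minimal sketch (not gsgen's actual code; `render_step` and `sds_step` are hypothetical stand-ins for the real calls):

```python
import torch

def timed(fn, *args, **kwargs):
    # time a GPU-side call with CUDA events (returns milliseconds)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    out = fn(*args, **kwargs)
    end.record()
    torch.cuda.synchronize()
    return out, start.elapsed_time(end)

# inside the training loop (names are placeholders):
# images, t_render = timed(render_step, gaussians, cameras)
# loss, t_guidance = timed(sds_step, images, prompt_embeddings)
# print(f"render: {t_render:.1f} ms  guidance/SDS: {t_guidance:.1f} ms")
```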

mdarhdarz commented 10 months ago

I think DreamGaussian focuses on image-to-3D with Zero123 SDS, and its selling point is simply speed (the text-to-3D quality in their paper is quite low...), while gsgen focuses on text-to-3D with both Stable Diffusion and a 3D point cloud diffusion model. Stable Diffusion is slower than Zero123 due to its larger resolution (SDXL is much, much slower). One of gsgen's selling points is that the 3D diffusion SDS mitigates the Janus problem, but it needs more time. Another difference is that the second stage of DreamGaussian uses an alternative to SDS, which may converge faster. As for GaussianDreamer, maybe it uses a more complex initialization so that fewer SDS steps are required (1200 steps take 25 min).

KelestZ commented 10 months ago

Thanks for the discussion. Two follow-up questions: (1) Shap-E initialization is also incorporated in gsgen. Is it a better initialization now? (2) How much slower is the front-to-back gradient calculation compared with the original implementation, which works in back-to-front order?

mdarhdarz commented 10 months ago

First, my apologies to GaussianDreamer, since I don't fully understand their advantage. Using Shap-E initialization is natural; previous works like 3DFuse and Points-to-3D also use it. For the second question: if you are comparing the SDS loss in generation with the L2 loss in reconstruction, SDS is slower because of the additional UNet forward pass and the additional VAE encoder in the LDM. If you are comparing the SDS loss with an L2 loss between the noise and the predicted noise, SDS is faster because no gradient flows through the UNet. DreamGaussian and gsgen both use an SDS loss, but Stable Diffusion is bigger than Zero123, and gsgen uses an additional diffusion model.
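
To make the "no gradient flow through the UNet" point concrete, here is a minimal SDS sketch assuming a diffusers-style Stable Diffusion API (my own sketch, not gsgen's actual implementation): the noise prediction runs under `torch.no_grad()`, so the only large network in the backward pass is the VAE encoder, and the gradient reaching the rendered image is just the weighted residual between predicted and sampled noise.

```python
import torch
import torch.nn.functional as F

def sds_loss(unet, vae, scheduler, rendered_rgb, text_emb, t):
    # encode the rendered image; gradients reach the renderer only through here
    latents = vae.encode(rendered_rgb).latent_dist.sample() * 0.18215
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)
    with torch.no_grad():  # key point: no backward pass through the UNet
        eps_pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    w = (1.0 - scheduler.alphas_cumprod.to(latents.device)[t]).view(-1, 1, 1, 1)
    grad = w * (eps_pred - noise)
    # reparameterized MSE whose gradient w.r.t. `latents` equals `grad`
    target = (latents - grad).detach()
    return 0.5 * F.mse_loss(latents, target, reduction="sum")
```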

KelestZ commented 10 months ago

For the 2nd question, I'm asking about the CUDA implementation of rasterization. GS: [screenshot of the original 3DGS rasterizer] versus GSGEN: [screenshot of the gsgen rasterizer]
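
For context, the two compositing orders being compared look roughly like this (illustrative Python, not the actual CUDA kernels of either repo). Both expand to the same weighted sum, so the question is only about how each order organizes the backward pass and the per-Gaussian transmittance it needs:

```python
import torch

def composite_front_to_back(colors, alphas):
    """colors: (N, 3), alphas: (N,), Gaussians sorted near -> far."""
    T = 1.0                              # accumulated transmittance
    pixel = torch.zeros(3)
    for c, a in zip(colors, alphas):
        pixel = pixel + T * a * c
        T = T * (1.0 - a)                # front-to-back allows early termination once T is tiny
    return pixel

def composite_back_to_front(colors, alphas):
    """Same pixel color via the classic 'over' operator, walking far -> near."""
    pixel = torch.zeros(3)
    for c, a in zip(reversed(list(colors)), reversed(list(alphas))):
        pixel = a * c + (1.0 - a) * pixel
    return pixel
```
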
mdarhdarz commented 10 months ago

Oh, I know little about that.

heheyas commented 10 months ago

Hi KelestZ,

Sorry for the late response; I've been occupied with other tasks in the past few weeks. Your concern about the speed of our Gaussian splatting implementation is valid. In fact, our implementation is slightly slower than the official one, but this isn't the primary reason for the longer training time. The most significant difference lies in the batch size: our approach defaults to a batch size of 8, and in most cases a batch size greater than 2 works well. We haven't seen speed as an advantage of our method from the beginning, so we haven't emphasized this aspect much. I apologize for any inconvenience. As for the renderer, I have noticed some important differences in the CUDA implementations and will address these findings in the near future.

Best regards, heheyas
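
A back-of-envelope check of the batch-size explanation, assuming per-step cost scales roughly linearly with batch size (my own numbers, not measurements from the repo):

```python
# reported above: ~2 h for 15k steps at the default batch size of 8
hours_at_bs8, bs_default, bs_small = 2.0, 8, 2
est_hours = hours_at_bs8 * bs_small / bs_default
print(f"estimated: ~{est_hours:.1f} h for 15k steps at batch size {bs_small}")  # ~0.5 h
```

That would put gsgen much closer to the 0.2 h and 20 min figures quoted at the top of the thread.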