umarkhalidAI opened 8 months ago
Text-to-4D is certainly important to explore, but for now it is tricky to control the motion with text, since SVD does not support text conditioning. We are trying other video diffusion models and plan to add text-to-4D results in a future revision of the technical report.
My understanding is that SVD can take text input. But the AYG pipeline is similar, in that they first generate a 3D object from text and then employ a video diffusion model. Did you try their approach (they simply generate an object based on the prompt)?
> My understanding is that SVD can take text input.
Could you direct me to where a text-conditioned model can be accessed? SVD on Hugging Face does not appear to support text conditioning (as stated in its limitations section).
> where they first generate 3D object based on text and then employ video diffusion model.
A text-to-image-to-4D pipeline is of course doable and worth trying. We don't have those results yet but may add them soon. However, AYG is more capable than this pipeline, since its 3D motion can also be controlled by text.
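For clarity, the staged pipeline being discussed can be sketched structurally. This is a minimal illustration of the stage ordering only; all stage names and placeholder implementations are hypothetical, and real stages would wrap a text-to-image diffusion model, a DreamGaussian-style image-to-3D step, and a video-diffusion-driven 4D optimization.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    """One step of the pipeline: a name plus a callable transforming the artifact."""
    name: str
    run: Callable[[object], object]


def build_pipeline(stages: List[Stage]) -> Callable[[object], object]:
    """Compose the stages in order into a single callable."""
    def run(x):
        for stage in stages:
            x = stage.run(x)
        return x
    return run


# Placeholder stages that just tag the artifact they would produce.
stages = [
    Stage("text_to_image", lambda prompt: f"image({prompt})"),
    Stage("image_to_3d",   lambda img:    f"gaussians({img})"),
    Stage("3d_to_4d",      lambda obj:    f"dynamic({obj})"),
]

pipeline = build_pipeline(stages)
result = pipeline("a corgi running")
# result == "dynamic(gaussians(image(a corgi running)))"
```

The point of the composition order is the one under discussion: the text prompt only conditions the first (text-to-image) stage, so later motion is driven by the image/video models rather than by text.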
Well, I haven't fully explored their model, but I remember from reading the paper that they do mention text-to-video results (Figure 1, row 1). The abstract also indicates it. https://arxiv.org/pdf/2311.15127.pdf
Have you experimented with text-to-4D, since the original DreamGaussian supports both image- and text-based 3D generation? Can you discuss your results with text-based 4D generation?