umarkhalidAI opened 8 months ago
Text-to-4D is certainly important to explore, but for now it is tricky to control the motion with text, since SVD does not support text conditioning. We are trying other video diffusion models and plan to add text-to-4D results in a future revision of the technical report.
My understanding is that SVD can take text input. But the AYG pipeline is similar, in that they first generate a 3D object from text and then employ a video diffusion model. Did you try their approach (they simply generate an object based on the prompt)?
> My understanding is that SVD can take text input.
Could you direct me to where a text-conditioned model can be accessed? SVD on Hugging Face does not appear to support text conditioning (as stated in its limitations section).
> where they first generate 3D object based on text and then employ video diffusion model.
A text-to-image-to-4D pipeline is of course doable and worth trying. We don't have those results yet but may add them soon. However, AYG is more capable than this pipeline, since its 3D motion can also be controlled by text.
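For clarity, the staged pipeline being discussed can be sketched structurally. This is a minimal illustration of the stage ordering only; all stage names and placeholder implementations are hypothetical, and real stages would wrap a text-to-image diffusion model, a DreamGaussian-style image-to-3D step, and a video-diffusion-driven 4D optimization.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    """One step of the pipeline: a name plus a callable transforming the artifact."""
    name: str
    run: Callable[[object], object]


def build_pipeline(stages: List[Stage]) -> Callable[[object], object]:
    """Compose the stages in order into a single callable."""
    def run(x):
        for stage in stages:
            x = stage.run(x)
        return x
    return run


# Placeholder stages that just tag the artifact they would produce.
stages = [
    Stage("text_to_image", lambda prompt: f"image({prompt})"),
    Stage("image_to_3d",   lambda img:    f"gaussians({img})"),
    Stage("3d_to_4d",      lambda obj:    f"dynamic({obj})"),
]

pipeline = build_pipeline(stages)
result = pipeline("a corgi running")
# result == "dynamic(gaussians(image(a corgi running)))"
```

The point of the composition order is the one under discussion: the text prompt only conditions the first (text-to-image) stage, so later motion is driven by the image/video models rather than by text.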
Well, I haven't fully explored their model, but I remember from reading the paper that they do mention text-to-video results (Figure 1, row 1). The abstract also indicates it. https://arxiv.org/pdf/2311.15127.pdf
Have you experimented with text-to-4D, since the original DreamGaussian supports both image- and text-based 3D generation? Can you discuss your results with text-based 4D generation?