Open hitsz-zuoqi opened 6 months ago
This is precisely the problem I am facing at the moment. If we want to do text2video, the existence of image_latents is quite peculiar. I've tried changing the conv_in of the UNet to 4 channels, but so far I haven't succeeded in training: the model can't generate normal videos (everything is a hazy expanse...).
If anyone has any suggestions, feel free to share them here, and I will give them a try.
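For reference, here is a minimal sketch of the kind of conv_in surgery described above. It assumes a diffusers-style spatio-temporal UNet whose conv_in maps 8 latent channels (4 noise + 4 image latents) to a first feature dimension of 320; the 320 and the weight-initialization strategy are illustrative assumptions, not the thread's confirmed setup:

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained conv_in of the spatio-temporal UNet.
# In the img2video model it maps 8 input channels to 320 features (assumed).
old_conv_in = nn.Conv2d(8, 320, kernel_size=3, padding=1)

# Build a 4-channel replacement and initialize it from the pretrained weights,
# so the noise pathway starts from the pretrained filters rather than random init.
new_conv_in = nn.Conv2d(4, 320, kernel_size=3, padding=1)
with torch.no_grad():
    # Option A: keep only the filters for the first 4 (noise) channels.
    new_conv_in.weight.copy_(old_conv_in.weight[:, :4])
    new_conv_in.bias.copy_(old_conv_in.bias)
    # Option B (alternative): also fold in the image-latent filters, e.g.
    # new_conv_in.weight.copy_(old_conv_in.weight[:, :4] + old_conv_in.weight[:, 4:])

# unet.conv_in = new_conv_in  # and remember to update unet.config.in_channels = 4

x = torch.randn(1, 4, 64, 64)      # a 4-channel noise latent per frame
out = new_conv_in(x)
print(out.shape)                   # torch.Size([1, 320, 64, 64])
```

Whether the copied weights help or whether this is effectively training from scratch (as reported below) likely depends on how much the rest of the UNet relies on the image-conditioning pathway.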
Yes, I observed the same phenomenon with my modification. Some results of my finetuning on Objaverse look like:
Prompt: "a desk"
Prompt: "a sofa"
At the beginning of training, the sampling results are:
Prompt: "a desk"
Prompt: "a sofa"
Judging from the training performance, I think changing the conv_in of the UNet to 4 channels is nearly equivalent to training from scratch for my task.
The first two videos look very good. How did you do that?
It looks like it's working well, may I ask how many steps this was trained for?
It seems the text2video and img2video models have different latent spaces. By the way, what model are you finetuning on the Objaverse dataset? It looks like it works.
When I tried to figure out how to adapt the framework for text2video synthesis, I found that the SpatialTemporalUNet has 8 input channels, as shown in this line:
Then I checked the pipeline inference and found that the denoising input is actually a concatenation of the noise and the image latents:
My question is: how do we obtain the image_latents if we only use text as input when training a text2video model? Have you made any recent progress on text2video?
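To make the question concrete, here is a sketch of the concatenation referred to above, under the assumption (common to SVD-style img2video pipelines) that the 4-channel noisy latent for each frame is concatenated channel-wise with the 4-channel VAE encoding of the conditioning image, yielding the 8-channel UNet input; the shapes below are illustrative, not taken from the repository:

```python
import torch

# Illustrative shapes: batch of 1, 14 frames, 4 latent channels, 8x8 latent grid.
batch, frames, channels, h, w = 1, 14, 4, 8, 8

noisy_latents = torch.randn(batch, frames, channels, h, w)

# img2video: the conditioning image is VAE-encoded once and repeated per frame.
image_latents = torch.randn(batch, 1, channels, h, w).expand(-1, frames, -1, -1, -1)

# The UNet input concatenates both along the channel axis -> 8 channels,
# which is exactly why conv_in expects 8 input channels.
unet_input = torch.cat([noisy_latents, image_latents], dim=2)
print(unet_input.shape)  # torch.Size([1, 14, 8, 8, 8])
```

For pure text2video training there is no conditioning image to encode, which is the crux of the question: the 4 image-latent channels simply have no natural source, so either they must be filled with some placeholder or the conv_in must be reduced to 4 channels as attempted above.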