pixeli99 / SVD_Xtend

Stable Video Diffusion Training Code and Extensions.

Questions on text2video? #25

Open hitsz-zuoqi opened 6 months ago

hitsz-zuoqi commented 6 months ago

When trying to figure out how to adapt the framework for text2video synthesis, I found that the SpatialTemporalUNet has 8 input channels, as shown in this line:


```python
@register_to_config
def __init__(
    self,
    sample_size: Optional[int] = None,
    in_channels: int = 8,
    out_channels: int = 4,
    down_block_types: Tuple[str] = (
```
Then I checked the pipeline inference code and found that the denoising input is actually a concatenation of the noise latents and the image latents:


```python
# Concatenate image_latents over the channels dimension
latent_model_input = torch.cat([latent_model_input, image_latents], dim=2)
```
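
For context, a minimal shape sketch of what that concatenation does (the tensor names and sizes here are illustrative, not from the repo; SVD latents follow a [batch, frames, channels, height, width] layout):

```python
import torch

batch, frames, height, width = 1, 14, 72, 128
noisy_latents = torch.randn(batch, frames, 4, height, width)  # latents being denoised
image_latents = torch.randn(batch, frames, 4, height, width)  # VAE-encoded conditioning frame, repeated per frame

# dim=2 is the channel axis, so 4 + 4 = 8 channels, matching in_channels=8 above
unet_input = torch.cat([noisy_latents, image_latents], dim=2)
print(unet_input.shape)  # torch.Size([1, 14, 8, 72, 128])
```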

My question is: how do we obtain the image_latents if we only use text as input when training a text2video model? Have you made any recent progress on text2video?

pixeli99 commented 6 months ago

This is precisely the problem I am facing at the moment. If we want to do text2video, the existence of image_latents is quite peculiar. I've tried changing the conv_in of the UNet to 4 channels, but so far I haven't succeeded in training it this way; the model can't generate normal videos (everything is a hazy expanse...). If anyone has any suggestions, feel free to share them here, and I will give them a try.
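
For reference, a minimal sketch of that conv_in surgery, assuming the diffusers UNetSpatioTemporalConditionModel and the public img2vid checkpoint (keeping the first 4 channels of the pretrained kernel, i.e. the noise-latent half, is an assumption on my part, not something this thread settles):

```python
import torch
from diffusers import UNetSpatioTemporalConditionModel

unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", subfolder="unet"
)

old_conv = unet.conv_in  # Conv2d(8, 320, kernel_size=(3, 3), padding=(1, 1))
new_conv = torch.nn.Conv2d(
    4,
    old_conv.out_channels,
    kernel_size=old_conv.kernel_size,
    stride=old_conv.stride,
    padding=old_conv.padding,
)
with torch.no_grad():
    # Reuse the pretrained weights for the first 4 input channels
    # (assumed here to be the ones that saw the noisy latents).
    new_conv.weight.copy_(old_conv.weight[:, :4])
    new_conv.bias.copy_(old_conv.bias)

unet.conv_in = new_conv
unet.register_to_config(in_channels=4)  # keep the config consistent with the new layer
```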

hitsz-zuoqi commented 6 months ago

> This is precisely the problem I am facing at the moment. If we want to do text2video, the existence of image_latents is quite peculiar. I've tried changing the conv_in of the UNet to 4 channels, but so far I haven't succeeded in training it this way; the model can't generate normal videos (everything is a hazy expanse...). If anyone has any suggestions, feel free to share them here, and I will give them a try.

Yes, I observed the same phenomenon with my modification. Some results of my finetuning on Objaverse look like this:

Prompt: "a desk"
[image: step_23500_val_img_0_a-desk]

Prompt: "a sofa"
[image: step_23500_val_img_0_a-sofa]

At the beginning of training, the sampling results were:

Prompt: "a desk"
[image: step_1_val_img_0_a-desk]
Prompt: "a sofa"
[image: step_1_val_img_0_a-sofa]

Judging from the training performance, I think changing the conv_in of the UNet to 4 channels is nearly equivalent to training from scratch for my task.

liiiiiiiiil commented 6 months ago

> > This is precisely the problem I am facing at the moment. If we want to do text2video, the existence of image_latents is quite peculiar. I've tried changing the conv_in of the UNet to 4 channels, but so far I haven't succeeded in training it this way; the model can't generate normal videos (everything is a hazy expanse...). If anyone has any suggestions, feel free to share them here, and I will give them a try.
>
> Yes, I observed the same phenomenon with my modification. Some results of my finetuning on Objaverse look like this:
>
> Prompt: "a desk"
> [image: step_23500_val_img_0_a-desk]
>
> Prompt: "a sofa"
> [image: step_23500_val_img_0_a-sofa]
>
> At the beginning of training, the sampling results were:
>
> Prompt: "a desk"
> [image: step_1_val_img_0_a-desk]
> Prompt: "a sofa"
> [image: step_1_val_img_0_a-sofa]
>
> Judging from the training performance, I think changing the conv_in of the UNet to 4 channels is nearly equivalent to training from scratch for my task.

The first two videos look very good. How did you do that?

pixeli99 commented 6 months ago

It looks like it's working well. May I ask how many steps this was trained for?

CallMeFrozenBanana commented 4 months ago

> > This is precisely the problem I am facing at the moment. If we want to do text2video, the existence of image_latents is quite peculiar. I've tried changing the conv_in of the UNet to 4 channels, but so far I haven't succeeded in training it this way; the model can't generate normal videos (everything is a hazy expanse...). If anyone has any suggestions, feel free to share them here, and I will give them a try.
>
> Yes, I observed the same phenomenon with my modification. Some results of my finetuning on Objaverse look like this:
>
> Prompt: "a desk"
> [image: step_23500_val_img_0_a-desk]
>
> Prompt: "a sofa"
> [image: step_23500_val_img_0_a-sofa]
>
> At the beginning of training, the sampling results were:
>
> Prompt: "a desk"
> [image: step_1_val_img_0_a-desk]
> Prompt: "a sofa"
> [image: step_1_val_img_0_a-sofa]
>
> Judging from the training performance, I think changing the conv_in of the UNet to 4 channels is nearly equivalent to training from scratch for my task.

It seems the latent space of the text2video model differs from that of the img2video model. By the way, what model are you finetuning on the Objaverse dataset? It looks like it works...?