pixeli99 / SVD_Xtend

Stable Video Diffusion Training Code and Extensions.

Do you have any advice on text2video? #11

Closed: Len-Li closed this issue 9 months ago

Len-Li commented 9 months ago

Hi,

Thanks for open-sourcing this great project.

I am curious about how to implement a text2video version of SVD. Given an input image and a prompt, how would I generate a video? Can I simply replace the encoder_hidden_states with the text embedding and finetune SVD?

Thanks!

pixeli99 commented 9 months ago

I agree with you. In the original SVD code, the encoder_hidden_states are the image embeddings produced by the CLIP image encoder, so replacing them with text embeddings seems like a reasonable thing to do. In short, I believe it's feasible to freeze the weights of the UNet and the text encoder and simply insert an additional MLP layer, training only that layer.
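
A minimal sketch of the suggestion above, in case it helps: a small trainable MLP that projects text-encoder token embeddings to the width the UNet's cross-attention expects, with everything else frozen. The class name TextToSVDAdapter, the freeze helper, and the dimensions (768 for the text encoder, 1024 for the UNet cross-attention, 2048 hidden) are assumptions for illustration, not taken from this repo; check them against the actual model configs before using this.

```python
import torch
import torch.nn as nn


class TextToSVDAdapter(nn.Module):
    """Hypothetical adapter: maps text embeddings to the UNet cross-attention width."""

    def __init__(self, text_dim: int = 768, cross_attention_dim: int = 1024,
                 hidden_dim: int = 2048):
        super().__init__()
        # Dimensions are assumed defaults; read the real ones from the model configs.
        self.mlp = nn.Sequential(
            nn.LayerNorm(text_dim),
            nn.Linear(text_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, cross_attention_dim),
        )

    def forward(self, text_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, tokens, text_dim) -> (batch, tokens, cross_attention_dim)
        return self.mlp(text_embeddings)


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients and switch to eval mode (intended for the UNet and text encoder)."""
    module.requires_grad_(False)
    module.eval()
    return module


# Stand-in for a frozen pretrained module; in practice this would be the
# loaded UNet and text encoder rather than a single linear layer.
frozen_stub = freeze(nn.Linear(768, 768))

adapter = TextToSVDAdapter()
# Only the adapter's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# Dummy text embeddings shaped like a 77-token prompt batch (assumed shape).
dummy_text_embeddings = torch.randn(2, 77, 768)
encoder_hidden_states = adapter(dummy_text_embeddings)
```

The resulting encoder_hidden_states would then be passed to the frozen UNet in place of the image embeddings. Note that the text embeddings carry one vector per token, so the sequence length may differ from what the image embedding path provides; cross-attention generally accepts a variable number of context tokens, but this is worth verifying against the training script.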

Len-Li commented 9 months ago

Thanks! I will have a try.