Closed Len-Li closed 9 months ago
I agree with you. In the original SVD code, the encoder_hidden_states are image embeddings produced by the CLIP image encoder, so replacing them with text embeddings doesn't seem like a strange thing to do.
In short, I believe it's feasible to freeze the weights of the UNet and the text encoder, insert an additional MLP layer, and train only that layer.
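To make the idea concrete, here is a minimal PyTorch sketch of the "freeze everything, train only an MLP" setup. It is not code from this repo. The name TextToCondMLP, the dimensions (768 for the text encoder, 1024 for the UNet conditioning), the 77-token length, and the tiny stand-in modules used in place of the real text encoder and UNet are all assumptions for illustration; check the actual configs before relying on them.

```python
import torch
import torch.nn as nn


class TextToCondMLP(nn.Module):
    """Hypothetical adapter: maps text-encoder token embeddings to the
    width the UNet expects for encoder_hidden_states."""

    def __init__(self, text_dim=768, cond_dim=1024, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, cond_dim),
        )

    def forward(self, text_embeds):
        # (batch, tokens, text_dim) -> (batch, tokens, cond_dim)
        return self.net(text_embeds)


def freeze(module):
    """Disable gradients for every parameter and switch to eval mode."""
    for p in module.parameters():
        p.requires_grad_(False)
    return module.eval()


# Stand-ins so the sketch runs without downloading weights (assumptions):
# an embedding table plays the text encoder, and a single linear layer
# plays one cross-attention projection inside the UNet.
text_encoder = freeze(nn.Embedding(1000, 768))
unet_cross_attn = freeze(nn.Linear(1024, 320))

adapter = TextToCondMLP()
# Only the adapter's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

token_ids = torch.randint(0, 1000, (2, 77))
with torch.no_grad():
    text_embeds = text_encoder(token_ids)          # (2, 77, 768)

encoder_hidden_states = adapter(text_embeds)       # (2, 77, 1024)
out = unet_cross_attn(encoder_hidden_states)       # frozen consumer

loss = out.pow(2).mean()  # placeholder loss, not the diffusion objective
loss.backward()           # gradients reach the adapter only
optimizer.step()
```

In a real run the stand-ins would be replaced by the pretrained text encoder and the SVD UNet, and the placeholder loss by the usual denoising loss. The point of the sketch is only that gradients pass through the frozen modules and update the MLP alone.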
Thanks! I will have a try.
Hi,
Thanks for open-sourcing this great project.
I am curious about how to implement a text2video version of SVD. Given an input image and a prompt, how can I generate a video? Can I simply replace the encoder_hidden_states with the text embedding and finetune SVD? Thanks!