sihyun-yu / PVDM

Official PyTorch implementation of Video Probabilistic Diffusion Models in Projected Latent Space (CVPR 2023).
https://sihyun.me/PVDM
MIT License

Some questions about changing this work to a text-to-video generation work #24

Closed xiefan233 closed 11 months ago

xiefan233 commented 1 year ago

Sorry to bother you. My current project is text-to-video generation, and I am making some modifications based on your open-source code.

Upon reviewing your code, I noticed that you randomly sample 32 frames from the video and divide them into two halves: the first 16 frames are used as the condition to generate the last 16 frames. I would like to ask, have you ever experimented with text-to-video generation? If so, how did you modify the code?
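
For concreteness, this is roughly the split I mean (a minimal sketch; the tensor layout and variable names here are placeholders, not taken from the repo):

```python
import torch

# Hypothetical clip of 32 sampled frames, shaped (B, C, T, H, W) for illustration.
clip = torch.randn(4, 3, 32, 64, 64)

# First 16 frames serve as the condition; the model generates the last 16.
cond_frames = clip[:, :, :16]    # conditioning half
target_frames = clip[:, :, 16:]  # half to be predicted
```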

I've modified some of your code so far, but it doesn't work after training: no matter what text I input, the generated videos are almost identical, and the training loss does not converge and is difficult to decrease.

Details of my modification: the text is encoded with BERT and injected into the UNet through cross-attention, and the loss computation is changed from the original `(loss, t), loss_dict = criterion(z.float(), c.float())` to `(loss, t), loss_dict = criterion(z.float(), encoded_texts.float())`.
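
Roughly, the text-encoding path looks like this (a minimal sketch under my assumptions; `criterion` and `z` refer to the loss and the projected video latent from your code, everything else here is my own naming):

```python
import torch
from transformers import BertTokenizer, BertModel

# Hypothetical text-encoding path; BERT is kept frozen here, though
# fine-tuning it is another option.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")
text_encoder.eval()

prompts = ["a dog running on the beach"]
tokens = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    # (B, L, 768) token embeddings that the UNet attends to via cross-attention
    encoded_texts = text_encoder(**tokens).last_hidden_state

# The loss call is then changed to pass the text features as the condition:
# (loss, t), loss_dict = criterion(z.float(), encoded_texts.float())
```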

I'm a beginner, so any answer from you would help me a lot. Thanks!

sihyun-yu commented 11 months ago

Hi, I haven't tried it, but it should be an interesting direction to explore! Hope to hear about any positive results on the text-to-video problem :)