sihyun-yu / PVDM

Official PyTorch implementation of Video Probabilistic Diffusion Models in Projected Latent Space (CVPR 2023).
https://sihyun.me/PVDM
MIT License

Some questions about changing this work to a text-to-video generation work #24

Closed xiefan233 closed 11 months ago

xiefan233 commented 1 year ago

Sorry to bother you. My current project is text-to-video generation, and I am making some modifications based on your open-source code.

Upon reviewing your code, I noticed that you randomly sample 32 frames from the video and divide them into two halves: the first 16 frames are used as the condition to generate the last 16 frames. I would like to ask, have you ever experimented with text-to-video generation? If so, how did you modify the code?
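
For concreteness, this is roughly the split I mean (a minimal sketch; the tensor layout and variable names here are placeholders, not taken from the repo):

```python
import torch

# Hypothetical clip of 32 sampled frames, shaped (B, C, T, H, W) for illustration.
clip = torch.randn(4, 3, 32, 64, 64)

# First 16 frames serve as the condition; the model generates the last 16.
cond_frames = clip[:, :, :16]    # conditioning half
target_frames = clip[:, :, 16:]  # half to be predicted
```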

I've modified some of your code so far, but it doesn't work after training: no matter what text I input, the generated videos are almost identical, and the training loss does not converge and is difficult to decrease.

Details of my modification: the text is encoded with BERT and injected into the UNet through cross-attention, and the loss computation is changed from the original `(loss, t), loss_dict = criterion(z.float(), c.float())` to `(loss, t), loss_dict = criterion(z.float(), encoded_texts.float())`.
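
Roughly, the text-encoding path looks like this (a minimal sketch under my assumptions; `criterion` and `z` refer to the loss and the projected video latent from your code, everything else here is my own naming):

```python
import torch
from transformers import BertTokenizer, BertModel

# Hypothetical text-encoding path; BERT is kept frozen here, though
# fine-tuning it is another option.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")
text_encoder.eval()

prompts = ["a dog running on the beach"]
tokens = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    # (B, L, 768) token embeddings that the UNet attends to via cross-attention
    encoded_texts = text_encoder(**tokens).last_hidden_state

# The loss call is then changed to pass the text features as the condition:
# (loss, t), loss_dict = criterion(z.float(), encoded_texts.float())
```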

I'm a beginner, so any answer from you would help me a lot. Thanks!

sihyun-yu commented 11 months ago

Hi, I haven't tried it, but it should be an interesting direction to explore! Hope to hear about any positive results on the text-to-video problem :)