showlab / Show-1

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
https://showlab.github.io/Show-1/

confused about the temporal attention #33

Open Worromots opened 6 months ago

Worromots commented 6 months ago

The shape of sample is annotated as (batch, num_frames, channel, height, width), so sample.shape[2] should be the number of channels. But the code sets num_frames = sample.shape[2]. Is there a problem here?

Then num_frames is used to reshape sample.

Looking forward to your reply!

junhaozhang98 commented 6 months ago

Hi, the sample's shape should be (B, C, F, H, W), so sample.shape[2] is the number of frames. The annotation is wrong.
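
To make the axis order concrete, here is a minimal NumPy sketch of how a (B, C, F, H, W) tensor is typically rearranged so that attention runs along the frame axis. It is not the Show-1 code: the helper names `to_temporal_tokens` and `from_temporal_tokens` are hypothetical, and the (B*H*W, F, C) token layout is an assumption based on a common pattern for temporal attention.

```python
import numpy as np

def to_temporal_tokens(sample):
    """Rearrange (B, C, F, H, W) -> (B*H*W, F, C): one frame sequence per pixel."""
    batch, channels, num_frames, height, width = sample.shape
    # Reorder axes to (B, H, W, F, C) before flattening the spatial axes.
    tokens = sample.transpose(0, 3, 4, 2, 1)
    return tokens.reshape(batch * height * width, num_frames, channels)

def from_temporal_tokens(tokens, batch, height, width):
    """Inverse of to_temporal_tokens: (B*H*W, F, C) -> (B, C, F, H, W)."""
    _, num_frames, channels = tokens.shape
    out = tokens.reshape(batch, height, width, num_frames, channels)
    return out.transpose(0, 4, 3, 1, 2)

# Toy tensor with B=2, C=4, F=8, H=3, W=5.
sample = np.arange(2 * 4 * 8 * 3 * 5, dtype=np.float32).reshape(2, 4, 8, 3, 5)
num_frames = sample.shape[2]  # frame axis under the (B, C, F, H, W) layout
tokens = to_temporal_tokens(sample)
restored = from_temporal_tokens(tokens, batch=2, height=3, width=5)
```

With this layout, `tokens[i]` holds the sequence of all frames at one spatial location, which is the sequence a temporal attention layer attends over. If the tensor really were (B, F, C, H, W), `sample.shape[2]` would return the channel count and the reshape would silently mix channels with frames.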

Worromots commented 6 months ago

THX! I noticed that there are no files for calculating metrics (IS, FVD) in the code. How should I calculate these metrics? Thank you!

zhangjiewu commented 5 months ago

You might refer to the following code for implementations of IS and FVD:

IS: https://github.com/google-research/magvit/blob/main/videogvt/train_lib/inception_score.py

FVD: https://github.com/google-research/magvit/blob/main/videogvt/train_lib/frechet_distance.py
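
For orientation, the math behind both metrics can be sketched in a few lines of NumPy. This is not the MAGVIT implementation linked above, and it does not include feature extraction. It assumes you already have per-video feature vectors from a pretrained video network (commonly I3D for FVD) and per-video class probabilities from a pretrained classifier (for IS). The function names are hypothetical.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr(sqrt(cov_a @ cov_b)) equals the sum of square roots of the product's
    # eigenvalues, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

def inception_score(probs, eps=1e-12):
    """exp of the mean KL(p(y|x) || p(y)) over an (N, K) array of probabilities."""
    marginal = probs.mean(axis=0, keepdims=True)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Synthetic stand-ins for real features, for illustration only.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
fake = real + np.array([3.0, 0.0, 0.0, 0.0])  # same covariance, shifted mean
fd = frechet_distance(real, fake)
```

Because `fake` only shifts the mean by 3 along one axis, the covariance terms cancel and `fd` should come out close to 3**2 = 9. For reported numbers, use the reference implementations and the same feature extractor as the papers you compare against, since both metrics are sensitive to those choices.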