Conditioning on image + text embedding

ChintanTrivedi commented 2 years ago

I'm looking for pointers on modifying the conditioning code below so that it conditions on an image as well as on text.

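import torch

# `diffusion` is assumed to be the GaussianDiffusion instance built as in the README example
videos = torch.randn(2, 3, 5, 32, 32) # video (batch, channels, frames, height, width)
text = torch.randn(2, 64)             # assume output of BERT-large has dimension of 64
loss = diffusion(videos, cond = text)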

So far I have been trying to condition on a CLIP image embedding concatenated with the text embedding:

videos = torch.randn(2, 3, 5, 32, 32) # video (batch, channels, frames, height, width)
image_emb = torch.randn(2, 512)       # image (batch, CLIP ViT-B/32 latent representation)
text_emb = torch.randn(2, 64)         # assume output of BERT-large has dimension of 64

# combine the image and text inputs into a single condition for the video diffusion model
# resulting shape is (2, 576), so the model's cond_dim has to be 576 rather than 64
cond_emb = torch.cat((image_emb, text_emb), dim = 1)

loss = diffusion(videos, cond = cond_emb)
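
The plain concatenation above has width 512 + 64 = 576, and the 512-dimensional image part dominates it. One possible variation, sketched here as an assumption and not as something the repository provides, is to project each embedding to a shared width before concatenating. `CondFusion` and its parameter names are hypothetical.

```python
import torch
from torch import nn

class CondFusion(nn.Module):
    """Hypothetical helper: projects image and text embeddings to a shared
    width and concatenates them into a single conditioning vector."""

    def __init__(self, image_dim=512, text_dim=64, proj_dim=64):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, proj_dim)
        self.text_proj = nn.Linear(text_dim, proj_dim)
        self.out_dim = 2 * proj_dim  # width the diffusion model has to accept

    def forward(self, image_emb, text_emb):
        # each projection has shape (batch, proj_dim); concatenate on the feature axis
        return torch.cat((self.image_proj(image_emb), self.text_proj(text_emb)), dim=1)

image_emb = torch.randn(2, 512)  # stand-in for a CLIP image embedding
text_emb = torch.randn(2, 64)    # stand-in for the text embedding

fusion = CondFusion()
cond_emb = fusion(image_emb, text_emb)  # shape (2, 128)
```

With this sketch the denoising model would be built with `cond_dim = fusion.out_dim`, and the two projection layers would presumably be trained together with the diffusion loss.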

However, is there a better way to condition on images in pixel space rather than on latent representations? That might also make it possible to use the model autoregressively, feeding the last frame of one diffusion sample in as the condition for the next sample.

PS: Thanks Phil for the quick implementation of an interesting paper that doesn't have official code out yet!

zkx06111 commented 2 years ago

I think you can try concatenating the image directly to the video frames along the channel dim. That is what SR3 (a paper using image diffusion for image super-resolution) did.
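
A minimal sketch of what channel-wise concatenation could look like for video, assuming the conditioning image is simply repeated for every frame. The tensor names and shapes are illustrative, and only plain tensor operations are shown, not the repository's API.

```python
import torch

videos = torch.randn(2, 3, 5, 32, 32)  # (batch, channels, frames, height, width)
image = torch.randn(2, 3, 32, 32)      # conditioning image (batch, channels, height, width)

# add a frame axis, then repeat the image once per video frame (no copy is made by expand)
num_frames = videos.shape[2]
image_per_frame = image.unsqueeze(2).expand(-1, -1, num_frames, -1, -1)

# stack along the channel axis: 3 video channels followed by 3 image channels
model_input = torch.cat((videos, image_per_frame), dim=1)  # shape (2, 6, 5, 32, 32)
```

The denoising network would then have to accept 6 input channels while still predicting 3. Whether the repository's Unet3D supports that without changes is not established in this thread.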

ChintanTrivedi commented 2 years ago

Thanks @zkx06111, I checked it out, and that makes a lot of sense. Shouldn't it be along the frames dim instead of the channel dim, since this is a video conditioned on an image rather than an image conditioned on an image?

If the noise has shape (32, 3, 10, 128, 128) and the image condition has shape (32, 3, 128, 128), then the concatenated input would have shape (32, 3, 11, 128, 128), with the image prepended as an extra frame before the first noise frame.
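
The frame-wise variant can be sketched as follows, together with the autoregressive use mentioned earlier in the thread. `prepend_frame` is a hypothetical helper, and the batch and spatial sizes are smaller than in the comment above to keep the example light.

```python
import torch

def prepend_frame(video, image):
    """Hypothetical helper: prepend a conditioning image as an extra first frame.

    video: (batch, channels, frames, height, width)
    image: (batch, channels, height, width)
    """
    return torch.cat((image.unsqueeze(2), video), dim=2)

noise = torch.randn(4, 3, 10, 32, 32)
image = torch.randn(4, 3, 32, 32)
model_input = prepend_frame(noise, image)  # shape (4, 3, 11, 32, 32)

# autoregressive use: the last frame of one sample conditions the next one
previous_sample = torch.randn(4, 3, 10, 32, 32)  # stand-in for a generated clip
last_frame = previous_sample[:, :, -1]           # shape (4, 3, 32, 32)
next_input = prepend_frame(torch.randn_like(previous_sample), last_frame)
```

In a real training loop the conditioning frame would presumably be kept noise-free and excluded from the loss. That detail is not covered in this thread.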

oxjohanndiep commented 2 years ago

@ChintanTrivedi Did you have any success with that?

chpk commented 1 year ago

How do you condition on a custom input (image/GIF + text)? The model should be loaded from the milestones/checkpoints already saved in the "./results/" folder.

Thank you.

lucidrains / video-diffusion-pytorch

Conditioning on image + text embedding #7