Open ChintanTrivedi opened 2 years ago
I think you can try concatenating the image directly to the video frames in the channel dim. That was what SR3 (a paper using image diffusion for image super-resolution) did.
Thanks @zkx06111, I checked it out, and that makes a lot of sense. Shouldn't it be along the frames dim instead of the channel dim, since this is video conditioned on an image, not image conditioned on an image? If the noise is (32,3,10,128,128) and the image condition is (32,3,128,128), then the concatenated input would be (32,3,11,128,128), where the image is prepended as an extra frame before the first frame of the noise.
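To make the two options concrete, here is a minimal shape-level sketch in NumPy, not code from this repo. The helper names `concat_along_frames` and `concat_along_channels` are made up for illustration, and the tensors are shrunk (batch 2, 16x16 frames) to keep it cheap. The same logic should carry over to `torch.cat` with `dim` in place of `axis`.

```python
import numpy as np

def concat_along_frames(noise, image):
    """Prepend the image as an extra frame: (B,C,F,H,W) -> (B,C,F+1,H,W)."""
    image_as_frame = image[:, :, None]  # (B, C, 1, H, W)
    return np.concatenate([image_as_frame, noise], axis=2)

def concat_along_channels(noise, image):
    """Repeat the image on every frame and stack on channels: (B,C,F,H,W) -> (B,2C,F,H,W)."""
    num_frames = noise.shape[2]
    image_per_frame = np.repeat(image[:, :, None], num_frames, axis=2)
    return np.concatenate([noise, image_per_frame], axis=1)

rng = np.random.default_rng(0)
noise = rng.standard_normal((2, 3, 10, 16, 16)).astype(np.float32)  # (B, C, F, H, W)
image = rng.standard_normal((2, 3, 16, 16)).astype(np.float32)      # (B, C, H, W)

frames_cat = concat_along_frames(noise, image)
channels_cat = concat_along_channels(noise, image)
print(frames_cat.shape)    # (2, 3, 11, 16, 16)
print(channels_cat.shape)  # (2, 6, 10, 16, 16)
```

With the frames variant the network sees one more frame than it has to denoise, so the conditioning frame would presumably need to be kept clean and excluded from the loss. With the channels variant the frame count is unchanged, but the first convolution has to accept twice as many input channels.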
@ChintanTrivedi Did you have any success with that?
How do you condition on a custom input (image/GIF + text)? The model should be loaded from the milestones/checkpoints already saved in the "./results/" folder.
Thank you.
Looking for pointers to get started on modifying the conditioning code below to include conditioning on an image along with text.
So far I am trying to condition on CLIP embeddings.
However, is there a better way to condition on images in pixel space rather than on latent representations? That might also make it possible to use the model autoregressively, with the last frame of one diffusion sample serving as the input condition for the next sample.
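The autoregressive idea could look roughly like the sketch below. It only shows the chaining logic: `sample_clip` is a hypothetical stand-in for an image-conditioned sampler (it returns the condition plus a little noise per frame), not anything this repo provides, and `rollout` is likewise an illustrative name.

```python
import numpy as np

def sample_clip(cond_image, num_frames, rng):
    """Hypothetical stand-in for an image-conditioned sampler.

    Takes a (B, C, H, W) condition and returns a (B, C, F, H, W) clip.
    A real sampler would run the reverse diffusion process here.
    """
    b, c, h, w = cond_image.shape
    drift = 0.01 * rng.standard_normal((b, c, num_frames, h, w)).astype(np.float32)
    return cond_image[:, :, None] + drift

def rollout(first_image, num_clips, frames_per_clip, rng):
    """Chain clips: the last frame of each clip conditions the next one."""
    cond = first_image
    clips = []
    for _ in range(num_clips):
        clip = sample_clip(cond, frames_per_clip, rng)
        clips.append(clip)
        cond = clip[:, :, -1]  # (B, C, H, W): last frame becomes the new condition
    return np.concatenate(clips, axis=2)  # join clips along the frames axis

rng = np.random.default_rng(0)
first_image = np.zeros((1, 3, 8, 8), dtype=np.float32)
video = rollout(first_image, num_clips=3, frames_per_clip=10, rng=rng)
print(video.shape)  # (1, 3, 30, 8, 8)
```

Because each clip only sees a single pixel-space frame from the previous one, errors can accumulate across clips. Conditioning on several trailing frames instead of one would be a natural variation.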
PS: Thanks Phil for the quick implementation of an interesting paper that doesn't have official code out yet!