Hi, thanks for the great work.
Although this is not explored in the paper, the conditions are aligned, so in principle it should be possible to perform an img2img task: encode an image and decode the latent into a similar image (at least in some aspects)?
However, simple experiments show that this fails.
I was expecting it to behave like DALLE-2's UNCLIP decoder. Any idea why this does not work? Are you applying some special attention mask so that one modality does not attend to inputs of the same modality?