Open Josh00-Lu opened 2 years ago
If I am not mistaken, the two versions you mentioned only differ by the output of the $Affine$ layer. The first version only multiplies. The second version also adds. In fact, we have tried both. They performed similarly. We didn't see the need for the "addition" term, $c$, hence we dropped it.
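Roughly, the two versions can be sketched like this (a simplified illustration with made-up shapes, not the exact code in the repo; in the actual model this modulation sits inside the UNet blocks together with the timestep embedding):

```python
import torch
import torch.nn as nn

class ScaleOnly(nn.Module):
    """Version 1: Affine(z_sem) only multiplies the feature map."""
    def __init__(self, z_dim: int, channels: int):
        super().__init__()
        self.affine = nn.Linear(z_dim, channels)

    def forward(self, h: torch.Tensor, z_sem: torch.Tensor) -> torch.Tensor:
        s = self.affine(z_sem)[:, :, None, None]  # (B, C, 1, 1), broadcasts over H, W
        return s * h

class ScaleShift(nn.Module):
    """Version 2: Affine(z_sem) multiplies by s and also adds a per-channel shift c."""
    def __init__(self, z_dim: int, channels: int):
        super().__init__()
        self.affine = nn.Linear(z_dim, 2 * channels)

    def forward(self, h: torch.Tensor, z_sem: torch.Tensor) -> torch.Tensor:
        s, c = self.affine(z_sem).chunk(2, dim=1)
        return s[:, :, None, None] * h + c[:, :, None, None]

# Tiny usage check with made-up shapes.
h, z = torch.randn(2, 256, 32, 32), torch.randn(2, 512)
out1 = ScaleOnly(512, 256)(h, z)    # scale-only conditioning
out2 = ScaleShift(512, 256)(h, z)   # scale-and-shift conditioning
```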
Thanks a lot! But since you use an encoder to get $z_\text{sem}$, does $z_\text{sem}$ depend on the size of the input image (e.g., would the size of $z_\text{sem}$ differ between 512×512 and 256×256 inputs)? If so, what operation do you use to fix the size of $\text{Affine}(z_\text{sem})$ so that the affine scale has a fixed size?
Besides, I wonder: if only an affine layer is used to embed the latent feature, is it possible to reconstruct the original image?
And one more question: is the semantic encoder fixed during the training stage of the conditional diffusion model (not the latent DDIM)? Can I use a pre-trained encoder (e.g. VGG) and keep it frozen during training?
Thanks a lot for your reply!
There are 3 questions:
- Does the size of $z_\text{sem}$ have anything to do with the shape of the image? Answer: No, they are independent. $z_\text{sem}$ is always 512 in the paper. You can change this, independent of the size of the image. You may wonder how a fixed-size $z_\text{sem}$ is applicable to hidden layers with different channel counts. There is a unique affine layer, each with a different output size, for each hidden layer (see the sketch after this list).
- Is the latent feature, $z_\text{sem}$, enough to reconstruct the original image? Answer: No, the limited size of $z_\text{sem}$ is not enough to faithfully reconstruct the original image. You also need the high bandwidth of the noise maps to get the last mile of faithfulness.
- Can you use a pretrained and fixed semantic encoder? Answer: You definitely can! Note that the DiffAE in the paper is trained end-to-end: the encoder and the decoder are trained at the same time, and nothing is frozen during training.
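Here is a minimal sketch of what "a unique affine layer per hidden layer" means in practice (illustrative only; the block widths below are made up):

```python
import torch
import torch.nn as nn

# One Linear ("Affine") layer per hidden block, each mapping the same 512-d z_sem
# to that block's channel count, so a fixed-size code can modulate every layer.
z_dim = 512
block_channels = [128, 256, 512]  # hypothetical UNet block widths
affines = nn.ModuleList(nn.Linear(z_dim, c) for c in block_channels)

z_sem = torch.randn(4, z_dim)
feats = [torch.randn(4, c, 64 // 2**i, 64 // 2**i) for i, c in enumerate(block_channels)]
modulated = [h * aff(z_sem)[:, :, None, None] for h, aff in zip(feats, affines)]
print([m.shape for m in modulated])  # each output matches its block's feature shape
```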
What would you get with a pretrained encoder? Answer: I can only offer guesses here. I would presume the results depend on the implicit bias of the pretrained encoder itself. One way to think about it is to ask what counts as "close" in the encoder of choice. For example, face identity encoders may pay close attention to the smallest details that could identify a person. Those details would be encoded in the latent code, and images rendered from that latent code would closely preserve face identity while varying background, pose, and composition, which such an encoder regards as unimportant.
Wow! I'm so happy to get such a detailed explanation and reply! Awesome work, awesome author! Thanks a lot!
So $z_\text{sem}$ is fixed to 1×1×512, a vector? I used to think that, since you use a CNN encoder, $z_\text{sem}$ would depend on the H and W of the input image, e.g., an a×b×512 feature. So the former (1×1×512) is what the paper uses for $z_\text{sem}$, no matter the H and W of the input image?
I still wonder: since there aren't any restrictions (or losses) forcing $z_\text{sem}$ to carry the semantic information, how does the network learn it? A bad situation could be that the semantic encoder always outputs an all-ones 512-dim vector (or rubbish values), and the stochastic DDIM minimizes the DDIM loss by simply ignoring the $z_\text{sem}$ passed from the semantic encoder.
> So $z_\text{sem}$ is fixed to 1×1×512, a vector? I used to think that, since you use a CNN encoder, $z_\text{sem}$ would depend on the H and W of the input image, e.g., an a×b×512 feature. So the former (1×1×512) is what the paper uses for $z_\text{sem}$, no matter the H and W of the input image?
Indeed, $z_\text{sem}$ is a 1×1×512 vector regardless of the image size. To keep $z_\text{sem}$ this size, you need to "scale" the encoder according to the image size, i.e. add another downsampling block to the encoder so that the size of $z_\text{sem}$ stays intact.
It's not an iron-clad rule to have $z_\text{sem}$ of exactly this size, though; there are some trade-offs. Imagine $z_\text{sem}$ with shape 4×4×512: $z_\text{sem}$ is now spatial. What's the problem with a spatial latent code? It no longer captures global semantics.
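One way to realize this, as a rough sketch (the final global average pooling and the exact block layout here are my assumptions, not necessarily what the repo does):

```python
import torch
import torch.nn as nn

def make_encoder(img_size: int, base: int = 64, z_dim: int = 512) -> nn.Sequential:
    # One extra stride-2 block for each doubling of the input resolution,
    # so the feature map reaches 4x4 before the final pooling.
    layers, ch, size = [nn.Conv2d(3, base, 3, padding=1)], base, img_size
    while size > 4:
        layers += [nn.Conv2d(ch, min(ch * 2, z_dim), 3, stride=2, padding=1), nn.SiLU()]
        ch = min(ch * 2, z_dim)
        size //= 2
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, z_dim)]
    return nn.Sequential(*layers)

# Both resolutions end up with the same (B, 512) z_sem.
z256 = make_encoder(256)(torch.randn(2, 3, 256, 256))  # -> (2, 512)
z512 = make_encoder(512)(torch.randn(2, 3, 512, 512))  # -> (2, 512)
```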
> I still wonder: since there aren't any restrictions (or losses) forcing $z_\text{sem}$ to carry the semantic information, how does the network learn it? A bad situation could be that the semantic encoder always outputs an all-ones 512-dim vector (or rubbish values), and the stochastic DDIM minimizes the DDIM loss by simply ignoring the $z_\text{sem}$ passed from the semantic encoder.
You hit a home run here! The best intuition I can offer is that your pathological scenario is NOT likely to happen spontaneously. Why? Because the DDIM's job is to denoise, which would be exponentially easier if the target image were known. A likely outcome is that the encoder is squeezed into capturing something useful for denoising, hence the results we have seen.
This is not a proof, though. We cannot rely on this intuition to guarantee that what works on face datasets will generalize to other kinds of datasets.
> how the autoencoder output vector aligns with semantics
For this, you need to annotate your dataset. The idea is simple: we hypothesize that the learned "semantic" space is linear. A semantic attribute is therefore a direction in that space (regardless of where you are in that space, due to the linearity assumption). You can derive these semantic directions in many ways, one of which is using a linear classifier trained on an attribute-annotated dataset. In our case, we trained linear classifiers on the CelebA-HQ dataset. You don't need to use the same dataset as the one used to train the generative model.
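As a rough sketch of that recipe (an assumed workflow, not our exact scripts; the arrays, labels, and step size below are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# z: (N, 512) array of z_sem codes from the encoder; y: (N,) binary attribute labels
# (e.g. "smiling"). Random placeholders here, just to make the snippet runnable.
z = np.random.randn(1000, 512)
y = np.random.randint(0, 2, size=1000)

clf = LogisticRegression(max_iter=1000).fit(z, y)
direction = clf.coef_[0] / np.linalg.norm(clf.coef_[0])  # attribute direction in z_sem space

# Edit: push one code toward the positive attribute, then decode it with the
# conditional DDIM (decoding omitted here). The step size 0.3 is arbitrary.
z_edit = z[0] + 0.3 * direction
```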
Training DiffAE on a different dataset, possibly in a different domain, is easy. The hard part is anticipating what you'll get. It's reasonable to assume that you'll get a good autoencoder with nice image compression capability. However, it's not a given that it will discover the same set of semantics that we humans see. Also, finding a semantic direction always requires labels on that dataset.
Hello! Nice work! May I ask a relatively stupid question about HOW $z_\text{sem}$ is added to the UNet? Let's say $h$ is the previous layer's output. In your paper, the $z_\text{sem}$ conditioning looks like:
$\text{out} = \text{Affine}(z_\text{sem}) \cdot \big(h \cdot \text{MLP}_1(\phi_1(t)) + \text{MLP}_2(\phi_2(t))\big)$
That is quite weird! Why choose a multiplication operation here?
I have a different understanding of how $z_\text{sem}$ is added:
$\text{temp} = h \cdot \text{MLP}_1(\phi_1(t)) + \text{MLP}_2(\phi_2(t))$
$(s, c) = \text{Affine}(z_\text{sem})$
$\text{out} = s \cdot \text{temp} + c$
Is this understanding right?
I don't know what the blue "times" in Appendix Figure 7 (a) means. I suspect it is not a "times" operation?