hubert0527 / COCO-GAN

COCO-GAN: Generation by Parts via Conditional Coordinating (ICCV 2019 oral)
https://hubert0527.github.io/COCO-GAN/
MIT License

Coordination computation #5

Closed penguinbing closed 4 years ago

penguinbing commented 4 years ago

I'm confused about the coordinate computation. Could you explain it more clearly? Thank you.

EliotChenKJ commented 4 years ago

Hi, me neither :).

To be specific, the supplementary of the paper mentions that c'' belongs to [-1.66, 1.66] in the extrapolation experiment (LSUN dataset, [N4, M4, S64]), but I don't understand the reason behind it (what is the meaning of 2 / (4 - 1)?). Also, there may be a typo in that description (the range of the micro coordinate was computed twice); you may want to take a look.

Hope the author can give us a relatively simple example of the entire coordinate system :).

hubert0527 commented 4 years ago

Actually, you can define any coordinate system for the micro/macro coordinates (for example, we also show that you can even use a cylindrical coordinate system), as long as the transformation between them is reasonable. It is actually super hard for me to clearly define what counts as a reasonable design.

To me, I feel like I just chose the most straightforward one. But to make it generic (supporting any N and M), the code ends up looking quite complex. In fact, each individual setting is quite simple (you can just print the coordinates out). Take the [N2, M2] setting as an example: we only use 16 constant micro coordinates and 9 constant macro coordinates, since the full image generation is split into 4x4 = 16 micro patch generations, while each macro patch is formed by combining 2x2 micro patches, so there are 3x3 = 9 possible combinations, like the following:

Micro coordinate system:
(-1, -1),      (-1, -0.33),      (-1, 0.33),      (-1, 1)
(-0.33, -1),   (-0.33, -0.33),   (-0.33, 0.33),   (-0.33, 1)
(0.33, -1),    (0.33, -0.33),    (0.33, 0.33),    (0.33, 1)
(1, -1),       (1, -0.33),       (1, 0.33),       (1, 1)

Macro coordinate system:
(-1, -1), (-1, 0), (-1, 1)
(0, -1),  (0, 0),  (0, 1)
(1, -1),  (1, 0),  (1, 1)
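
For reference, a minimal NumPy sketch (not from the repo; `coord_grid` is a hypothetical helper) that reproduces both constant grids:

```python
import numpy as np

def coord_grid(n):
    """n x n evenly spaced 2-D coordinates over [-1, 1] x [-1, 1]."""
    axis = np.linspace(-1, 1, n)                # e.g. [-1, -0.33, 0.33, 1] for n=4
    yy, xx = np.meshgrid(axis, axis, indexing="ij")
    return np.stack([yy, xx], axis=-1)          # shape: (n, n, 2)

micro_coords = coord_grid(4)  # 16 constant micro coordinates
macro_coords = coord_grid(3)  # 9 constant macro coordinates
print(np.round(micro_coords, 2))
```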

And the transformation from micro to macro ensures that, after multiple micro patches are stitched together, the position of the newly formed macro patch matches an existing constant coordinate in the macro coordinate system.

e.g., after the micro patches at [(-1, -1), (-1, -0.33), (-0.33, -1), (-0.33, -0.33)] are generated and composed into a macro patch M, the macro coordinate of M must be the top-left corner, which is (-1, -1).

In practice, I do the sampling in the macro coordinate system here, then map the sampled macro coordinates to micro coordinates here. Such an implementation ensures that I can do any sort of sampling I want in the macro coordinate system during training.
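
A hedged sketch of that mapping for the [N2, M2] layout above (`macro_to_micro` is a hypothetical helper, not the actual repo code): a macro patch at macro grid index i on an axis is composed of the micro patches at micro grid indices i and i+1.

```python
import numpy as np

MICRO_AXIS = np.linspace(-1, 1, 4)  # 4 micro positions per axis
MACRO_AXIS = np.linspace(-1, 1, 3)  # 3 macro positions per axis

def macro_to_micro(macro_y, macro_x):
    """Map one macro coordinate to the 2x2 micro coordinates composing it."""
    iy = int(np.abs(MACRO_AXIS - macro_y).argmin())  # macro grid index on y
    ix = int(np.abs(MACRO_AXIS - macro_x).argmin())  # macro grid index on x
    return [(MICRO_AXIS[iy + dy], MICRO_AXIS[ix + dx])
            for dy in (0, 1) for dx in (0, 1)]

# The top-left macro patch (-1, -1) decomposes into the 2x2 micro patches:
print(macro_to_micro(-1, -1))
# -> [(-1.0, -1.0), (-1.0, -0.33...), (-0.33..., -1.0), (-0.33..., -0.33...)]
```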

hubert0527 commented 4 years ago

@EliotChenKJ In the supplementary, the setting should be (N2, M2, S64); that is a typo that needs to be fixed. The logic behind the equation Y / Z = 2 / (4 - 1) is:

- Y is the length of the coordinate range [-1, 1], i.e., Y = 2.
- The full image resolution is 256.
- S64
    => The micro patch resolution is 64.
    => 4x4 micro patches form a full image.
- The distance between two consecutive patches is defined by the distance between their centers.
    => With 4 patches per axis, the span between the left-most and the right-most patch centers contains (4 - 1) intervals [Z comes from here]; see the quick check below.
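
A quick numeric check of where the [-1.66, 1.66] range comes from (assuming one extra patch of extrapolation beyond each border, as in the beyond-boundary experiment):

```python
spacing = 2 / (4 - 1)  # distance between adjacent micro-patch centers: ~0.67
outer = 1 + spacing    # center of the first extrapolated patch beyond the border
print(round(spacing, 2), round(outer, 2))  # 0.67 1.67, i.e. c'' in roughly [-1.66, 1.66]
```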

> the range of micro coordinate was computed twice

Yes, thanks for the reminder; the latter one should be the macro coordinate.

EliotChenKJ commented 4 years ago

@hubert0527 Hi~ thanks for your patient answers. It's so nice of you to explain the coordinate system in such detail; I can really understand it now. There are still some questions I have, and I hope you can help me get through them:

Lastly, thank you for presenting us with such amazing work.

hubert0527 commented 4 years ago

> Can I explain the COCO-GAN like...

Yes and no. Actually, the CNN weights (i.e., shared representations, shared features) are shared across all coordinates; the only two differences are (a) the input coordinates, and (b) the conditional batch norm parameters.
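
For intuition, here is a minimal PyTorch sketch of coordinate-conditional batch norm (the repo itself is TensorFlow, and the exact conditioning there may differ; this only illustrates the general idea):

```python
import torch
import torch.nn as nn

class CoordConditionalBN(nn.Module):
    """BatchNorm whose scale/shift are predicted from the patch coordinate,
    so every conv weight stays shared across all coordinates."""
    def __init__(self, num_features, coord_dim=2):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.gamma = nn.Linear(coord_dim, num_features)  # per-coordinate scale
        self.beta = nn.Linear(coord_dim, num_features)   # per-coordinate shift

    def forward(self, x, coord):
        h = self.bn(x)                         # x: (B, C, H, W), coord: (B, 2)
        g = self.gamma(coord)[:, :, None, None]
        b = self.beta(coord)[:, :, None, None]
        return (1 + g) * h + b
```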

At first glance, people usually think that the generator only learns to generate different organs (in the human faces example) for different coordinates. We show that this is not the case with experiments on the CelebA-syn dataset, in which the human faces are not aligned at all. Furthermore, the LSUN bedroom images are also not aligned.

Actually, it is awkward to think that the conditional distribution of each coordinate is significantly different from the others; the generator just learns the conditional distribution of whatever individual coordinate is presented to it.

Alternatively, I prefer to explain the whole thing more like this: the generator still learns the conventional GAN mapping (from a latent variable to an image; this is true for COCO-GAN at testing time, right?), but with an additional conditional coordinate input to query which part of the image to generate and to train with.
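
As a toy illustration of that view (purely illustrative shapes and names, nothing from the repo): the same generator is queried once per micro coordinate with a single shared latent.

```python
import torch
import torch.nn as nn

class ToyCoordGenerator(nn.Module):
    """Maps (z, coordinate) -> one 3 x 16 x 16 micro patch."""
    def __init__(self, z_dim=64, patch=16):
        super().__init__()
        self.patch = patch
        self.net = nn.Sequential(
            nn.Linear(z_dim + 2, 256), nn.ReLU(),
            nn.Linear(256, 3 * patch * patch), nn.Tanh(),
        )

    def forward(self, z, coord):
        h = self.net(torch.cat([z, coord], dim=-1))
        return h.view(-1, 3, self.patch, self.patch)

G = ToyCoordGenerator()
z = torch.randn(1, 64)                                # one latent for all patches
coords = torch.tensor([[-1.0, -1.0], [-1.0, -0.33]])  # two micro coordinates
patches = [G(z, c.unsqueeze(0)) for c in coords]      # query the image part by part
```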

> What are the ground truth macro patches in the "beyond-boundary generation" experiment?

For those macro patches that exceed the image boundary (i.e., outside the 256x256 area), there is no ground truth. The discriminator implicitly learns the rule that there shouldn't be any visible seams between consecutive micro patches, and the post-training enforces the generator to respect that rule even outside the image boundary. And, to keep the discriminator from forgetting the rule (consecutive patches must be continuous), the weights of most of the discriminator layers are frozen during the post-training.

Thus, since there is no ground truth, the post-training cannot be continued forever; at some point, the generator will start to exploit the discriminator. You will have to stop the post-training at some point (this requires a heuristic decision).
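
A hedged sketch of the freezing step (illustrative PyTorch; the actual repo is TensorFlow and may select layers differently):

```python
import torch.nn as nn

def freeze_for_post_training(discriminator: nn.Module, n_trainable_tail: int = 1):
    """Freeze all but the last few discriminator blocks so the learned
    continuity rule (no seams between adjacent patches) stays fixed."""
    blocks = list(discriminator.children())
    for block in blocks[:len(blocks) - n_trainable_tail]:
        for p in block.parameters():
            p.requires_grad = False
```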

EliotChenKJ commented 4 years ago

It's really clear and I understand it much better now. Sincerely, thank you for answering these questions so quickly and clearly :).