mu-cai opened this issue 4 years ago
Yes, in many cases the effect of the texture code is quite restrictive. I suspect that differences in the training scheme may be crucial, but I don't know what they could be. (Maybe it is related to image/patch resolutions.)
@rosinality Thanks for the information! Over the past few days I trained your model on the CelebA-HQ dataset, and the result is quite bad. [result images omitted] Each row shows original A, original B, reconstructed A, and structure A + texture B. You can see that the resulting image cannot even preserve the pose of A.
I also trained it on the LSUN Church dataset, and the result is not good either. [result images omitted]
You can see that the reconstruction quality is poor, not to mention the swapping results.
I think there may be several possible issues:
(1) Padding — someone has already pointed that out.
(2) The crop. For the Church dataset they do not resize the image to a square first; instead, they crop it. I also wonder how they keep the original aspect ratio while fixing the short side to 256.
(3) The co-occurrence unit. You can see what the paper states about it [screenshot from the paper omitted], and for each prediction they perform the following operation [screenshot from the paper omitted]. So this operation should be done 8 times.
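To make point (3) concrete, here is a rough sketch of how I read the prediction step. The helper names (`encode`, `classify`) and the exact patch counts (8 fake crops, 4 reference crops each) are my assumptions, not the authors' code:

```python
import torch

# Hypothetical sketch of the co-occurrence discriminator's prediction step.
# encode: patch batch -> feature vectors; classify: concatenated features -> logit.
# Shapes and counts below are assumptions, not the authors' implementation.
def cooccur_predictions(fake_patches, ref_patches, encode, classify):
    # fake_patches: [N, 8, C, H, W]    -- 8 crops per fake image
    # ref_patches:  [N, 8, 4, C, H, W] -- 4 reference crops per fake crop
    N, P = fake_patches.shape[:2]
    feat = encode(fake_patches.flatten(0, 1))         # [N*8, D]
    ref = encode(ref_patches.flatten(0, 2))           # [N*8*4, D]
    ref = ref.view(N * P, 4, -1).mean(dim=1)          # average the 4 reference features
    logits = classify(torch.cat([feat, ref], dim=1))  # [N*8, 1]
    return logits  # 8N predictions in total, one per fake crop, not N
```

If the code only produces $N$ predictions, each fake image contributes a single patch comparison per step, which is what I believe mismatches the paper.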
Thanks again for your nice work!
Best, Mu
Thank you for testing and checking!
@rosinality
Thanks for your reply!
You already fixed this problem yesterday (the fix is committed).
Yes, the shorter side is 256, but the longer side is not fixed. However, during training your code resizes the image into a square, which changes the aspect ratio of the two sides and mismatches the paper. A sketch of the preprocessing I have in mind is below.
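For what it's worth, torchvision's `Resize` with a single integer argument already matches the shorter side and scales the longer side proportionally. A minimal sketch, assuming a 256 random crop after the ratio-preserving resize (the transform chain is my assumption, not the authors' code):

```python
from torchvision import transforms

# Assumed preprocessing: resize so the SHORTER side becomes 256 while the
# aspect ratio is preserved, then take a random 256x256 crop, instead of
# squashing the whole image into a 256x256 square.
preprocess = transforms.Compose([
    transforms.Resize(256),      # int argument -> shorter side = 256, ratio kept
    transforms.RandomCrop(256),  # square crop from the rectangular image
    transforms.ToTensor(),
])
```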
Yes! Your single operation should be done 8 times, because when you sample a patch from one real image and 8 patches from the fake image, you get just one prediction. You need $8N$ predictions, not $N$.
Thanks again for your answer!
Best, Mu
@rosinality
Thanks for the quick update! I have just run your code, and I have one more question: in your code, for each structure/texture pair you take 8 crops of the real/fake images but only 4 crops of the reference image. However, I think that for each crop of the real/fake image we need 4 reference patches; that is, we need 4 × 8 = 32 patches in total.
This is my understanding; however, the authors did not state this in the paper... what is your opinion?
Mu
Hmm, maybe you are right. But since the model uses the mean of the reference image vectors, it may not be very different from using distinct reference patches for each sample. (Hopefully.)
I have changed the code to use distinct reference samples for each sample. It is less resource-consuming than I thought, and I suspect it is a more robust way to do the training.
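For reference, a hypothetical sketch of sampling distinct reference patches for every fake crop. The 1/8–1/4 size range, the 64-pixel patch resolution, and the counts are my assumptions, not necessarily what the repository does:

```python
import torch
import torch.nn.functional as F

# Sample n_patches random-size crops per image and resize each to a fixed
# patch resolution so they can be batched. Sizes between 1/8 and 1/4 of the
# image side are an assumption about the paper's patch scale.
def sample_patches(images, n_patches, patch_size=64, min_frac=1/8, max_frac=1/4):
    # images: [N, C, H, W] -> [N, n_patches, C, patch_size, patch_size]
    N, C, H, W = images.shape
    out = []
    for img in images:
        crops = []
        for _ in range(n_patches):
            frac = torch.empty(1).uniform_(min_frac, max_frac).item()
            ch, cw = int(H * frac), int(W * frac)
            top = torch.randint(0, H - ch + 1, (1,)).item()
            left = torch.randint(0, W - cw + 1, (1,)).item()
            crop = img[:, top:top + ch, left:left + cw].unsqueeze(0)
            crops.append(F.interpolate(crop, size=patch_size,
                                       mode='bilinear', align_corners=False))
        out.append(torch.cat(crops, dim=0))
    return torch.stack(out, dim=0)

# Distinct references: sample 8 * 4 = 32 patches from each texture image and
# view them as [N, 8, 4, C, p, p], so each of the 8 fake crops gets its own
# 4 reference patches -- the 4 x 8 = 32 count discussed above.
```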
@rosinality
Thanks for your work! Yes, in my opinion, if there are enough training iterations, fixed reference samples should produce the same result as distinct reference samples. I also think the model will be more robust with distinct reference patches. I am also surprised that the GPU memory does not increase much when doing so.
Mu
My TensorFlow implementation: https://github.com/zhangqianhui/Swapping-Autoencoder-tf. I hope it helps.
Hi @mu-cai,
Did the above corrections lead to better structure/texture swapping results on your side?
Hi Rosinality,
Thanks for your excellent code! However, I found that the results on human face images are not good. For example, your results are: [result images omitted]
(It is obvious that the eyeglasses in the generated images differ from what the paper claims.)
Maybe the problem is the training scheme? Or the cropping method?