mu-cai opened this issue 4 years ago
Yes, in many cases the effect of the texture code is quite restrictive. I suspect that differences in the training scheme may be crucial, but I don't know what they could be. (Maybe it is related to image/patch resolutions.)
@rosinality Thanks for the information! Over the past few days I trained your model on the CelebA-HQ dataset, and the result is quite bad. [result images omitted] Each row shows original A, original B, reconstructed A, and structure A + texture B. You can see that the resulting image cannot even preserve the pose of A.
I also trained it on the LSUN Church dataset, and the result is not good either. [result images omitted]
You can see that the reconstruction quality is poor, not to mention the swapping results.
I think there may be several possible issues:
(1) Padding — someone has already pointed that out.
(2) The crop. For the Church dataset they do not resize the image to a square first; instead, they crop it. I also wonder how they keep the original aspect ratio while fixing the short side to 256.
(3) The co-occurrence unit. You can see what the paper states about it [screenshot from the paper omitted], and for each prediction they perform the following operation [screenshot from the paper omitted]. So this operation should be done 8 times.
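To make point (3) concrete, here is a rough sketch of how I read the prediction step. The helper names (`encode`, `classify`) and the exact patch counts (8 fake crops, 4 reference crops each) are my assumptions, not the authors' code:

```python
import torch

# Hypothetical sketch of the co-occurrence discriminator's prediction step.
# encode: patch batch -> feature vectors; classify: concatenated features -> logit.
# Shapes and counts below are assumptions, not the authors' implementation.
def cooccur_predictions(fake_patches, ref_patches, encode, classify):
    # fake_patches: [N, 8, C, H, W]    -- 8 crops per fake image
    # ref_patches:  [N, 8, 4, C, H, W] -- 4 reference crops per fake crop
    N, P = fake_patches.shape[:2]
    feat = encode(fake_patches.flatten(0, 1))         # [N*8, D]
    ref = encode(ref_patches.flatten(0, 2))           # [N*8*4, D]
    ref = ref.view(N * P, 4, -1).mean(dim=1)          # average the 4 reference features
    logits = classify(torch.cat([feat, ref], dim=1))  # [N*8, 1]
    return logits  # 8N predictions in total, one per fake crop, not N
```

If the code only produces $N$ predictions, each fake image contributes a single patch comparison per step, which is what I believe mismatches the paper.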
Thanks again for your nice work!
Best, Mu
Thank you for testing and checking!
@rosinality
Thanks for your reply!
You already fixed this problem yesterday (the fix is committed).
Yes, the shorter side is 256, but the longer side is not fixed. However, during training your code resizes the image into a square, which changes the aspect ratio of the two sides and mismatches the paper. A sketch of the preprocessing I have in mind is below.
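For what it's worth, torchvision's `Resize` with a single integer argument already matches the shorter side and scales the longer side proportionally. A minimal sketch, assuming a 256 random crop after the ratio-preserving resize (the transform chain is my assumption, not the authors' code):

```python
from torchvision import transforms

# Assumed preprocessing: resize so the SHORTER side becomes 256 while the
# aspect ratio is preserved, then take a random 256x256 crop, instead of
# squashing the whole image into a 256x256 square.
preprocess = transforms.Compose([
    transforms.Resize(256),      # int argument -> shorter side = 256, ratio kept
    transforms.RandomCrop(256),  # square crop from the rectangular image
    transforms.ToTensor(),
])
```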
Yes! Your single operation should be done 8 times, because when you sample a patch from one real image and 8 patches from the fake image, you get just one prediction. You need $8N$ predictions, not $N$.
Thanks again for your answer!
Best, Mu
@rosinality
Thanks for the quick update! I have just run your code, and I have one more question: in your code, for each structure/texture pair you take 8 crops of the real/fake images but only 4 crops of the reference image. However, I think that for each crop of the real/fake image we need 4 reference patches; that is, we need 4 × 8 = 32 patches in total.
This is my understanding; however, the authors did not state this in the paper... what is your opinion?
Mu
Hmm, maybe you are right. But since the model uses the mean of the reference image vectors, it may not be very different from using distinct reference patches for each sample. (Hopefully.)
I have changed the code to use distinct reference samples for each sample. It is less resource-consuming than I thought, and I suspect it is a more robust way to do the training.
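For reference, a hypothetical sketch of sampling distinct reference patches for every fake crop. The 1/8–1/4 size range, the 64-pixel patch resolution, and the counts are my assumptions, not necessarily what the repository does:

```python
import torch
import torch.nn.functional as F

# Sample n_patches random-size crops per image and resize each to a fixed
# patch resolution so they can be batched. Sizes between 1/8 and 1/4 of the
# image side are an assumption about the paper's patch scale.
def sample_patches(images, n_patches, patch_size=64, min_frac=1/8, max_frac=1/4):
    # images: [N, C, H, W] -> [N, n_patches, C, patch_size, patch_size]
    N, C, H, W = images.shape
    out = []
    for img in images:
        crops = []
        for _ in range(n_patches):
            frac = torch.empty(1).uniform_(min_frac, max_frac).item()
            ch, cw = int(H * frac), int(W * frac)
            top = torch.randint(0, H - ch + 1, (1,)).item()
            left = torch.randint(0, W - cw + 1, (1,)).item()
            crop = img[:, top:top + ch, left:left + cw].unsqueeze(0)
            crops.append(F.interpolate(crop, size=patch_size,
                                       mode='bilinear', align_corners=False))
        out.append(torch.cat(crops, dim=0))
    return torch.stack(out, dim=0)

# Distinct references: sample 8 * 4 = 32 patches from each texture image and
# view them as [N, 8, 4, C, p, p], so each of the 8 fake crops gets its own
# 4 reference patches -- the 4 x 8 = 32 count discussed above.
```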
@rosinality
Thanks for your work! Yes, in my opinion, if there are enough training iterations, fixed reference samples should produce the same result as distinct reference samples. I also think the model will be more robust with distinct reference patches. I am also surprised that the GPU memory does not increase much when doing so.
Mu
My TensorFlow implementation: https://github.com/zhangqianhui/Swapping-Autoencoder-tf. I hope it helps.
Hi @mu-cai,
Did the above corrections lead to better structure/texture swapping results on your side?
Hi Rosinality,
Thanks for your excellent code! However, I found that the results on human face images are not good. For example, your results are: [result images omitted]
(It is obvious that the eyeglasses in the generated images differ from what the paper claims.)
Maybe the problem is the training scheme? Or the cropping method?