Open HyZhu39 opened 3 years ago
There are indeed some differences between the cleaned code and the original code, but I believe they make the results better rather than worse. Sorry for that, and I will do my best to help you reproduce the quantitative results. I will respond to you tomorrow, please wait.
Thanks for your attention and your quick reply. I look forward to it!
@HyZhu39 Hello, how did you get the FID between the images generated by the 5 style codes and the real images? The generated images for the 5 style codes should be put into 5 separate folders, and the average FID calculated between each of them and the real images. Each folder has the same number of images as the original source images. For disentanglement in our experiments, the reference-guided style codes are randomly sampled from all images with bangs.
Actually, I put them all in one folder and calculated the FID between the two folders as the result, and for the disentanglement experiments I just selected reference images from the test images with bangs. Thanks for pointing that out. I'll have a try as you said and tell you the results. I think what you said is exactly the point. Thanks again.
You're welcome. Since the same identities appear repeatedly in one folder, the FID (which uses the variance of the image features) would definitely become larger.
Sorry for bothering you again. I put the generated images into separate folders according to the style code they used and tested again, but it seems that the results are getting worse... That's weird, I think. I did two groups of experiments with the self-trained model I used in my first comment.
experiment 1:
realism:
(input images: all images with attribute "without_bangs" in the test images (the first 3000 images), translated to "with_bangs";
reference images: 5 images with attribute "with_bangs" randomly sampled from all images;
FID calculated against: all images with attribute "with_bangs" in the test images, resized to 128×128)
L: R: G:
0: 26.45 26.59
1: 26.47 26.64
2: 26.44 27.04
3: 26.84 28.99
4: 25.90 26.38
average: 26.42 27.13 0.71
(randomly chosen reference images: 5645.jpg, 6245.jpg, 13652.jpg, 14380.jpg, 27363.jpg)
disentanglement:
(input images: all images with attributes "without_bangs", "young", "male" in the test images, translated to "with_bangs";
reference images: 5 images with attribute "with_bangs" randomly sampled from all images;
FID calculated against: all images with attributes "with_bangs", "young", "male" in the test images, resized to 128×128)
L: R: G:
0: 88.79 87.49
1: 88.28 85.61
2: 87.23 92.51
3: 89.40 86.07
4: 88.30 88.11
average: 88.40 87.96 0.44
(randomly chosen reference images: 426.jpg, 19849.jpg, 22869.jpg, 26513.jpg, 28732.jpg)
experiment 2:
realism: same settings as experiment 1;
L: R: G:
0: 27.53 26.78
1: 32.38 26.40
2: 25.72 31.98
3: 28.18 27.48
4: 26.58 27.02
average: 28.08 27.93 0.17
(randomly chosen reference images: 5645.jpg, 6245.jpg, 13652.jpg, 14380.jpg, 27363.jpg)
disentanglement: same settings as experiment 1;
L: R: G:
0: 86.59 86.61
1: 89.13 90.18
2: 85.41 94.21
3: 89.02 87.94
4: 86.36 91.12
average: 87.30 90.01 2.71
(randomly chosen reference images: 923.jpg, 1232.jpg, 12886.jpg, 24491.jpg, 26797.jpg)
I resized and saved the real images used for the FID comparison the same way "easy_use.py" does:

```python
from PIL import Image
from torchvision import transforms
import torchvision.utils as vutils

transform = transforms.Compose([transforms.Resize(image_size), transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
x = transform(Image.open('image_save_path here').convert('RGB')).unsqueeze(0)
vutils.save_image((x + 1) / 2, save_path, padding=0)
```
By the way, I trained the model on a single GTX 1080Ti 11GB GPU for 200,000 iterations with the config file celeba-hq.yaml.
Actually, you need to randomly sample the reference image for each source image. If you sample only one reference image to translate all the source images into 'with_bangs', the bangs in the translated folder will all be the same, right? So the process should be like:

for i in range(5):
    for each source image x:
        randomly sample a reference image y
        translate x using y as reference
    calculate FID
calculate average FID

So the problem may be that you put "randomly sample a reference image y" before the loop over the source images.
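To make the loop above concrete, here is a minimal Python sketch of the protocol; `translate_fn`, `save_run_fn`, and `fid_fn` are hypothetical placeholders standing in for the actual HiSD inference and FID code, not functions from this repository:

```python
import random

def average_reference_guided_fid(source_images, reference_pool, real_dir,
                                 translate_fn, save_run_fn, fid_fn, num_runs=5):
    """Average FID over several runs, re-sampling the reference for every source image.

    translate_fn(x, y) -> translated image, save_run_fn(fakes, run) -> folder path,
    fid_fn(real_dir, fake_dir) -> float; all three are hypothetical placeholders.
    """
    fids = []
    for run in range(num_runs):
        fakes = []
        for x in source_images:
            y = random.choice(reference_pool)    # a new reference for EACH source image
            fakes.append(translate_fn(x, y))     # reference-guided translation
        fake_dir = save_run_fn(fakes, run)       # one output folder per run
        fids.append(fid_fn(real_dir, fake_dir))  # FID of this run against the real folder
    return sum(fids) / len(fids)                 # average FID over the runs
```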
Thank you very much for your patience and help, I will try again as soon as possible and give you feedback.
Sorry for the mistakes I made and for my misunderstanding of your experiment settings; I think I truly understand them now. I randomly sampled the reference image for each source image following the logic you proposed. Then I redid the experiments as you said and got relatively more stable results than before:
realism:
L: R: G:
group 1:
0: 25.70 25.66
1: 25.60 25.53
2: 25.48 25.56
3: 25.69 25.58
4: 25.53 25.74
avg: 25.60 25.61 0.01
group 2:
0: 25.61 25.60
1: 25.53 25.61
2: 25.60 25.55
3: 25.65 25.66
4: 25.60 25.63
avg: 25.60 25.61 0.01
disentanglement:
L: R: G:
group 1:
0: 85.71 84.91
1: 86.57 84.96
2: 86.14 85.51
3: 85.50 85.61
4: 86.56 85.51
avg: 86.10 85.30 0.80
group 2:
0: 85.89 85.87
1: 86.41 85.13
2: 86.11 85.91
3: 85.88 84.57
4: 87.12 85.80
avg: 86.28 85.46 0.82
However, the results are still much worse than the paper's. I think there might be something wrong with my training stage, so maybe I should re-train the model and have another try. Still, considering that I used exactly the same hardware and exactly the same training settings yet got worse results, there is also a possibility that, because the training code has been changed, the previous training settings may no longer fully converge the current model. (In fact, judging from the loss curves during training, the adversarial losses of the generator and discriminator are quite unstable, though this might also be due to the characteristics of the GAN structure itself.)
In fact, I don't know much about image translation; I'm just a beginner in this area, and I hope you don't get annoyed by my ignorance.
It's always encouraged to ask in research. Can you share the qualitative results of your self-trained checkpoint here?
Many thanks for your help. I have packed some qualitative results and the images from my quantitative experiments (if you need them) in the following Baidu Yun link. Thank you for your willingness to help. https://pan.baidu.com/s/1r1deZsdbJ4RgFhTXRUKjpQ Extraction code: HISD and my checkpoint file (if needed): https://pan.baidu.com/s/1C6_Pm-gEpwGQFRDaMBDNNg Extraction code: HISD
The qualitative results seem promising. I calculate FID using StarGANv2's script. I checked the difference between StarGANv2's script and pytorch-fid and found that there is a preprocessing step in the former, which is:
```python
from torch.utils import data
from torchvision import transforms


def get_eval_loader(root, img_size=256, batch_size=32,
                    imagenet_normalize=True, shuffle=True,
                    num_workers=4, drop_last=False):
    print('Preparing DataLoader for the evaluation phase...')

    if imagenet_normalize:
        # Inception-V3 expects 299x299 inputs normalized with ImageNet statistics
        height, width = 299, 299
        mean = [0.485, 0.456, 0.406]
        std = [0.229, 0.224, 0.225]
    else:
        height, width = img_size, img_size
        mean = [0.5, 0.5, 0.5]
        std = [0.5, 0.5, 0.5]

    transform = transforms.Compose([
        transforms.Resize([img_size, img_size]),
        transforms.Resize([height, width]),
        transforms.ToTensor(),
        transforms.Normalize(mean=mean, std=std)
    ])

    dataset = DefaultDataset(root, transform=transform)
    return data.DataLoader(dataset=dataset,
                           batch_size=batch_size,
                           shuffle=shuffle,
                           num_workers=num_workers,
                           pin_memory=True,
                           drop_last=drop_last)
```
So there may be a preprocessing step (a simple normalization) that you need to add to your code. Let me know the results; I think we are close to making it.
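If it helps, I believe the key difference boils down to the transform below; a minimal sketch assuming img_size = 128 as used in this thread (the exact loader details may differ):

```python
from torchvision import transforms

# StarGANv2-style preprocessing before the Inception network: resize to the working
# resolution, then to 299x299, and normalize with ImageNet statistics.
fid_eval_transform = transforms.Compose([
    transforms.Resize([128, 128]),                     # img_size (assumed 128 here)
    transforms.Resize([299, 299]),                     # Inception-V3 input resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet mean
                         std=[0.229, 0.224, 0.225]),   # ImageNet std
])
```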
Sorry for the delayed reply. Indeed, the missing preprocessing was what caused the worse FID results. With StarGANv2's script, the FID results of my latest experiments improved to:
group 1:
realism: L: 21.27, R: 21.34, G: 0.07
disentanglement: L: 72.55, R: 72.51, G: 0.04
group 2:
realism: L: 21.28, R: 21.24, G: 0.04
disentanglement: L: 72.31, R: 72.33, G: 0.02
compared to the paper's results:
Realism: L: 21.37, R: 21.49, G: 0.12
Disentanglement: L: 71.85, R: 71.48, G: 0.37
Though the disentanglement results are still a little worse, I am not sure about the normal range of FID fluctuations; maybe it's acceptable?
I do think this is acceptable. In the paper, we also discuss the contradiction between Realism and Disentanglement (see Sec. 4.3 about the model without tag-irrelevant conditions). Therefore, achieving better results in both Realism and Disentanglement also surprised me at the beginning. After all, the differences between the released code and the original one are:
I've changed the README to clarify the corrected FID script used for the quantitative results. Thank you for your enthusiastic reproduction!
Thank you for your help again. It is thanks to your selfless help that I could successfully reproduce your experimental results. We communicate in English here for the convenience of other readers. Here I would like to thank you again privately: thank you for your patient help all along, and I sincerely wish you success in your future research work!
You too!
Could you share the images of your quantitative experiment again? The Baidu Yun link is invalid. I am also reproducing the quantitative experiment results in the paper following this issue, but I cannot get results close to the paper's.
@oldrive What's your detailed setting for your reproduction?
config: celeba-hq.yaml
checkpoint: checkpoint_128_celeba-hq.pt
FID script: fid.py from StarGANv2, used to compute the FID between fake_images and real_images

realism FID of L:
fake_images = [latent_images_0, latent_images_1, latent_images_2, latent_images_3, latent_images_4], where latent_images_i is generated from test_bangs_without (according to Test_Bangs_without.txt) using a random latent code as guide.
real_images = [test_bangs_with images, according to Test_Bangs_with.txt]
realism_latent_fid_average = ( fid(fake_images[0], real_images) + ... + fid(fake_images[4], real_images) ) / 5

realism FID of R:
fake_images = [reference_images_0, reference_images_1, reference_images_2, reference_images_3, reference_images_4], where reference_images_i is generated with random reference guidance, the references coming from all images with bangs according to Bangs_with.txt and Test_Bangs_with.txt.
real_images = [test_bangs_with images, according to Test_Bangs_with.txt]
realism_reference_fid_average = ( fid(fake_images[0], real_images) + ... + fid(fake_images[4], real_images) ) / 5
The realism FID results are as follows:
Group 1:
realism_fid_latent_0: 31.692982996524883
realism_fid_latent_1: 31.671476972145367
realism_fid_latent_2: 31.620433186098698
realism_fid_latent_3: 31.629911284997206
realism_fid_latent_4: 31.73387679777522
realism_fid_reference_0: 32.591734278849
realism_fid_reference_1: 32.215290934387426
realism_fid_reference_2: 32.18949934088806
realism_fid_reference_3: 32.287988762946526
realism_fid_reference_4: 32.304219580808336
realism_fid_latent_average: 31.669736247508276
realism_fid_reference_average: 32.31774657957587
Group 2:
realism_fid_latent_0: 31.642293517652654
realism_fid_latent_1: 31.623934807071
realism_fid_latent_2: 31.68461378392377
realism_fid_latent_3: 31.631847657251797
realism_fid_latent_4: 31.67548435280436
realism_fid_reference_0: 32.29246639585722
realism_fid_reference_1: 32.288538090496914
realism_fid_reference_2: 32.11632434611198
realism_fid_reference_3: 32.15312062309697
realism_fid_reference_4: 32.23484964483734
realism_fid_latent_average: 31.651634823740714
realism_fid_reference_average: 32.21705982008008
What's the command you used to calculate the FID?
Just like this: latent_fid_value = calculate_fid_given_paths([real_path, fake_latent_path[i]], args.img_size, args.batch_size)
The "args.img_size" is set to be 128, right?
The "args.img_size" is set to be 128, right?
Right. parser.add_argument('--img_size', type=int, default=128, help='image resolution')
What about the qualitative results?
The results were shared in my reply above.
I mean the visual results.
Oh, I misunderstood what you meant.
Some results of realism_latent_0 are here, and some results of realism_reference_0 are here.
Every image in a folder has a different style of bangs.
The visual results seem normal. Please change the image size used in FID to 256 or 224. I don't quite remember the setting here, since the Inception network is trained at a specific resolution.
I'll have a try as you said and tell you the results. Thanks for your reply!
Sorry for bothering you again. I computed the realism FID for two groups. Group 1 uses the argument --img_size = 256 and computes the FID between the fake images (256×256, generated with the 256 config and 256 checkpoint) and the real images; group 2 uses the same argument --img_size = 256 and computes the FID between the fake images (128×128, generated with the 128 config and 128 checkpoint) and the real images. But it seems that the results are getting worse... That is so weird.
realism_fid (256×256 fake images vs. real images, fid(fake_images, real_images, args.img_size = 256)):
realism_fid_latent_0: 37.70455934722888
realism_fid_reference_0: 38.05122125169506
realism_fid_latent_1: 37.59272856627348
realism_fid_reference_1: 37.81830888013152
realism_fid_latent_2: 37.698022304952914
realism_fid_reference_2: 38.03778528813959
realism_fid_latent_3: 37.610822585752246
realism_fid_reference_3: 38.0628089612687
realism_fid_latent_4: 37.688711544348806
realism_fid_reference_4: 37.91353803968795
realism_fid_latent_average: 37.65896886971126
realism_fid_reference_average: 37.976732484184566

realism_fid (128×128 fake images vs. real images, fid(fake_images, real_images, args.img_size = 256)):
realism_fid_latent_0: 69.20908546448136
realism_fid_reference_0: 69.23383364990423
realism_fid_latent_1: 69.11336443484716
realism_fid_reference_1: 69.34028775602908
realism_fid_latent_2: 69.18649394941102
realism_fid_reference_2: 69.52593890927548
realism_fid_latent_3: 69.09191563199727
realism_fid_reference_3: 69.40835510741587
realism_fid_latent_4: 69.0797907168618
realism_fid_reference_4: 69.28953132218695
realism_fid_latent_average: 69.13613003951971
realism_fid_reference_average: 69.35958934896232
Could you share some real images in test_bangs_with.txt?
Here are the first five images listed in test_bangs_with:
@oldrive I don't know if this is the reason. In my experiments, the real images are also resized to the target resolution first and saved in a folder, just like @HyZhu39 did:

"I resized and saved the real images used for the FID comparison the same way easy_use.py does:"

```python
from PIL import Image
from torchvision import transforms
import torchvision.utils as vutils

transform = transforms.Compose([transforms.Resize(image_size), transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
x = transform(Image.open('image_save_path here').convert('RGB')).unsqueeze(0)
vutils.save_image((x + 1) / 2, save_path, padding=0)
```

Could you have a try?
Before computing the FID with the real images, I did not resize them or save them to a folder. I'll have a try.
Oh, that was indeed the reason, and now I get realism and disentanglement FID results much closer to the paper's.
realism_fid:
realism_fid_latent_0: 20.912922731584082
realism_fid_reference_0: 21.046019767649355
realism_fid_latent_1: 20.76848449633095
realism_fid_reference_1: 21.04662247713575
realism_fid_latent_2: 20.800978320503397
realism_fid_reference_2: 21.0600877899802
realism_fid_latent_3: 20.775910991635065
realism_fid_reference_3: 20.92837823926883
realism_fid_latent_4: 20.68396588649034
realism_fid_reference_4: 20.94170026977707
realism_fid_latent_average: 20.788452485308767
realism_fid_reference_average: 21.004561708762242
disentangle_fid:
disentangle_fid_latent_0: 71.39510730377387
disentangle_fid_reference_0: 70.64971902519095
disentangle_fid_latent_1: 71.06008491519601
disentangle_fid_reference_1: 70.88973207966575
disentangle_fid_latent_2: 71.40558227571222
disentangle_fid_reference_2: 71.33517553604398
disentangle_fid_latent_3: 71.2109615470645
disentangle_fid_reference_3: 71.0546303462186
disentangle_fid_latent_4: 71.48734756970637
disentangle_fid_reference_4: 71.08293051285575
disentangle_fid_latent_average: 71.31181672229059
disentangle_fid_reference_average: 71.00243749999501
Thank you again for your patient help and quick replies. It is only with your help that I could reproduce the quantitative experiment results in the paper. My heartfelt thanks to the author for the enthusiastic help; I wish you smooth sailing in your future research!
Ideally these two resizing steps should give the same result. I think the reason may be the transforms.Resize module: as this link says, when the input is a PIL image, the resize function uses antialias mode by default.
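For anyone reproducing this later, here is a minimal sketch of the pre-resizing step discussed in this thread (the folder names and image_size are placeholders): save the real images through the same PIL-based Resize used for the generated images, so both folders go through identical interpolation before the FID script applies its own preprocessing.

```python
import os
from PIL import Image
from torchvision import transforms
import torchvision.utils as vutils

src_dir, dst_dir, image_size = 'real_images/', 'real_images_resized/', 128  # placeholders
os.makedirs(dst_dir, exist_ok=True)

transform = transforms.Compose([
    transforms.Resize(image_size),                           # antialiased resize on PIL input
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # same normalization as easy_use.py
])

for name in os.listdir(src_dir):
    x = transform(Image.open(os.path.join(src_dir, name)).convert('RGB')).unsqueeze(0)
    vutils.save_image((x + 1) / 2, os.path.join(dst_dir, name), padding=0)
```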
You're welcome, and thank you as well for your interest in this work. All the best!
Sorry to disturb you. I am reproducing the experimental results of this paper. There are 568 real images with bangs and 2432 images without bangs. After translation, I get 2432 images with bangs. May I ask whether I should directly calculate the FID between these two image sets, or select 568 images from the 2432 for the calculation? Looking forward to your reply. Thank you!
Yes, you can calculate it directly. The FID evaluation separately estimates the feature mean and covariance of each folder, so you don't need to worry about the different numbers of images. @zhushuqi2333
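For reference, this is just the standard FID definition, which only uses each folder's Inception-feature mean and covariance, so the two folders may contain different numbers of images:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

where (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) are estimated from the real and the generated folder respectively.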
Thank you for your reply~ I have carefully read all the answers in this issue and conducted the relevant experiments. All my experiments are carried out on 128×128 images, because I trained the model using celeba-hq.yaml, whose resolution is 128×128. The experimental configuration is as follows:
I have one difference from the above: for the FID calculation I use --img_size=128. All real images are 128×128, and all translated images are also 128×128. My experimental results are as follows:
realism: L: 22.63, R: 21.17, G: 1.46
disentanglement: L: 72.46, R: 71.63, G: 0.83
compared to the paper's results:
Realism: L: 21.37, R: 21.49, G: 0.12
Disentanglement: L: 71.85, R: 71.48, G: 0.37
L is a little larger than the paper's, and G is too large. Can you give some suggestions? Looking forward to your reply~
I think the difference between these two results is acceptable if you only calculate once. You can try:
@zhushuqi2333
Thank you for your reply~ The results above are already averages. I will try several different seeds and use different checkpoints.
First of all, congratulations on the results of the research, and thank you for the concise and understandable code implementation.
But I still encountered some problems when trying to reproduce the quantitative experiment results in the paper. I did as follows:
Realism:
Disentanglement:
Then I got these FID results:
L: 25.05, R: 25.21, G: 0.16 in the "Realism" experiment
L: 85.75, R: 84.45, G: 1.30 in the "Disentanglement" experiment
while in the paper:
L: 21.37, R: 21.49, G: 0.12 in the "Realism" experiment
L: 71.85, R: 71.48, G: 0.37 in the "Disentanglement" experiment
Although there are random factors in many places in the experiment, and it is normal for the FID results to fluctuate somewhat, these results are far too bad.
I think there must be something wrong with my data processing or my training procedure. So could you please explain in detail the data used in the quantitative experiments and how it is processed? If possible, could you please also release the model trained with the paper's experiment config?