knazeri / edge-connect

EdgeConnect: Structure Guided Image Inpainting using Edge Prediction, ICCV 2019 https://arxiv.org/abs/1901.00212
http://openaccess.thecvf.com/content_ICCVW_2019/html/AIM/Nazeri_EdgeConnect_Structure_Guided_Image_Inpainting_using_Edge_Prediction_ICCVW_2019_paper.html

How to test FID? #47

Open ljjcoder opened 5 years ago

ljjcoder commented 5 years ago

When I run the command:

python ./scripts/fid_score.py --path /userhome/inpaint_bord/data/places2_gt_1000/ /userhome/edge-connect-master/checkpoints/results/

(where /userhome/inpaint_bord/data/places2_gt_1000/ contains 1,000 real images and /userhome/edge-connect-master/checkpoints/results/ contains 1,000 inpainted images), the process hangs and stops. The log looks like this:

calculate path1 statistics...
calculate path2 statistics...
./scripts/fid_score.py:86: UserWarning: volatile was removed and now has no effect. Use `with torch.no_grad():` instead.
  batch = Variable(batch, volatile=True)
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:2351: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:2423: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.

Is this the correct way to test FID? Also, if I want to test FID on CelebA faces, should I use the Inception model trained on ImageNet, or retrain the model on CelebA faces?

knazeri commented 5 years ago

@ljjcoder This is the right script to test FID. However, since it uses the Inception model, make sure you run the script on a GPU using the --gpu flag:

python ./scripts/fid_score.py --path [path_to_ground_truth] [path_to_output] --gpu 1

The warning messages you are receiving appear because the pre-trained Inception model was trained with PyTorch < 1.0.0; they can be ignored for now!

Measuring FID is a very CPU-intensive process that requires a lot of RAM, and you normally need 10,000+ images to get an acceptable result! In our experiments, on a Titan V GPU and an 8-core Intel Xeon, it takes more than 2 minutes to calculate and consumes almost 25 GB of RAM.

ljjcoder commented 5 years ago

@knazeri Thanks for your reply! When I test FID on Places2, the score is reasonable (results that look better to a human get lower FID scores). But when I test it on CelebA, the score does not seem to reflect the quality of the generated images (results that look better to a human get higher FID scores). Maybe the Inception model trained on ImageNet is not suitable for evaluating face results?

knazeri commented 5 years ago

@ljjcoder The Inception model is only used to extract deep features from the input images. The Fréchet distance measures the distance between two multivariate normal distributions, which means the input distribution has to be diverse (large) enough to be reasonably modeled as normal! We used 10,000 images to evaluate FID on CelebA, and Figure 13 in our paper shows the efficacy of FID. How many images did you test with? What FID values are you getting?
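For context (this is the standard definition, not a quote from the paper): writing mu_r, Sigma_r for the mean and covariance of the Inception features of the real images and mu_g, Sigma_g for those of the generated images, the Fréchet distance between the two fitted Gaussians is:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Estimating Sigma_r and Sigma_g reliably is exactly why a large, diverse sample set is needed.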

ljjcoder commented 5 years ago

@knazeri On CelebA, the mask hole is a 128x128 square and the image is 256x256. I tested 1,000 images and got an FID score of 14.24 using the CA model. But another model, which produces inpainted results that are better than CA's, gets a higher score of 14.86. Here are some CA results: 162771 162777. Here are the other model's results: 162771 162777

knazeri commented 5 years ago

@ljjcoder First off, 1,000 images are not enough to capture an entire distribution. I've seen papers report FID over more than 10,000 (sometimes 25,000) images! Second, please note that FID takes an entire distribution into account and measures the distance between the means and covariances of two distributions. That means that even if some individual images are not visually pleasing, the overall quality (of the entire set) might still score well! Having said that, none of these quantitative measures (FID included) are perfect! Human studies remain the best qualitative way to evaluate generative models!
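To make the "mean and covariance" point concrete, here is a simplified sketch (my own, not the repo's scripts/fid_score.py) of what the FID computation does once Inception activations have been extracted. It uses NumPy only; real implementations typically compute the matrix square root with scipy.linalg.sqrtm instead of the eigenvalue shortcut below.

```python
import numpy as np

def frechet_distance(act1, act2):
    """FID between two sets of feature activations, each of shape (N, D)."""
    mu1, mu2 = act1.mean(axis=0), act2.mean(axis=0)
    sigma1 = np.cov(act1, rowvar=False)
    sigma2 = np.cov(act2, rowvar=False)
    diff = mu1 - mu2
    # Tr((sigma1 @ sigma2)^(1/2)) via eigenvalues: the product is similar to
    # a PSD matrix, so its eigenvalues are real and non-negative (up to noise).
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_covmean = np.sum(np.sqrt(np.maximum(eigvals.real, 0.0)))
    return diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_covmean

rng = np.random.default_rng(0)
real = rng.normal(size=(5000, 16))            # stand-in for real-image features
fake = rng.normal(loc=0.5, size=(5000, 16))   # same spread, shifted mean
print(frechet_distance(real, real))  # identical sets -> ~0
print(frechet_distance(real, fake))  # shifted distribution -> clearly larger
```

Note how the score depends only on the two fitted Gaussians: a handful of bad samples barely moves the mean and covariance of a large set, which is why per-image visual quality and FID can disagree on small samples.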

codinglin commented 3 years ago

Hello, when calculating FID, do you compare the 10,000 result pictures with the corresponding 10,000 GT pictures, or with all of the GT pictures?