ljjcoder opened this issue 5 years ago
@ljjcoder This is the right script to test FID. However, since it uses the Inception model, make sure that you run the script on the GPU using the --gpu flag:
python ./scripts/fid_score.py --path [path_to_ground_truth] [path_to_output] --gpu 1
The warning message you are receiving is because the pre-trained Inception model was trained with PyTorch < 1.0.0; it can be ignored for now!
Measuring FID is a very CPU-intensive process and it requires a lot of RAM, and you normally need 10,000+ images to get an acceptable result! In our experiments, on a Titan V GPU and an 8-core Intel Xeon, it takes more than 2 minutes to calculate and uses almost 25 GB of RAM.
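For reference, here is a minimal sketch of where that RAM goes: the score is computed from the mean and covariance of Inception features over the whole image set, so the full (N, 2048) feature matrix has to be accumulated in host memory. The `model` below is assumed to be a feature extractor returning pooled 2048-d activations; the names are illustrative, not this script's exact API.

```python
import numpy as np
import torch

def activation_statistics(images, model, batch_size=50, device="cuda"):
    """Mean and covariance of Inception features for a set of images.

    images: float tensor of shape (N, 3, H, W), values in [0, 1].
    model:  assumed to map a batch to pooled (B, 2048) features.
    Batching bounds GPU memory; the accumulated (N, 2048) feature
    matrix is what dominates host RAM when N is large.
    """
    model.eval()
    features = []
    with torch.no_grad():
        for i in range(0, images.size(0), batch_size):
            batch = images[i:i + batch_size].to(device)
            features.append(model(batch).cpu().numpy())
    features = np.concatenate(features, axis=0)
    mu = np.mean(features, axis=0)
    sigma = np.cov(features, rowvar=False)
    return mu, sigma
```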
@knazeri, thanks for your reply! When I test FID on Places2, the score is reasonable (results that look better to a human get lower FID scores). But when I test it on CelebA, the score does not seem to reflect the quality of the generated images (results that look better to a human get higher FID scores). Maybe the Inception model trained on ImageNet is not suitable for evaluating face results?
@ljjcoder The inception model is only used to extract deep features from input images. The Frechet distance measures the distance between two multivariate normals, that means the input distribution has to be diverse (large) enough to be considered normal! We used 10,000 images to evaluate FID on CelebA and Figure 13 in our paper shows the efficacy of FID! How many images did you test with? What are the FID values you are receiving?
@knazeri, on CelebA the mask hole is a 128x128 square and the image is 256x256. I tested 1,000 images and got an FID score of 14.24 using the CA model. But another model, which produces inpainted results that are better than CA's, gets a higher score of 14.86. Here are some CA results: (images attached) And here are some results from the other model: (images attached)
@ljjcoder First off, 1,000 images are not enough to capture an entire distribution. I've seen in papers that the FID is reported for more than 10,000 (sometimes 25,000) images! Second, please note that FID takes an entire distribution into account and measures the distance between the mean and covariance of two distributions. That means even though some images might not be visually pleasing, the overall quality (of the entire set) might be good! Having said that, none of these quantitative measures (FID included) are perfect! Still, the human study remains the best qualitative measure to evaluate generative models!
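For concreteness, the distance in question is FID = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2*(sigma1·sigma2)^(1/2)), where (mu1, sigma1) and (mu2, sigma2) are the mean and covariance of Inception features over the real and generated sets. A minimal sketch of that formula (mirroring the common NumPy/SciPy implementation, not necessarily this repository's exact code):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Matrix square root of the covariance product.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if not np.isfinite(covmean).all():
        # Near-singular product: add a small ridge for numerical stability.
        offset = np.eye(sigma1.shape[0]) * eps
        covmean = linalg.sqrtm((sigma1 + offset) @ (sigma2 + offset))
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # tiny imaginary parts are numerical noise
    return diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2 * np.trace(covmean)
```

Note that sigma here is a 2048x2048 covariance estimated from the sample, so a small set (like 1,000 images) gives a noisy estimate; that is part of why 10,000+ images are recommended.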
Hello, when calculating FID, do you compare the 10,000 result images with 10,000 ground-truth images, or compare the 10,000 result images with all of the ground-truth images?
When I run the command

python ./scripts/fid_score.py --path /userhome/inpaint_bord/data/places2_gt_1000/ /userhome/edge-connect-master/checkpoints/results/

(where /userhome/inpaint_bord/data/places2_gt_1000/ contains 1,000 real images and /userhome/edge-connect-master/checkpoints/results/ contains 1,000 inpainted images), the process seizes up and stops. The log looks like this:

```
calculate path1 statistics...
calculate path2 statistics...
./scripts/fid_score.py:86: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
  batch = Variable(batch, volatile=True)
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:2351: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:2423: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
```

Is this the correct way to test FID? And if I want to test FID on CelebA faces, should I also use the Inception model trained on ImageNet, or retrain the model on CelebA faces?
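For anyone seeing the same warnings: they are harmless, but the two deprecated calls named in the log have direct modern replacements. A minimal sketch (the tensor here is just a stand-in for the script's image batch):

```python
import torch
import torch.nn.functional as F

batch = torch.rand(2, 3, 256, 256)  # stand-in for an image batch

# Old style (pre-0.4 PyTorch), as flagged in the log:
#   batch = Variable(batch, volatile=True)
#   batch = F.upsample(batch, size=(299, 299), mode='bilinear')

# Modern equivalents:
with torch.no_grad():  # replaces volatile=True
    batch = F.interpolate(batch, size=(299, 299), mode='bilinear',
                          align_corners=False)  # replaces F.upsample
```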