royorel / StyleSDF


Evaluation metrics detail #26

Open · yuefeng21 opened 2 years ago

yuefeng21 commented 2 years ago

Hi Roy, I used https://github.com/abdulfatir/gan-metrics-pytorch, as suggested in a previous issue, to calculate FID and KID, but I cannot reproduce the evaluation numbers from the paper. I sampled 5k images at 512×512 resolution from both FFHQ and your model, and I got FID 214.33 (0.988) and KID 0.233 (0.003), whereas the paper reports FID 11.5 and KID 2.65. Could you explain exactly how you computed the FID and KID for your results?
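
For reference, my understanding is that FID is the Fréchet distance between Gaussians fitted to the Inception activations of the real and fake sets. A minimal sketch of that computation (not the library's exact code; the toy features below are random stand-ins):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    # Frechet distance between two Gaussians fitted to Inception features:
    # ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*sqrt(S1 @ S2))
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):  # sqrtm may return tiny imaginary parts
        covmean = covmean.real
    return diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean)

# Toy usage with random stand-ins for Inception pool3 activations.
real, fake = np.random.randn(5000, 64), np.random.randn(5000, 64)
fid = frechet_distance(real.mean(0), np.cov(real, rowvar=False),
                       fake.mean(0), np.cov(fake, rowvar=False))
```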

Appreciate your help.

royorel commented 2 years ago

Hi @yuefeng21,

Here's what you need to do to replicate the results:

  1. Generate the images without any mean truncation.
  2. Generate the images at 1024x1024 (or 512x512 for AFHQ) and downsample the outputs to 256x256.
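
For step 2, a minimal downsampling sketch (the generator call itself is omitted; this just shows the resizing, assuming outputs in [-1, 1]):

```python
import torch
import torch.nn.functional as F

# Stand-in for a batch of generator outputs: RGB at 1024x1024 in [-1, 1].
imgs = torch.empty(4, 3, 1024, 1024).uniform_(-1, 1)

# Downsample to the 256x256 evaluation resolution.
imgs_256 = F.interpolate(imgs, size=(256, 256), mode='bilinear', align_corners=False)
print(imgs_256.shape)  # torch.Size([4, 3, 256, 256])
```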

Let me know if there are any further issues.

yuefeng21 commented 2 years ago

You mean by setting truncation_ratio to 1 and mean_latent to None?

yuefeng21 commented 2 years ago

I also have a question: why do people evaluate results at 256 or 512 resolution rather than at 1024?

royorel commented 2 years ago

If I recall correctly, setting the truncation ratio to 1 should be enough, but double check me on this.
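
If it helps, the usual StyleGAN-style truncation is the formula below (a sketch of the standard convention, not necessarily the exact code here), which makes the mean latent irrelevant at ratio 1:

```python
import torch

def truncate(w, mean_latent, truncation_ratio):
    # Pull each latent toward the mean; with ratio == 1 the input is
    # returned unchanged, so mean_latent has no effect.
    return mean_latent + truncation_ratio * (w - mean_latent)

w = torch.randn(4, 512)
mean_w = torch.zeros(512)
assert torch.allclose(truncate(w, mean_w, 1.0), w)
```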

Regarding 256x256, I'm not sure, to be honest. We evaluated at 256x256 to make the comparison fair and to repeat the same conditions as the baseline methods (as well as to validate their reported FID scores). It's also a runtime issue: running this on 20-50k 1024x1024 images can take a lot of time.

yuefeng21 commented 2 years ago

I set truncation_ratio to 1 and sampled 5k images; the FID I got is 173.72 (0.727).

royorel commented 2 years ago

What resolution are you using, both for the real and the generated images?

yuefeng21 commented 2 years ago

Real images: I downsampled them to 256 resolution. Generated images: first generated at 1024 resolution and then downsampled to 256.

yuefeng21 commented 2 years ago

It seems that I saved the images in the [-1, 1] range. After I scaled them to [0, 255], the FID I got is 4139.23 (41.315).

yuefeng21 commented 2 years ago

But none of those results seems anywhere near the 11.5 in the paper 🤣

royorel commented 2 years ago

The images should be in the [-1, 1] range. What is your current KID score for this setup? (KID is more robust to small sample sizes.)

In addition, how many real images did you use?
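
For context, KID is the squared MMD between Inception features under a cubic polynomial kernel, computed with an unbiased estimator, which is why it degrades more gracefully at small sample sizes. A minimal sketch of that estimator (real implementations average it over many random subsets):

```python
import numpy as np

def kid_mmd2(real, fake):
    # Unbiased MMD^2 with the KID kernel k(x, y) = (x.y/d + 1)^3.
    d = real.shape[1]
    k = lambda a, b: (a @ b.T / d + 1) ** 3
    krr, kff, krf = k(real, real), k(fake, fake), k(real, fake)
    m, n = len(real), len(fake)
    return ((krr.sum() - np.trace(krr)) / (m * (m - 1))
            + (kff.sum() - np.trace(kff)) / (n * (n - 1))
            - 2 * krf.mean())

# Toy usage with random stand-ins for Inception features.
print(kid_mmd2(np.random.randn(100, 64), np.random.randn(100, 64)))
```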

royorel commented 2 years ago

Actually, it looks like the images need to be in [0, 1]. I just looked at the FID code, and there's a normalization step (enabled by default) in the Inception net's forward pass.

(screenshot of the normalization step in the Inception forward pass)
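
The step in question is essentially this (paraphrased from a pytorch-fid-style Inception wrapper; check your local copy of the code):

```python
import torch

def normalize_input(x):
    # Maps images from [0, 1] to the [-1, 1] range the Inception net expects;
    # this runs inside the forward pass when normalize_input=True (the default).
    return 2 * x - 1

x = torch.rand(1, 3, 256, 256)  # images in [0, 1]
y = normalize_input(x)
assert y.min() >= -1 and y.max() <= 1
```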

yuefeng21 commented 2 years ago

The current KID is 0.161 (0.003) at 256 resolution with no mean latent, using 5k images.

yuefeng21 commented 2 years ago

OK, I will double-check that.

royorel commented 2 years ago

We saved all output images in a single large .npy file, so the line you marked does not apply in our case (for both FID and KID). In addition, if you save the images as JPEG, compression artifacts might also affect the scores.

Edit: going through the FID code again, the .npy array should indeed be in [-1, 1]. The FID code first transforms it to [0, 1] and then back to [-1, 1].
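
In other words, the round trip looks like this (a sketch, with a synthetic array standing in for the saved file):

```python
import numpy as np

# Stand-in for the saved .npy array of generated images in [-1, 1].
imgs = np.random.uniform(-1, 1, size=(8, 3, 256, 256)).astype(np.float32)

imgs_01 = (imgs + 1) / 2    # the loader rescales [-1, 1] -> [0, 1]
imgs_pm1 = 2 * imgs_01 - 1  # the Inception net rescales [0, 1] -> [-1, 1]
assert np.allclose(imgs, imgs_pm1, atol=1e-6)
```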

yuefeng21 commented 2 years ago

I also saved the results in a large .npy file as instructed, and the original range was [-1, 1].
Setting aside FID, which is sensitive to dataset size: at 256 resolution, with both real and generated images in the [-1, 1] range, truncation_ratio = 1, and 5k images, the KID I got is 0.166 (0.002), while the paper reports 2.65. How does this result relate to that 2.65?

royorel commented 2 years ago

You need to multiply the KID result by 1000.

yuefeng21 commented 2 years ago

The one inside the brackets (0.002)?

royorel commented 2 years ago

No, that's the variance. You need to multiply the mean.
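
For instance (illustrative numbers; the paper reports KID × 1000):

```python
kid_mean, kid_var = 0.00265, 0.000003  # hypothetical raw output "0.00265 (0.000003)"
print(f'KID (x1e3): {kid_mean * 1000:.2f}')  # -> KID (x1e3): 2.65
```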

How many real images do you use?