Closed Ghaleb-alnakhlani closed 2 years ago
After running Clean-fid in legacy mode again, I got the same results as with pytorch_fid.
So now the comparison is only between the piq results and the other two (Clean-fid legacy and pytorch_fid). There is a huge gap between them.
Hi! Thanks for raising the issue. Have you checked whether the error is in the FID computation step or in the feature extraction step?
We had similar discussions some time ago (one, two) and verified that our results are close to pytorch_fid.
# !pip install pytorch-fid
import numpy as np
import piq
import torch

# Reference implementation from github.com/mseitzer/pytorch-fid
from pytorch_fid.fid_score import calculate_frechet_distance

# Two sets of random "features" drawn from the same distribution
dist1_np = np.random.normal(150, 8.0, size=(100000, 500))
dist2_np = np.random.normal(150, 8.0, size=(100000, 500))

# pytorch_fid: compute mean and covariance by hand, then the Frechet distance
dist1_np_mu = np.mean(dist1_np, axis=0)
dist1_np_sigma = np.cov(dist1_np, rowvar=False)
dist2_np_mu = np.mean(dist2_np, axis=0)
dist2_np_sigma = np.cov(dist2_np, rowvar=False)
mseitzer_output = calculate_frechet_distance(dist1_np_mu, dist1_np_sigma, dist2_np_mu, dist2_np_sigma)
print(f'{mseitzer_output:0.4f}')

# piq: pass the same features directly to the metric
dist1_pt = torch.tensor(dist1_np)
dist2_pt = torch.tensor(dist2_np)
piq_output = piq.FID()(dist1_pt, dist2_pt)
print(piq_output)

# Output:
# 81.0782
# tensor(81.0783, dtype=torch.float64)
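For readers who want to see what the "FID computation step" amounts to, here is a minimal NumPy-only sketch of the Fréchet distance between two Gaussians. It is not the code used by piq or pytorch_fid; the function name frechet_distance and the small feature sizes are assumptions chosen for illustration. It relies on the fact that the trace of the matrix square root of sigma1 @ sigma2 equals the sum of the square roots of that product's eigenvalues.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Frechet distance between N(mu1, sigma1) and N(mu2, sigma2).

    Hypothetical helper for illustration; not the piq or pytorch_fid code.
    """
    diff = mu1 - mu2
    # tr(sqrt(sigma1 @ sigma2)) is the sum of square roots of the eigenvalues
    # of sigma1 @ sigma2, which are real and non-negative in exact arithmetic.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    # Drop tiny imaginary or negative parts caused by floating-point noise.
    tr_covmean = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_covmean)


# Small random "features" so the example runs quickly.
rng = np.random.default_rng(0)
feats1 = rng.normal(150, 8.0, size=(5000, 16))
feats2 = rng.normal(150, 8.0, size=(5000, 16))

mu1, sigma1 = feats1.mean(axis=0), np.cov(feats1, rowvar=False)
mu2, sigma2 = feats2.mean(axis=0), np.cov(feats2, rowvar=False)

same = frechet_distance(mu1, sigma1, mu1, sigma1)           # identical statistics
close = frechet_distance(mu1, sigma1, mu2, sigma2)          # same distribution, different samples
shifted = frechet_distance(mu1 + 1.0, sigma1, mu2, sigma2)  # mean shifted by 1 in every dimension

print(same, close, shifted)
```

Since this step only needs means and covariances, two libraries that receive the same features should agree here up to numerical error, which is what the comparison above shows. A large gap on real images therefore points at the features, not at this formula.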
Hi, thank you for your response. Actually, I now have doubts about my implementation; I might be doing the whole evaluation the wrong way. First, when I ran the code for pytorch-fid, the only thing mentioned in the repo was that you need to pass the paths to the real and fake images and specify the batch size, similar to this: python -m pytorch_fid path/to/dataset1 path/to/dataset2
You can verify that by going to the repository, and that is exactly what I did. I don't know whether it is correct or not.
Second, I want to use piq for all the metrics. I have many validations to run and I see that piq has most of them. However, the tool lacks a clear explanation of how to evaluate given the paths to the real and generated images. To clarify, I am running the evaluation after training, so I have to provide the real images path and the generated images path.
So I would really appreciate it if you could provide a complete, clear example of how to run the feature extraction step plus the computation of FID, KID and others like LPIPS, given the paths to the images.
You have no idea how helpful this would be.
@Ghaleb-alnakhlani please see the results_benchmark.py script. It has everything you need and lets you benchmark all metrics against a dataset with human labels. You may need to delete some of the code and slightly modify the Dataset class for this to work.
@Ghaleb-alnakhlani one suggestion on top of what @zakajd just mentioned is to verify what pre-processing is done in the other code sources (libraries, frameworks) that you are using. For instance, a different normalization of the input data may heavily influence the results. Other than that, I would suggest checking the other discussions and the benchmark script that were referenced by @zakajd.
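To make the normalization point concrete, here is a small NumPy sketch of how the same uint8 images end up as very different arrays under three common conventions. The helper names to_unit_range and in_unit_range are hypothetical, and the [0, 1] target range is an assumption for illustration; which range a given library's feature extractor expects should be confirmed in that library's documentation.

```python
import numpy as np


def to_unit_range(images_uint8):
    """Convert uint8 images in [0, 255] to float32 in [0, 1] (hypothetical helper)."""
    return images_uint8.astype(np.float32) / 255.0


def in_unit_range(batch):
    """Return True if every value of the batch lies in [0, 1] (hypothetical helper)."""
    return bool(batch.min() >= 0.0 and batch.max() <= 1.0)


# Fake batch of 4 RGB images, 8x8 pixels, channels last.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)

unit = to_unit_range(raw)        # values in [0, 1]
signed = unit * 2.0 - 1.0        # values in [-1, 1]

# ImageNet-style per-channel standardization, another common convention.
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
standardized = (unit - mean) / std

for name, batch in [("raw", raw.astype(np.float32)), ("unit", unit),
                    ("signed", signed), ("standardized", standardized)]:
    print(f"{name:>12}: min={batch.min():.3f} max={batch.max():.3f} "
          f"in [0, 1]: {in_unit_range(batch)}")
```

A check like in_unit_range on one batch from each pipeline, right before the feature extractor, is a cheap way to see whether two tools are actually being fed the same inputs. Resizing and interpolation can differ as well and are not covered by this sketch.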
@zakajd thank you for the script. I am trying to understand how to adapt the code to my use case. Note that the dataset I am using now is not labeled, so I am trying to work out how to change the benchmark script to work with it.
@zakajd I thought that most FID implementations use the same structure: first, they use InceptionV3, which is the case for all three sources mentioned above. About the preprocessing I am not quite sure; to be honest, I haven't looked into that.
But the question remains why the pytorch-fid docs never mention the preprocessing step. If you go to their repo you will find a "how to use" snippet, and it is only one line. I would assume the same applies to Clean-Fid.
So does that mean the results will be wrong without a preprocessing step? Because the difference is clearly huge.
@Ghaleb-alnakhlani data formatting and pre-processing are the only things that can differ and hence cause the discrepancy that you observe. As mentioned by @zakajd, we have already been asked that question a couple of times, and in both cases we found that the FID computation we provide is correct but that pre-processing done in a different way may cause different results.
With that, I do not see any point in further discussion now. If you find an actual bug in the piq code, feel free to re-open the issue.
Hi, I have run the evaluation test with this code
And this is the result that I got, which is reasonable from my observation.
After that I tested pytorch-fid from this repo.
And this is the result
FID: 6.98900379066589
The third method that I tested is from this repo, clean-fid. I used the legacy mode, which is equivalent to the above pytorch-fid.
And this is the result that I got
clean-fid score is 7.205
So the question is: which one of these results is correct? And why do I get different results when testing the same dataset and using the same model, InceptionV3?