photosynthesis-team / piq

Measures and metrics for image2image tasks. PyTorch.

Results are different from other FID sources #320

Closed. Ghaleb-alnakhlani closed this issue 2 years ago.

Ghaleb-alnakhlani commented 2 years ago

Hi, I ran the evaluation with this code:

import os
import piq
import torch
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from piq.feature_extractors import InceptionV3
from PIL import Image

def getFilePaths(path):
    # read a folder, return the complete path
    ret = []
    for root, dirs, files in os.walk(path):
        for filespath in files:
            ret.append(os.path.join(root, filespath))
    ret.sort()
    return ret

class DataProcess(torch.utils.data.Dataset):
    def __init__(self, s_src, img_w, img_h):
        super(DataProcess, self).__init__()
        self.img_w = img_w
        self.img_h = img_h

        self.img_transform = transforms.Compose([
            transforms.Resize((self.img_h, self.img_w)),
            transforms.ToTensor(),  # transform to [0, 1]
        ])

        self.f_srcs = getFilePaths(s_src)

    def __getitem__(self, index):
        src = Image.open(self.f_srcs[index])
        t_src = self.img_transform(src.convert('RGB'))

        return {
            'images': t_src,
        }

    def __len__(self):
        return len(self.f_srcs)

if __name__ == '__main__':
    s_1 = './datasets/name/train_B'
    s_2 = './results/name/synthesized'

    set_1 = DataProcess(s_1, 512, 512)
    set_2 = DataProcess(s_2, 512, 512)

    loader_1 = DataLoader(set_1, batch_size=1, shuffle=False)
    loader_2 = DataLoader(set_2, batch_size=1, shuffle=False)

    # Extract InceptionV3 features for both folders, then compute FID between them
    fid_metric = piq.FID()
    model = InceptionV3()
    feat_1 = fid_metric.compute_feats(loader_1, model)
    feat_2 = fid_metric.compute_feats(loader_2, model)
    fid = fid_metric.compute_metric(feat_1, feat_2)
    print(f'====> fid: {fid}')

This is the result I got, which seems reasonable based on my observations.

fid: 19.67474915673438

After that, I tested pytorch-fid from this repo.

!python -m pytorch_fid /content/drive/MyDrive/pix2pixHD/datasets/name/train_B/ /content/drive/MyDrive/pix2pixHD/results/name/synthesized/ --batch-size 1

And this is the result: FID: 6.98900379066589

The third method I tested is clean-fid from this repo; I used legacy mode, which is supposed to be equivalent to pytorch-fid above.

from cleanfid import fid
score_clean = fid.compute_fid("/content/drive/MyDrive/pix2pixHD/datasets/name/train_B/", "/content/drive/MyDrive/pix2pixHD/results/name/synthesized/", mode="legacy_pytorch", batch_size=1, num_workers=0)
print(f"clean-fid score is {score_clean:.3f}")

And this is the result I got: clean-fid score is 7.205

So the question is: which one of these results is correct? And why do I get different results when testing the same dataset with the same InceptionV3 model?

Ghaleb-alnakhlani commented 2 years ago

After running clean-fid in legacy mode again, I got the same result as pytorch_fid. So now the comparison is only between the piq result and the other two (clean-fid legacy and pytorch_fid). There is a huge gap between them.

zakajd commented 2 years ago

Hi! Thanks for raising the issue. Have you checked whether the discrepancy comes from the FID computation step or from the feature extraction step? We had similar discussions some time ago (one, two) and verified that our results are close to pytorch_fid.

# !pip install pytorch-fid

import piq
import torch
import numpy as np

# Code from github.com/mseitzer/pytorch-fid
from pytorch_fid.fid_score import calculate_frechet_distance

dist1_np = np.random.normal(150, 8.0, size=(100000, 500))
dist2_np = np.random.normal(150, 8.0, size=(100000, 500))

dist1_np_mu = np.mean(dist1_np, axis=0)
dist1_np_sigma = np.cov(dist1_np, rowvar=False)

dist2_np_mu = np.mean(dist2_np, axis=0)
dist2_np_sigma = np.cov(dist2_np, rowvar=False)

mseitzer_output = calculate_frechet_distance(dist1_np_mu, dist1_np_sigma, dist2_np_mu, dist2_np_sigma)
print(f'{mseitzer_output:0.4f}')

dist1_pt = torch.tensor(dist1_np)
dist2_pt = torch.tensor(dist2_np)
piq_output = piq.FID()(dist1_pt, dist2_pt)
print(piq_output)

>>> 81.0782
>>> tensor(81.0783, dtype=torch.float64)
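
To sanity-check the feature extraction step in the same way, one option is to run both feature extractors on an identical batch and compare the pooled activations. The snippet below is a rough sketch, not something from the piq docs: it assumes both piq's InceptionV3 and pytorch_fid's InceptionV3 accept images in [0, 1] and return a list whose first element is the pooled 2048-dimensional feature map.

import torch

# piq's extractor (the one passed to fid_metric.compute_feats above)
from piq.feature_extractors import InceptionV3 as PiqInception
# mseitzer's extractor shipped with pytorch-fid
from pytorch_fid.inception import InceptionV3 as FidInception

x = torch.rand(4, 3, 299, 299)  # dummy batch of images in [0, 1]

piq_model = PiqInception().eval()
fid_model = FidInception().eval()  # defaults to the 2048-d pool3 block

with torch.no_grad():
    piq_feats = piq_model(x)[0].squeeze()  # expected shape: (4, 2048)
    fid_feats = fid_model(x)[0].squeeze()  # expected shape: (4, 2048)

# If the extractors and their default pre-processing match, this should be ~0
print((piq_feats - fid_feats).abs().max())
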

Ghaleb-alnakhlani commented 2 years ago

Hi, thank you for your response. Actually, I now have doubts about my implementation; I might be doing the whole evaluation in the wrong way. First, when I ran pytorch-fid, the only thing mentioned in its repo was that you need to pass the paths to the real and fake images and specify the batch size, something like python -m pytorch_fid path/to/dataset1 path/to/dataset2, and you can verify that by going to the repository; that is exactly what I did, so I don't know whether it is correct or not. Second, I want to use piq for all the metrics: I have many validations to run and piq covers most of them, but the tool lacks a clear explanation of how to evaluate by simply giving the paths to the real and generated images. To clarify, I am running the evaluation after training, so I have to provide the path to the real images and the path to the generated images. I would really appreciate a complete, clear example of how to run the feature extraction step plus the computation of FID, KID and others like LPIPS, given the paths to the images. You have no idea how helpful this would be.

zakajd commented 2 years ago

@Ghaleb-alnakhlani please see the results_benchmark.py script. It has everything you need and allows you to benchmark all metrics against a dataset with human labels. You may need to delete some of the code and slightly modify the Dataset class for this to work.
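
In the meantime, a minimal folder-to-folder sketch reusing the DataProcess dataset from the first post might look like the following. The paths are placeholders, it assumes KID exposes the same compute_feats / compute_metric interface as FID, and it assumes both folders contain the images in matching order, which a full-reference metric like LPIPS needs.

import torch
import piq
from piq.feature_extractors import InceptionV3

# Same setup as in the first post: each batch is {'images': tensor in [0, 1]}
set_real = DataProcess('./datasets/name/train_B', 512, 512)
set_fake = DataProcess('./results/name/synthesized', 512, 512)
loader_real = torch.utils.data.DataLoader(set_real, batch_size=1, shuffle=False)
loader_fake = torch.utils.data.DataLoader(set_fake, batch_size=1, shuffle=False)

model = InceptionV3()

# Distribution-based metrics (FID, KID) can reuse the same extracted features
fid_metric = piq.FID()
kid_metric = piq.KID()
feat_real = fid_metric.compute_feats(loader_real, model)
feat_fake = fid_metric.compute_feats(loader_fake, model)
print('FID:', fid_metric.compute_metric(feat_real, feat_fake))
print('KID:', kid_metric.compute_metric(feat_real, feat_fake))

# Full-reference metrics compare image pairs, so the loaders must be aligned:
# real image i must correspond to generated image i (shuffle=False, same sorting).
lpips_loss = piq.LPIPS()
scores = []
with torch.no_grad():
    for real, fake in zip(loader_real, loader_fake):
        scores.append(lpips_loss(fake['images'], real['images']))
print('LPIPS (mean):', torch.stack(scores).mean().item())
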

snk4tr commented 2 years ago

@Ghaleb-alnakhlani one suggestion on top of what @zakajd just mentioned: verify what pre-processing is done in the other code sources (libraries, frameworks) you are using. For instance, different normalization of the input data may heavily influence the results. Other than that, I would suggest checking the other discussions and the benchmark script referenced by @zakajd.
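
As a hypothetical illustration of how much pre-processing alone can matter, one could feed the same image through two different pipelines and compare the extracted features. The file path below is a placeholder, and the sketch assumes the extractor resizes its input to 299x299 internally (resize_input=True, the default in both piq's and pytorch-fid's InceptionV3).

import torch
import torchvision.transforms as transforms
from PIL import Image
from piq.feature_extractors import InceptionV3

img = Image.open('./datasets/name/example.png').convert('RGB')  # placeholder path

# Pipeline 1: resize to 512x512 first (as in the DataProcess class above)
x_resized = transforms.Compose([transforms.Resize((512, 512)), transforms.ToTensor()])(img)
# Pipeline 2: keep the native resolution and let the extractor handle resizing
x_native = transforms.ToTensor()(img)

model = InceptionV3()
with torch.no_grad():
    f_resized = model(x_resized.unsqueeze(0))[0].flatten()
    f_native = model(x_native.unsqueeze(0))[0].flatten()

# A non-trivial distance means the resize choice alone shifts the features
# that FID is computed from, even though the underlying image is identical.
print(torch.dist(f_resized, f_native))
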

Ghaleb-alnakhlani commented 2 years ago

@zakajd thank you for the script; I am trying to understand how to adapt the code to my use case. I should mention that the dataset I am using is not labeled, so I am trying to figure out how to make the benchmark script work with it. I thought that most FID implementations use the same structure: all three sources mentioned above use InceptionV3. About the preprocessing, I am honestly not sure; I haven't looked into that. But the question still remains why the pytorch-fid docs never mention a preprocessing step. If you go to their repo, the usage snippet is only one line, and I assumed the same for clean-fid. So does that mean the results are wrong without a preprocessing step? Because the difference is clearly huge.

snk4tr commented 2 years ago

@Ghaleb-alnakhlani data formatting and pre-processing are the only things that can differ here and hence cause the discrepancy you observe. As mentioned by @zakajd, we have already been asked this question a couple of times, and in both cases we found that the FID computation we provide is correct, but pre-processing done in a different way can lead to different results.

With that, I do not see any further point of discussion. If you find an actual bug in the piq code, feel free to re-open the issue.