sczhou / CodeFormer

[NeurIPS 2022] Towards Robust Blind Face Restoration with Codebook Lookup Transformer

metrics #43

Open · 123456789-qwer opened this issue 1 year ago

123456789-qwer commented 1 year ago

In your paper, you wrote: "For the evaluation on real-world datasets without ground truth, we employ the widely-used non-reference perceptual metrics: FID and NIQE." Is FID really a non-reference metric? I have some trouble understanding this sentence. Thanks!

sczhou commented 1 year ago

Yes, FID is considered a non-reference metric for image quality assessment since NO paired ground truth (the corresponding HQ version) is needed for this metric. The reference dataset just stands for the distribution of high-quality faces.

There is a collection of IQA metrics in this package; please check it out: https://github.com/chaofengc/IQA-PyTorch
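
For example, both NIQE and FID can be computed with pyiqa in a few lines. This is only a minimal sketch with placeholder paths, and the exact call conventions may vary slightly between pyiqa versions:

import pyiqa

# NIQE is a single-image no-reference metric.
niqe_metric = pyiqa.create_metric('niqe')
niqe_score = niqe_metric('./results/restored_0001.png')  # placeholder image path

# FID compares a folder of restored outputs against a reference folder
# (e.g. FFHQ); no paired ground truth is needed.
fid_metric = pyiqa.create_metric('fid')
fid_score = fid_metric('./results/', './ffhq/')  # placeholder folders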

chaofengc commented 1 year ago

Thanks for the reference, Zhou. I am the author of the IQA-PyTorch package.

As explained in the official implementation of pytorch-fid:

FID is a measure of similarity between two datasets of images. FID is calculated by computing the Fréchet distance between two Gaussians fitted to feature representations of the Inception network.

Because it only requires a reference dataset (usually FFHQ in face restoration) rather than paired ground truth, we generally regard it as a non-reference metric. It can be implemented in just a few lines of code, as below:

import numpy as np
from scipy import linalg


def frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
    """
    Numpy implementation of the Frechet Distance.
    The Frechet distance between two multivariate Gaussians X_1 ~ N(mu_1, C_1)
    and X_2 ~ N(mu_2, C_2) is
            d^2 = ||mu_1 - mu_2||^2 + Tr(C_1 + C_2 - 2*sqrt(C_1*C_2)).
    Stable version by Danica J. Sutherland.
    Params:
        mu1   : Numpy array containing the activations of a layer of the
                inception net (like returned by the function 'get_predictions')
                for generated samples.
        mu2   : The sample mean over activations, precalculated on a
                representative data set.
        sigma1: The covariance matrix over activations for generated samples.
        sigma2: The covariance matrix over activations, precalculated on a
                representative data set.
    Returns:
        The Frechet Distance.
    """
    mu1 = np.atleast_1d(mu1)
    mu2 = np.atleast_1d(mu2)
    sigma1 = np.atleast_2d(sigma1)
    sigma2 = np.atleast_2d(sigma2)

    assert mu1.shape == mu2.shape, \
        'Training and test mean vectors have different lengths'
    assert sigma1.shape == sigma2.shape, \
        'Training and test covariances have different dimensions'

    diff = mu1 - mu2

    # Product might be almost singular
    covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
    if not np.isfinite(covmean).all():
        msg = ('fid calculation produces singular product; '
               'adding %s to diagonal of cov estimates') % eps
        print(msg)
        offset = np.eye(sigma1.shape[0]) * eps
        covmean = linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset))

    # Numerical error might give slight imaginary component
    if np.iscomplexobj(covmean):
        if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
            m = np.max(np.abs(covmean.imag))
            raise ValueError('Imaginary component {}'.format(m))
        covmean = covmean.real

    tr_covmean = np.trace(covmean)

    return (diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * tr_covmean)
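
A minimal usage sketch (my own illustration, not part of pytorch-fid): assuming acts_restored and acts_reference are (N, 2048) NumPy arrays of Inception-v3 activations extracted from the restored outputs and the reference dataset, the Gaussian statistics can be fitted and compared like this:

# Hypothetical activation arrays of shape (N, 2048).
mu1, sigma1 = acts_restored.mean(axis=0), np.cov(acts_restored, rowvar=False)
mu2, sigma2 = acts_reference.mean(axis=0), np.cov(acts_reference, rowvar=False)
fid_value = frechet_distance(mu1, sigma1, mu2, sigma2)
print(f'FID: {fid_value:.4f}')
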
sczhou commented 1 year ago

Thanks @chaofengc for your exhaustive explanation 👍

tzm-tora commented 1 year ago

Hi, I am curious about which datasets you used in the FID calculation. For example, for CelebA-Test, two folders are required: one is the model's output, so what is the other (reference) folder? Do you use the whole FFHQ set (70k images) as the reference folder? And what about LFW-Test and WebPhoto-Test? Also, I found that the FID results in your paper differ from those reported in some prior works, such as VQFR and RestoreFormer; could you give me a hint about what caused the differences?
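
To make the question concrete, I assume the computation looks roughly like the following, using the pytorch-fid package quoted above (the folder paths and the choice of the full FFHQ set as reference are only my guesses, and the exact function signature may differ across pytorch-fid versions). Is this the setup you used?

from pytorch_fid.fid_score import calculate_fid_given_paths

# Guessed setup: restored CelebA-Test outputs vs. the whole FFHQ set
# used as the reference distribution.
fid = calculate_fid_given_paths(
    ['./results/celeba_test/', './datasets/ffhq/'],  # [restored, reference]
    batch_size=50,
    device='cuda',
    dims=2048,
)
print(fid)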