Hi @ttaa9 , thanks for your questions! The training splits follow the splits available for each dataset. For example, CIFAR-10 has train and test splits, and similar to many existing works, I used only the train split. For STL-10, I used the unlabeled split from the dataset, which is also the most commonly used one. As for the scores, they were obtained from a single training run; no retraining of the model to find the best one was done.
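In case it helps, a minimal sketch of those split choices using plain torchvision loaders (the root path is a placeholder and the transforms used for training are omitted):

```python
import torchvision

# CIFAR-10: only the train split is used (train=True), following common practice.
cifar10_train = torchvision.datasets.CIFAR10(root='./datasets', train=True, download=True)

# STL-10: the unlabeled split is used, as is most common for GAN training.
stl10_unlabeled = torchvision.datasets.STL10(root='./datasets', split='unlabeled', download=True)
```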
Hi @kwotsin , thanks so much for the quick reply. It might be useful to put this information about splits somewhere in the README so we know which splits to evaluate on when comparing against your scores.
Also, the reason I asked about multiple runs is that when I run the GANs myself I don't get quite the same scores. E.g., on CelebA (128 x 128), I get FID/KID of 13.08/0.00956 (versus your 12.93/0.0076) when training on the train set and evaluating on the test set. (It's a bit worse, 13.39/0.010, when evaluating on the training set, unsurprisingly.) The FID is quite close, but the KID is a bit off, so I am wondering whether this is simply stochasticity across training runs or a difference in the training settings. Perhaps you could post your Trainer settings/object in addition to the architectures you currently provide?
Hi @ttaa9 , no worries! On the splits, the information is currently listed under the "Baselines" section, which covers all the datasets tested. To clarify, similar to many existing works, the same split was used for both training and evaluation for each dataset. The training settings and architectures are listed on the README page as well, and they are the same ones used for the checkpoint.
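For reference, a training setup in the style of the torch_mimicry README would look roughly like the sketch below; the hyperparameters here (learning rate, betas, n_dis, number of steps, batch size) are illustrative assumptions rather than a guaranteed reproduction of the checkpoint's exact settings:

```python
import torch
import torch.optim as optim
import torch_mimicry as mmc
from torch_mimicry.nets import sngan

device = torch.device('cuda:0' if torch.cuda.is_available() else "cpu")

# Load CelebA 128x128 and build a dataloader (paths and batch size are placeholders).
dataset = mmc.datasets.load_dataset(root='./datasets', name='celeba_128')
dataloader = torch.utils.data.DataLoader(
    dataset, batch_size=64, shuffle=True, num_workers=4)

# SNGAN generator/discriminator at 128x128 resolution.
netG = sngan.SNGANGenerator128().to(device)
netD = sngan.SNGANDiscriminator128().to(device)
optD = optim.Adam(netD.parameters(), 2e-4, betas=(0.0, 0.9))
optG = optim.Adam(netG.parameters(), 2e-4, betas=(0.0, 0.9))

# Train with assumed settings (5 discriminator steps per generator step, 100k steps).
trainer = mmc.training.Trainer(
    netD=netD,
    netG=netG,
    optD=optD,
    optG=optG,
    n_dis=5,
    num_steps=100000,
    lr_decay='linear',
    dataloader=dataloader,
    log_dir='./log/celeba_128_sngan',
    device=device)
trainer.train()
```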
On the CelebA run, I think your obtained FID score looks correct, with the difference quite similar to the error interval (which, as you mentioned, is probably due to the stochasticity across different training runs). For the KID score, could you check whether the JSON file scores contain any anomalous readings? For example, my current JSON file for the KID scores has the following values:
```json
[
    0.007495319259681859,
    0.007711712250735898,
    0.007619357938282523
]
```
I suspect an anomalous reading could affect the KID score significantly. This is not surprising, since I noticed it can happen even for FID -- e.g., at the same checkpoint, generating with a different random seed can sometimes give a few hundred FID points instead of the 20+ points from other readings, although this is very rare. I've re-run the evaluation with the given checkpoint and obtained a similar score: 0.007659641506459136 (± 7.556746387021168e-06). Given that your obtained FID is similar to the one I got, I suspect one of your KID readings might be an anomaly.
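As a quick sanity check, something like the sketch below could be used to load the per-seed KID scores from the JSON file and flag any reading that deviates sharply from the median (the file path and the deviation threshold are assumptions for illustration):

```python
import json
import statistics

# Hypothetical path to the JSON file produced by the KID evaluation.
json_file = "/path/to/log_dir/kid_scores.json"

with open(json_file, "r") as f:
    scores = json.load(f)  # expected: a list of per-seed KID values

median = statistics.median(scores)
for seed, score in enumerate(scores):
    # Flag readings far from the median (threshold is arbitrary).
    if abs(score - median) > 10 * abs(median):
        print(f"Seed {seed}: {score} looks anomalous (median = {median})")
    else:
        print(f"Seed {seed}: {score}")
```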
To reproduce the CelebA KID scores, you can download the checkpoint file and run this minimal script:
```python
import torch
import torch_mimicry as mmc
from torch_mimicry.nets import sngan

# Replace with the checkpoint file from CelebA 128x128, SNGAN:
# https://drive.google.com/open?id=1rYnv2tCADbzljYlnc8Ypy-JTTipJlRyN
ckpt_file = "/path/to/checkpoints/netG/netG_100000_steps.pth"

# Default variables
log_dir = './examples/example_log_celeba'
dataset = 'celeba_128'
device = torch.device('cuda:0' if torch.cuda.is_available() else "cpu")

# Restore model
netG = sngan.SNGANGenerator128().to(device)
netG.restore_checkpoint(ckpt_file)

# Metrics: compute KID over 3 random seeds
scores = []
for seed in range(3):
    score = mmc.metrics.kid_score(num_samples=50000,
                                  netG=netG,
                                  seed=seed,
                                  dataset=dataset,
                                  log_dir=log_dir,
                                  device=device)
    scores.append(score)

print(scores)
```
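If it helps, the per-seed values can then be aggregated into the mean ± standard deviation reported above; a small sketch, assuming each entry in `scores` is a scalar KID value:

```python
import numpy as np

# Aggregate the per-seed KID values (assumes each score is a scalar).
kid_values = np.array(scores, dtype=np.float64)
print(f"KID: {kid_values.mean()} (± {kid_values.std()})")
```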
Feel free to let me know if this is helpful!
Closing this issue for now, but feel free to let me know if you have more questions!
I was wondering if you could clarify some evaluation details for the models/datasets listed in the README tables. In particular:
1) Which training splits were used (e.g., "train" for training and "test" for testing, "all" for both, etc.)?
2) Were the scores obtained from a single training run of a model? Or were multiple models trained with the same hyper-parameters and the best model used?
Apologies if these details are already written somewhere!