tgxs002 / HPSv2

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
Apache License 2.0

Do you test on the HPSv1 dataset using the HPSv2 checkpoint? #7

Closed LinB203 closed 1 year ago

LinB203 commented 1 year ago

Hi, I used the HPSv2 checkpoint to test on the HPSv1 dataset and got 59.51% accuracy, but with the HPSv1 checkpoint I get 65.44% accuracy. Why is it worse? Is it a domain gap? By the way, the aesthetic predictor gets 55.57% accuracy on HPSv1. Is that normal? num_images is a tensor of 2s, such as [2, 2, 2, 2, ...].

HPSv2 checkpoint to test the HPSv1 dataset:

    for batch in bar:
        images, num_images, labels, caption, rank = batch
        images = images.cuda()
        num_images = num_images.cuda()
        # labels = labels.cuda()
        caption = caption.cuda()
        rank = rank.cuda()

        with torch.no_grad():
            # CLIP-style features, L2-normalized so the dot product is cosine similarity
            image_features = model.encode_image(images)
            text_features = model.encode_text(caption)

            image_features = image_features / image_features.norm(dim=-1, keepdim=True)
            text_features = text_features / text_features.norm(dim=-1, keepdim=True)

            logits_per_image = image_features @ text_features.T
            # split rows into prompt groups; column i holds the scores against prompt i
            paired_logits_list = [logit[:, i] for i, logit in enumerate(logits_per_image.split(num_images.tolist()))]
        # predicted ranking within each group (higher score = better)
        predicted = [torch.argsort(-k) for k in paired_logits_list]
        hps_ranking = [[predicted[i].tolist().index(j) for j in range(n)] for i, n in enumerate(num_images)]
        rank = [i for i in rank.split(num_images.tolist())]
        score += sum([inversion_score(hps_ranking[i], rank[i]) for i in range(len(hps_ranking))])
    ranking_acc = score / total
    print(ranking_acc)

HPSv1 checkpoint to test the HPSv1 dataset:

    for batch in bar:
        images, num_images, labels, caption, rank = batch
        images = images.cuda()
        num_images = num_images.cuda()
        # labels = labels.cuda()
        caption = caption.cuda()
        rank = rank.cuda()

        with torch.no_grad():
            with torch.cuda.amp.autocast():
                outputs = model(images, caption)
                image_features = outputs["image_features"]
                text_features = outputs["text_features"]
                logit_scale = outputs["logit_scale"]
                logits_per_image = logit_scale * image_features @ text_features.T
                # split rows into prompt groups; column i holds the scores against prompt i
                paired_logits_list = [logit[:, i] for i, logit in enumerate(logits_per_image.split(num_images.tolist()))]

        # predicted ranking within each group (higher score = better)
        predicted = [torch.argsort(-k) for k in paired_logits_list]
        hps_ranking = [[predicted[i].tolist().index(j) for j in range(n)] for i, n in enumerate(num_images)]
        rank = [i for i in rank.split(num_images.tolist())]
        score += sum([inversion_score(hps_ranking[i], rank[i]) for i in range(len(hps_ranking))])
    ranking_acc = score / total * 100
    print(ranking_acc)
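
inversion_score is not shown above; a minimal sketch of such a helper, assuming it measures the fraction of image pairs that the predicted ranking and the ground-truth rank order the same way (the actual implementation is the one in this repo's evaluation code):

    def inversion_score(pred_ranking, gt_rank):
        # Fraction of image pairs ordered the same way by the predicted
        # ranking and the ground-truth rank (1.0 = perfect agreement).
        n = len(pred_ranking)
        total_pairs = n * (n - 1) / 2
        agree = 0
        for i in range(n):
            for j in range(i + 1, n):
                if (pred_ranking[i] - pred_ranking[j]) * (gt_rank[i] - gt_rank[j]) > 0:
                    agree += 1
        return agree / total_pairs
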
tgxs002 commented 1 year ago

I guess there is something wrong with the numbers, hmm... The best performance I got on HPD v1 was around 43. num_images should be the number of images generated from the same prompt in a group, which is typically 3 or 4 for HPD v1.
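
A tiny, hypothetical illustration of what num_images should contain (the prompt strings are made up for the example):

    from itertools import groupby

    # Hypothetical illustration: num_images should hold the size of each
    # prompt group, i.e. how many images were generated from the same prompt.
    prompts = ["a cat", "a cat", "a cat", "a dog", "a dog", "a dog", "a dog"]
    num_images = [len(list(g)) for _, g in groupby(prompts)]
    print(num_images)  # [3, 4] for HPD v1-style groups, not [2, 2, ...]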

LinB203 commented 1 year ago

I guess there is something wrong with the numbers, hmm... The best performance I got on HPD v1 was around 43. num_images should be the number of images generated from the same prompt in a group, which is typically 3 or 4 for HPD v1.

Sorry for my late reply. HPD v1 only specifies 1 preferred image out of the 3 or 4 images in a group. So should I evaluate it like the ImageReward dataset? The ImageReward dataset also has ties, so the rank may look like [1, 2, 2, 2] for a list of 4 images.

tgxs002 commented 1 year ago

In v1, top-1 accuracy is reported, which is different from v2. You can choose different evaluation protocols depending on the baseline you are comparing with.
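
For example, a minimal sketch of a v1-style top-1 check, assuming a group where only one image carries the best rank (the function name and numbers are made up for illustration):

    import torch

    def top1_accuracy(scores, gt_rank):
        # v1-style top-1: correct if the highest-scoring image is one of the
        # images with the best (smallest) ground-truth rank.
        best = int(torch.argmax(scores))
        return float(gt_rank[best] == min(gt_rank))

    scores = torch.tensor([0.31, 0.28, 0.30, 0.25])  # model scores for one group
    gt_rank = [1, 2, 2, 2]  # one preferred image, ties among the rest
    print(top1_accuracy(scores, gt_rank))  # 1.0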

LinB203 commented 1 year ago

In v1, top-1 accuracy is reported, which is different from v2. You can choose different evaluation protocols depending on the baseline you are comparing with.

Oh! I got it. I reproduced the number 43.2 on HPD v1 using the HPS v1 checkpoint, which is close to the 43.5 in your paper.

LinB203 commented 1 year ago

In v1, top-1 accuracy is reported, which is different from v2. You can choose different evaluation protocols depending on the baseline you are comparing with.

The following results were all evaluated on the HPD v1 test split:

aesthetic 31.5, close to the paper's 31.4
CLIP 33.2, close to the paper's 32.9
HPS v1 43.2, close to the paper's 43.5
HPS v2 36.6, paper N/A
ImageReward 36.0, paper N/A

Do you think it is normal?

tgxs002 commented 1 year ago

That might be correct. There might be a gap between the v1 and v2 data, because the v1 data was not collected by directly asking users for their preference.

LinB203 commented 1 year ago

That might be correct. There might be a gap between the v1 and v2 data, because the v1 data was not collected by directly asking users for their preference.

That's reasonable. I further evaluated more methods and got the following numbers, all on the HPD v2 test split:

aesthetic 76.8, paper 72.6
CLIP 62.5, paper N/A
HPS v1 77.6, paper 73.1
ImageReward 74.0, paper 70.6
HPS v2 83.3, paper 83.3

I reproduced the HPS v2 number perfectly, but there is a gap for the other methods. Am I missing anything?

LinB203 commented 1 year ago

@tgxs002 Could you help me to reproduce your results?

tgxs002 commented 1 year ago

Sorry for the late reply, we are investigating this issue.

tgxs002 commented 1 year ago

@LinB203 We have checked our records, and there was indeed a bug in an earlier version of our code, which was used to evaluate the baselines. We will provide a detailed explanation of the error in this thread in the coming days and update the preprint ASAP. Thank you for pointing out the error!

tgxs002 commented 1 year ago

@LinB203 The difference is due to an outdated evaluation protocol. When evaluating aesthetic, HPS v1, and ImageReward, we first computed the accuracy against the label given by each annotator (10 annotators for each instance in the test set). However, for HPS v2 we were using another codebase (this one), where the accuracy is computed differently: the labels from the annotators are first aggregated into an average label, and the accuracy is computed against that. The annotation file with the raw labels from each annotator is now updated in the repo. Thank you again for pointing out the misalignment!
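
A minimal sketch of the difference between the two protocols, using hypothetical annotator labels for a single two-image comparison (the real annotation format is the one in the updated file):

    import torch

    # Hypothetical example: 10 annotators each pick the image they prefer
    # (0 or 1) for one two-image comparison; the model prefers image 0.
    annotator_choices = torch.tensor([0, 0, 0, 1, 0, 0, 1, 0, 0, 1])
    model_choice = 0

    # Protocol used for the baselines: accuracy against every annotator's label.
    per_annotator_acc = (annotator_choices == model_choice).float().mean()

    # Protocol used for HPS v2 in this codebase: aggregate the labels first
    # (approximated here by a majority vote), then compare once.
    aggregated_label = int(annotator_choices.float().mean().round())
    aggregated_acc = float(model_choice == aggregated_label)

    print(per_annotator_acc.item(), aggregated_acc)  # 0.7 vs 1.0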

LinB203 commented 1 year ago

@LinB203 The difference is due to an outdated evaluation protocol. When evaluating aesthetic, HPS v1, and ImageReward, we first computed the accuracy against the label given by each annotator (10 annotators for each instance in the test set). However, for HPS v2 we were using another codebase (this one), where the accuracy is computed differently: the labels from the annotators are first aggregated into an average label, and the accuracy is computed against that. The annotation file with the raw labels from each annotator is now updated in the repo. Thank you again for pointing out the misalignment!

Yes, I am using the code of this repo to reproduce the results. Anyway, HPS v2 is still the best one, haha...