tingxueronghua / ChartLlama-code


Evaluation scripts that calculate accuracy of performance on ChartQA #17

Closed QAQdev closed 5 months ago

QAQdev commented 5 months ago

Hello team, I want to know whether you have open-sourced the scripts that calculate the model's accuracy / score on ChartQA, i.e. the relaxed accuracy mentioned in the paper.

In other words, I want to know exactly how to calculate the scores in Tab. 2:

[screenshot of Tab. 2 from the paper]

Looking forward to your response, thanks!

tingxueronghua commented 5 months ago

This should work.

import os
import json
import argparse

# Usage: python <this_script>.py --prediction <predictions.jsonl> --groundtruth <groundtruth.json>
parser = argparse.ArgumentParser()
parser.add_argument("--prediction", type=str, required=True)
parser.add_argument("--groundtruth", type=str, required=True)
args = parser.parse_args()

# Predictions are read as JSONL (one JSON object per line); the ground truth is a
# single JSON array in LLaVA-style conversation format.
pred = [json.loads(line) for line in open(args.prediction, 'r')]
# pred = json.load(open(args.prediction, 'r'))  # use this instead if the predictions are a plain JSON array
gt = json.load(open(args.groundtruth, 'r'))

# Base name of the prediction file (kept for reference, not used below).
name = args.prediction.split('/')[-1].split('.')[0]

assert len(pred) == len(gt)

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

total_length = len(pred)
res_flag = []  # per-sample correctness flags
res_pair = []  # prediction / answer pairs for inspection
wrong_ = []    # ids of incorrectly answered samples
for i in range(total_length):
    # Different inference scripts store the model output under different keys.
    if 'text' in pred[i]:
        each_pred = pred[i]['text'].strip().strip("</s>").lower()
    elif 'answer' in pred[i]:
        each_pred = pred[i]['answer'].strip().strip("</s>").lower()
    else:
        each_pred = pred[i]['output'].strip().strip("</s>").lower()
    each_gt = gt[i]['conversations'][1]['value'].strip().lower()
    if is_number(each_gt) and is_number(each_pred):
        # Relaxed accuracy: a numeric prediction counts as correct if it lies
        # within 5% of the ground-truth value.
        num_gt = float(each_gt)
        num_pred = float(each_pred)
        if (num_pred < num_gt * 1.05) and (num_pred > num_gt * 0.95):
            res_flag.append(True)
        # For exact numeric matching instead, compare num_pred == num_gt here.
        else:
            res_flag.append(False)
    else:
        # Non-numeric answers require an exact (case-insensitive) string match.
        if each_pred == each_gt:
            res_flag.append(True)
        else:
            res_flag.append(False)
    res_pair.append({'flag': res_flag[i], 'pred': each_pred, 'ans': each_gt})
    if not res_flag[-1]:
        wrong_.append(gt[i]['id'])

# Overall relaxed accuracy, plus per-sample results for debugging.
print(sum(res_flag) / len(res_flag))
with open('res_flag.json', 'w') as f:
    json.dump(res_flag, f)

with open('answer.json', 'w') as f:
    json.dump(res_pair, f, indent=4)
with open('wrong_.json', 'w') as f:
    json.dump(wrong_, f, indent=4)
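For reference, the script expects the predictions as a JSONL file (one JSON object per line, unless the commented-out loader is used) and the ground truth as a LLaVA-style conversation JSON; a typical invocation would be python eval_relaxed_acc.py --prediction answers.jsonl --groundtruth chartqa_test.json, where the file names are just placeholders.
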
QAQdev commented 5 months ago

Thanks for your kind reply, but I still have a question. In the code, the only operation for extracting predictions is each_pred = pred[i]['output'].strip().strip("</s>").lower(). I wonder, during inference, did you use any prompt to make sure the model's response contains only the expected answer?

tingxueronghua commented 5 months ago

> Thanks for your kind reply, but I still have a question. In the code, the only operation for extracting predictions is each_pred = pred[i]['output'].strip().strip("</s>").lower(). I wonder, during inference, did you use any prompt to make sure the model's response contains only the expected answer?

You can add "Answer the question using a single word or phrase." at the end of the question to further improve the results.
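
For example, a minimal sketch of appending that hint when building the evaluation queries (the helper name and usage below are only illustrative, not the repo's actual inference code):

# Minimal sketch: append the answer-format hint to each ChartQA question before
# running inference. Helper name and field handling are assumptions.
SUFFIX = " Answer the question using a single word or phrase."

def build_query(question):
    # Avoid appending the hint twice if it is already present.
    question = question.rstrip()
    return question if question.endswith(SUFFIX.strip()) else question + SUFFIX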

QAQdev commented 5 months ago

> Thanks for your kind reply, but I still have a question. In the code, the only operation for extracting predictions is each_pred = pred[i]['output'].strip().strip("</s>").lower(). I wonder, during inference, did you use any prompt to make sure the model's response contains only the expected answer?
>
> You can add "Answer the question using a single word or phrase." at the end of the question to further improve the results.

Yes, I tried this type of prompt earlier, but the model does not follow my instructions, so I reached out for your kind help.

tingxueronghua commented 5 months ago

> Thanks for your kind reply, but I still have a question. In the code, the only operation for extracting predictions is each_pred = pred[i]['output'].strip().strip("</s>").lower(). I wonder, during inference, did you use any prompt to make sure the model's response contains only the expected answer?
>
> You can add "Answer the question using a single word or phrase." at the end of the question to further improve the results.
>
> Yes, I tried this type of prompt earlier, but the model does not follow my instructions, so I reached out for your kind help.

I guess you need to check whether the LoRA modules are successfully loaded. In my own experience, similar problems only occur when using models like the original LLaVA-1.5, which have not been SFTed.
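
In case it is useful, here is a rough way to sanity-check that after loading (the model variable and the name-based check are assumptions about a typical PEFT/LoRA setup, not the exact ChartLlama loading code):

# Rough sanity check, assuming a PEFT/LoRA-style model whose adapter parameters
# carry "lora" in their names; adapt the check to the actual loading code.
def count_lora_tensors(model):
    return sum(1 for name, _ in model.named_parameters() if "lora" in name.lower())

# Example (model is whatever the inference script loads):
# if count_lora_tensors(model) == 0:
#     raise RuntimeError("No LoRA parameters found; the adapter was probably not loaded.")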