Closed QAQdev closed 5 months ago
This should work.
import os
import json
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--prediction", type=str, required=True)
parser.add_argument("--groundtruth", type=str, required=True)
args = parser.parse_args()

# Predictions are stored as JSON Lines; ground truth is a single JSON array.
with open(args.prediction, 'r') as f:
    pred = [json.loads(line) for line in f]
with open(args.groundtruth, 'r') as f:
    gt = json.load(f)

name = os.path.basename(args.prediction).split('.')[0]
assert len(pred) == len(gt)

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def clean(text):
    # Strip whitespace and a trailing "</s>" token, then lowercase.
    # (Note: text.strip("</s>") would strip the characters '<', '/', 's', '>'
    # individually, mangling answers such as "yes" -> "ye".)
    text = text.strip()
    if text.endswith("</s>"):
        text = text[:-len("</s>")]
    return text.strip().lower()

total_length = len(pred)
res_flag = []
res_pair = []
wrong_ = []
for i in range(total_length):
    if 'text' in pred[i]:
        each_pred = clean(pred[i]['text'])
    elif 'answer' in pred[i]:
        each_pred = clean(pred[i]['answer'])
    else:
        each_pred = clean(pred[i]['output'])
    each_gt = gt[i]['conversations'][1]['value'].strip().lower()
    if is_number(each_gt) and is_number(each_pred):
        # Relaxed accuracy: a numeric prediction is correct if it falls within
        # 5% relative error of the ground truth. Using abs() makes the check
        # also behave correctly for negative ground-truth values.
        num_gt = float(each_gt)
        num_pred = float(each_pred)
        res_flag.append(abs(num_pred - num_gt) <= abs(num_gt) * 0.05)
    else:
        res_flag.append(each_pred == each_gt)
    res_pair.append({'flag': res_flag[i], 'pred': each_pred, 'ans': each_gt})
    if not res_flag[-1]:
        wrong_.append(gt[i]['id'])

print(sum(res_flag) / len(res_flag))
with open('res_flag.json', 'w') as f:
    json.dump(res_flag, f)
with open('answer.json', 'w') as f:
    json.dump(res_pair, f, indent=4)
with open('wrong_.json', 'w') as f:
    json.dump(wrong_, f, indent=4)
Thanks for your kind reply, but I still have questions. In the code, the only operation for extracting predictions is
each_pred = pred[i]['output'].strip().strip("</s>").lower()
I wonder, when running inference, did you use any prompt to ensure the model's response contains only the expected answer?
You can add "Answer the question using a single word or phrase." at the end of the question to further improve the results.
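Appending that hint to each question before inference could look like the following small helper (the function name is illustrative, not part of the repo):

```python
SUFFIX = "Answer the question using a single word or phrase."

def add_answer_hint(question: str) -> str:
    # Append the short-answer hint unless it is already present.
    q = question.rstrip()
    if not q.endswith(SUFFIX):
        q = q + " " + SUFFIX
    return q

print(add_answer_hint("What is the highest value in the chart?"))
# -> What is the highest value in the chart? Answer the question using a single word or phrase.
```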
Yes, I tried this type of prompt earlier, but the model did not follow my instructions, so I reached out for your kind help.
I guess you need to check whether the LoRA modules are loaded successfully. In my experience, similar problems only occur with models like the original LLaVA-1.5, which have not been SFTed.
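One quick sanity check for this, assuming a peft-style adapter where injected weights carry "lora_" in their parameter names (the helper below is a hypothetical sketch, not from the repo):

```python
def lora_param_names(named_parameters):
    # peft injects parameters whose names contain "lora_" (e.g. lora_A, lora_B).
    # An empty result after loading an adapter suggests the LoRA weights were
    # never attached to the base model.
    return [name for name, _ in named_parameters if "lora_" in name]

# Illustrative check against a fake parameter list; with a real model you
# would pass model.named_parameters() instead.
fake = [("model.layers.0.q_proj.weight", None),
        ("model.layers.0.q_proj.lora_A.weight", None)]
print(lora_param_names(fake))
# -> ['model.layers.0.q_proj.lora_A.weight']
```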
Hello team, I want to know whether you have open-sourced the script that calculates model accuracy/score on ChartQA, i.e. the relaxed accuracy mentioned in the paper.
In other words, I want to know exactly how the scores in Tab. 2 are calculated.
Looking forward to your response, thanks!