microsoft / promptbench

A unified evaluation framework for large language models
http://aka.ms/promptbench
MIT License
2.38k stars 178 forks

Why are the experimental results different? #84

Open zl-comment opened 3 days ago

zl-comment commented 3 days ago

bertattack

original prompt: Evaluate the sentiment of the given text and classify it as 'positive' or 'negative':
original score: 0.4934426229508197
attacked prompt: Evaluate the sеntiment of the given text and classification it as 'positive' or 'negative':
attacked score: 0.4852459016393443
PDR: 0.01661129568106314

original prompt: Given the context of this text, indicate if the emotion conveyed is 'positive' or 'negative':
original score: 0.5
attacked prompt: Given the context of this text, indicatе if the emoton conveyed is 'positive' or 'negative':
attacked score: 0.4737704918032787
PDR: 0.05245901639344264

original prompt: Analyze the tone of this statement and respond with either 'positive' or 'negative':
original score: 0.5180327868852459
attacked prompt: Anaⅼyze the tone of this statement and respond with nor 'positive' or 'negative':
attacked score: 0.460655737704918
PDR: 0.1107594936708861

original prompt: In the role of a sentiment analysis tool, respond with 'positive' or 'negative' to classify this statement:
original score: 0.5229508196721312
attacked prompt: In the role of a sentiment analyse instrumentation, respond with 'positive' or 'negative' to classify this statemen𝚝:
attacked score: 0.4639344262295082
PDR: 0.11285266457680258

original prompt: Functioning as a sentiment identification tool, assess if the following expression is 'positive' or 'negative'. Classify:
original score: 0.5180327868852459
attacked prompt: Functionіng as a sentiment identіfication tool, assess if the following expression is 'positive' or 'negative'. Classify:
attacked score: 0.4639344262295082
PDR: 0.10443037974683543

original prompt: Serving as a sentiment evaluation model, determine if the given statement is 'positive' or 'negative'. Classify:
original score: 0.5278688524590164
attacked prompt: Serve as a sentimeոt evaluation modelling, determine if the given statement is 'positive' or 'negative'. Classify:
attacked score: 0.4721311475409836
PDR: 0.10559006211180119
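For readers checking the numbers, the PDR values in these logs are consistent with the relative drop from the original score to the attacked score, that is, 1 - attacked / original. The snippet below is a minimal sketch of that arithmetic. The function name `pdr` is made up for illustration and is not taken from the promptbench code.

```python
def pdr(original_score: float, attacked_score: float) -> float:
    """Relative performance drop from the original to the attacked score.

    Hypothetical helper for illustration; not part of promptbench.
    """
    if original_score == 0:
        raise ValueError("original_score must be non-zero")
    return 1 - attacked_score / original_score


# Scores copied from the first log entry above.
first = pdr(0.4934426229508197, 0.4852459016393443)
print(first)  # close to the reported PDR of 0.0166
```

The same formula reproduces the other entries, for example 0.5 and 0.4737704918032787 give roughly 0.0525.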

With the prompts from the paper, the original accuracy is only about 50%, but with the prompts from the sample code (https://github.com/microsoft/promptbench/blob/main/examples/prompt_attack.ipynb), the original accuracy is more than 90%:

bertattack

'original prompt': "Evaluate the sentiment of the given text and classify it as 'positive' or 'negative':\nQuestion: {content}\nAnswer:"
'original score': 0.9459016393442623
'attacked prompt': "Evaluate the sincerity of the received text and rate it as 'positive' or 'negative':\ncuriosity: {content}\nAnswer:"
'attacked score': 0.9147540983606557
'PDR': 0.03292894280762571(0.9147540983606557~0.9557377049180328)

'original prompt': "Given the context of this text, indicate if the emotion conveyed is 'positive' or 'negative':\nQuestion: {content}\nAnswer:"
'original score': 0.9393442622950819
'attacked prompt': "i the context of this text, indicate if the emotion conveyed is 'positive' or 'negative':\nQuestion: {content}\nfield:"
'attacked score': 0.9229508196721311
'PDR': 0.017452006980802813(0.9229508196721311~)

'original prompt': "Given the context of this text, indicate if the emotion conveyed is 'positive' or 'negative':\nQuestion: {content}\nAnswer:"
'original score': 0.9442622950819672
'attacked prompt': "Given the clarity of this section, specify if the emotion depicted is 'positive' or 'negative':\ncuriosity: {content}\nAnswer:"
'attacked score': 0.8885245901639345
'PDR': 0.05902777777777775

'original prompt': "Analyze the tone of this statement and respond with either 'positive' or 'negative':\nQuestion: {content}\nAnswer:"
'original score': 0.9475409836065574
'attacked prompt': "Analyze the syntax of this phrase and address with meaning 'positive' or 'negative':\nriddle: {content}\nAnswer:"
'attacked score': 0.6081967213114754
'PDR': 0.35813148788927335

'original prompt': "In the role of a sentiment analysis tool, respond with 'positive' or 'negative' to classify this statement:\nQuestion: {content}\nAnswer:"
'original score': 0.9540983606557377
'attacked prompt': "in the role of a sentiment analysis game, respond with 'positive' or 'negative' to classify this function:\ncuriosity: {content}\nquestions:"
'attacked score': 0.8918032786885246
'PDR': 0.06529209621993128

'original prompt': "Functioning as a sentiment identification tool, assess if the following expression is 'positive' or 'negative'. Classify:\nQuestion: {content}\nAnswer:"
'original score': 0.9475409836065574
'attacked prompt': "Functioning as a emotion correlation tool, study if the following expression is 'positive' or 'negative'. find:\ncuriosity: {content}\nAnswer:"
'attacked score': 0.8918032786885246
'PDR': 0.05882352941176468

'original prompt': "Serving as a sentiment evaluation model, determine if the given statement is 'positive' or 'negative'. Classify:\nQuestion: {content}\nAnswer:"
'original score': 0.9491803278688524
'attacked prompt': "which as a sentiment evaluation models, determine if the given words is 'positive' or 'negative'. model:\ncorrectly: {content}\nasking:"
'attacked score': 0.8885245901639345
'PDR': 0.06390328151986174
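One difference between the two logs can be seen in the prompts themselves: the notebook-style prompts end with a `\nQuestion: {content}\nAnswer:` suffix, while the paper-style prompts, as logged, do not. The sketch below only checks for that difference. The helper `has_content_slot` is a hypothetical name, and whether the missing suffix explains the accuracy gap is not established in this thread.

```python
# Prompt text copied from the logs above; variable names are illustrative.
paper_style = (
    "Evaluate the sentiment of the given text and classify it as "
    "'positive' or 'negative':"
)
notebook_style = paper_style + "\nQuestion: {content}\nAnswer:"


def has_content_slot(prompt: str) -> bool:
    """Return True if the prompt contains the '{content}' placeholder."""
    return "{content}" in prompt


for name, prompt in [("paper", paper_style), ("notebook", notebook_style)]:
    print(name, has_content_slot(prompt))

# Filling the placeholder with a sample sentence (sample text is made up).
print(notebook_style.format(content="A charming and funny film."))
```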

Immortalise commented 1 day ago

Hi, could you please indicate which model you are using for the attack? The difference may arise from the use of a different model.