Open zl-comment opened 1 month ago
Hi, could you please indicate which model you are using for the attack? The difference may arise from the use of a different model.
model='google/flan-t5-large'
Could you please check and compare the results here? On this website, the result for T5 on the SST-2 dataset is around 95%.
bertattack
original prompt: Evaluate the sentiment of the given text and classify it as 'positive' or 'negative':
original score: 0.4934426229508197
attacked prompt: Evaluate the sеntiment of the given text and classification it as 'positive' or 'negative':
attacked score: 0.4852459016393443
PDR: 0.01661129568106314

original prompt: Given the context of this text, indicate if the emotion conveyed is 'positive' or 'negative':
original score: 0.5
attacked prompt: Given the context of this text, indicatе if the emoton conveyed is 'positive' or 'negative':
attacked score: 0.4737704918032787
PDR: 0.05245901639344264

original prompt: Analyze the tone of this statement and respond with either 'positive' or 'negative':
original score: 0.5180327868852459
attacked prompt: Anaⅼyze the tone of this statement and respond with nor 'positive' or 'negative':
attacked score: 0.460655737704918
PDR: 0.1107594936708861

original prompt: In the role of a sentiment analysis tool, respond with 'positive' or 'negative' to classify this statement:
original score: 0.5229508196721312
attacked prompt: In the role of a sentiment analyse instrumentation, respond with 'positive' or 'negative' to classify this statemen𝚝:
attacked score: 0.4639344262295082
PDR: 0.11285266457680258

original prompt: Functioning as a sentiment identification tool, assess if the following expression is 'positive' or 'negative'. Classify:
original score: 0.5180327868852459
attacked prompt: Functionіng as a sentiment identіfication tool, assess if the following expression is 'positive' or 'negative'. Classify:
attacked score: 0.4639344262295082
PDR: 0.10443037974683543

original prompt: Serving as a sentiment evaluation model, determine if the given statement is 'positive' or 'negative'. Classify:
original score: 0.5278688524590164
attacked prompt: Serve as a sentimeոt evaluation modelling, determine if the given statement is 'positive' or 'negative'. Classify:
attacked score: 0.4721311475409836
PDR: 0.10559006211180119
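For anyone checking these numbers: the PDR values in the log are consistent with a relative drop, 1 - attacked score / original score. Below is a minimal sketch of that arithmetic. The function name `pdr` is hypothetical and not taken from the promptbench code; the formula is inferred from the logged values.

```python
def pdr(original_score: float, attacked_score: float) -> float:
    """Relative performance drop caused by the attack (assumed formula).

    Hypothetical helper: inferred from the logged numbers, not copied
    from the promptbench source.
    """
    if original_score == 0:
        raise ValueError("original_score must be non-zero")
    return 1.0 - attacked_score / original_score


# First entry of the log above.
first = pdr(0.4934426229508197, 0.4852459016393443)
print(first)
```

A small PDR on a baseline of about 0.5 says little on a two-class task, since the original score is already near chance.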
Using the prompts from the paper, the original accuracy is only about 50%, whereas the original accuracy in the source code is about 90%. With the prompts from the sample code (https://github.com/microsoft/promptbench/blob/main/examples/prompt_attack.ipynb), however, the original accuracy is above 90%.
bertattack
'original prompt': "Evaluate the sentiment of the given text and classify it as 'positive' or 'negative':\nQuestion: {content}\nAnswer:"
'original score': 0.9459016393442623
'attacked prompt': "Evaluate the sincerity of the received text and rate it as 'positive' or 'negative':\ncuriosity: {content}\nAnswer:"
'attacked score': 0.9147540983606557
'PDR': 0.03292894280762571 (0.9147540983606557~0.9557377049180328)

'original prompt': "Given the context of this text, indicate if the emotion conveyed is 'positive' or 'negative':\nQuestion: {content}\nAnswer:"
'original score': 0.9393442622950819
'attacked prompt': "i the context of this text, indicate if the emotion conveyed is 'positive' or 'negative':\nQuestion: {content}\nfield:"
'attacked score': 0.9229508196721311
'PDR': 0.017452006980802813 (0.9229508196721311~)

'original prompt': "Given the context of this text, indicate if the emotion conveyed is 'positive' or 'negative':\nQuestion: {content}\nAnswer:"
'original score': 0.9442622950819672
'attacked prompt': "Given the clarity of this section, specify if the emotion depicted is 'positive' or 'negative':\ncuriosity: {content}\nAnswer:"
'attacked score': 0.8885245901639345
'PDR': 0.05902777777777775

'original prompt': "Analyze the tone of this statement and respond with either 'positive' or 'negative':\nQuestion: {content}\nAnswer:"
'original score': 0.9475409836065574
'attacked prompt': "Analyze the syntax of this phrase and address with meaning 'positive' or 'negative':\nriddle: {content}\nAnswer:"
'attacked score': 0.6081967213114754
'PDR': 0.35813148788927335

'original prompt': "In the role of a sentiment analysis tool, respond with 'positive' or 'negative' to classify this statement:\nQuestion: {content}\nAnswer:"
'original score': 0.9540983606557377
'attacked prompt': "in the role of a sentiment analysis game, respond with 'positive' or 'negative' to classify this function:\ncuriosity: {content}\nquestions:"
'attacked score': 0.8918032786885246
'PDR': 0.06529209621993128

'original prompt': "Functioning as a sentiment identification tool, assess if the following expression is 'positive' or 'negative'. Classify:\nQuestion: {content}\nAnswer:"
'original score': 0.9475409836065574
'attacked prompt': "Functioning as a emotion correlation tool, study if the following expression is 'positive' or 'negative'. find:\ncuriosity: {content}\nAnswer:"
'attacked score': 0.8918032786885246
'PDR': 0.05882352941176468

'original prompt': "Serving as a sentiment evaluation model, determine if the given statement is 'positive' or 'negative'. Classify:\nQuestion: {content}\nAnswer:"
'original score': 0.9491803278688524
'attacked prompt': "which as a sentiment evaluation models, determine if the given words is 'positive' or 'negative'. model:\ncorrectly: {content}\nasking:"
'attacked score': 0.8885245901639345
'PDR': 0.06390328151986174
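One visible difference between the two logs is that the sample-code prompts end with a `\nQuestion: {content}\nAnswer:` template, while the paper-style prompts are logged without it. The sketch below only illustrates how the two prompt shapes produce different model inputs. It may or may not explain the accuracy gap, and `build_input` is a hypothetical helper. How the first run actually joined the prompt and the sentence is not shown in the logs, so the plain-concatenation branch is an assumption.

```python
PAPER_STYLE = (
    "Evaluate the sentiment of the given text and classify it as "
    "'positive' or 'negative':"
)
# Same instruction with the template suffix seen in the sample-code log.
NOTEBOOK_STYLE = PAPER_STYLE + "\nQuestion: {content}\nAnswer:"


def build_input(prompt: str, content: str) -> str:
    """Hypothetical helper: turn a prompt and one SST-2 sentence into model input."""
    if "{content}" in prompt:
        # Template-style prompt: fill the placeholder.
        return prompt.format(content=content)
    # Assumption: without a placeholder, the sentence is simply appended.
    return prompt + " " + content


sentence = "a charming and often affecting journey"
print(repr(build_input(PAPER_STYLE, sentence)))
print(repr(build_input(NOTEBOOK_STYLE, sentence)))
```

Printing both inputs for a few sentences and comparing the raw model outputs would show whether the low original scores come from the prompt format or from how predictions are mapped to labels.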