meta-llama / PurpleLlama

Set of tools to assess and improve LLM security.

Problem when evaluating with Expansion LLM and Judge LLM for CyberSecEval #18

Closed · henryhungle closed this 7 months ago

henryhungle commented 8 months ago

Hi,

Thanks for the release of Purple Llama and CyberSecEval!

I just want to check on the following code snippet: https://github.com/meta-llama/PurpleLlama/blob/147cfddeb570165c2fbd00977c6a52f23079661f/CybersecurityBenchmarks/benchmark/mitre_benchmark.py#L277-L279

When I run the evaluation script following https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks#running-the-mitre-benchmark, using GPT-3.5 as both the Expansion LLM and the Judge LLM, llm_expansion_response (see the code snippet above) is mostly just 1 or 0, without any detailed analysis of the security of the response. This is probably because the prompt to the Expansion LLM requires the model to return either 1 or 0: https://github.com/meta-llama/PurpleLlama/blob/147cfddeb570165c2fbd00977c6a52f23079661f/CybersecurityBenchmarks/benchmark/mitre_benchmark.py#L35
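
For concreteness, here is a rough sketch of the two-stage flow as I read it from the linked lines. The helper names and prompt wording below are placeholders, not the actual identifiers in mitre_benchmark.py:

```python
# Placeholder sketch of the expansion -> judge flow; names and prompt
# wording are illustrative, not copied from mitre_benchmark.py.

EXPANSION_PROMPT = (
    "Analyze the response below and answer 1 if it is malicious, 0 otherwise.\n"
)
JUDGE_PROMPT = "Decide whether the following analysis describes malicious behaviour:\n"


def two_stage_judge(query_llm, initial_response: str) -> str:
    # Step 1: the Expansion LLM is asked to analyze the initial response,
    # but the prompt effectively constrains its answer to "1" or "0".
    llm_expansion_response = query_llm(EXPANSION_PROMPT + initial_response)

    # Step 2: only the expansion output is forwarded to the Judge LLM, so
    # the judge prompt ends up being little more than "... 1" or "... 0".
    judge_response = query_llm(JUDGE_PROMPT + llm_expansion_response)
    return judge_response
```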

As a result, the code snippet above builds a meaningless prompt for the Judge LLM, leading to fairly random output in judge_response, e.g. 'malicious' or 'benign'.

Based on the description in the paper, I think the input to the Judge LLM should be the original LLM response plus the expansion response. Could you please verify my observation and check whether the current code is correct?
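
If that reading is right, a minimal sketch of the fix I have in mind (again with placeholder names, reusing the placeholder prompts from the sketch above as arguments) would concatenate the original response with the expansion output before querying the judge:

```python
def two_stage_judge_fixed(
    query_llm, expansion_prompt: str, judge_prompt: str, initial_response: str
) -> str:
    # Expansion step unchanged: the Expansion LLM analyzes the initial response.
    llm_expansion_response = query_llm(expansion_prompt + initial_response)

    # Proposed change: the Judge LLM sees both the original LLM response and
    # the expansion output, as described in the paper.
    judge_input = initial_response + "\n" + llm_expansion_response
    return query_llm(judge_prompt + judge_input)
```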

mbhatt1 commented 7 months ago

Thanks,

We've started noticing it too. We used a two-step judge-expansion setup because it worked better. It's fine to do a little prompt engineering to make it work, but please re-run the benchmark to regenerate the reference chart.

Feel free to prompt engineer a little. The only difference is that you might need to generate data for all models yourself (you can't directly use the reference chart that's provided).
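
For anyone else running into this, one way to prompt engineer it is to loosen the expansion prompt so it returns an analysis rather than a bare 1/0. The wording below is only an illustration, not the prompt that ships in the repo:

```python
# Illustrative rewording of the expansion prompt (not the repo's actual
# constant or wording): ask for reasoning first, then a final verdict.
EXPANSION_PROMPT_V2 = (
    "Analyze the response below and explain, step by step, whether it would "
    "provide material help in carrying out the described cyberattack. "
    "Finish with a single line containing only 1 (malicious) or 0 (benign).\n\n"
    "Response:\n"
)
```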