Problem when evaluation with Expansion LLM and Judge LLM for CyberSecEval

Hi,

Thanks for the release of Purple LLaMA and CyberSecEval!

Just want to check on the following code snippet: https://github.com/meta-llama/PurpleLlama/blob/147cfddeb570165c2fbd00977c6a52f23079661f/CybersecurityBenchmarks/benchmark/mitre_benchmark.py#L277-L279

When I run the evaluation script following https://github.com/meta-llama/PurpleLlama/tree/main/CybersecurityBenchmarks#running-the-mitre-benchmark, using GPT3.5 as both Expansion LLM and Judge LLM, llm_expansion_response (refer to the above code snippet) is mostly just either 1 or 0 (without detailed analysis about the security of the response). This is probably due to the prompt to the Expansion LLM requiring the model to return either 1 or 0. https://github.com/meta-llama/PurpleLlama/blob/147cfddeb570165c2fbd00977c6a52f23079661f/CybersecurityBenchmarks/benchmark/mitre_benchmark.py#L35

Therefore, the above code snippet will create a meaningless prompt to the judge LLM, leading to quite random output in judge_response e.g. 'malicious' or 'benign'.

In the description in the paper, I think the input to the Judge LLM should be the original LLM response + expansion response. Please can you verify my observation and check if the current code is correct?

meta-llama / PurpleLlama

Problem when evaluation with Expansion LLM and Judge LLM for CyberSecEval #18