meta-llama / PurpleLlama

Set of tools to assess and improve LLM security.

MITRE dataset potential issues #57

Open AmenRa opened 4 days ago

AmenRa commented 4 days ago

Hi,

From my understanding of the paper and the structure of this file, each sample of the MITRE dataset has:

  1. base_prompt: the original text
  2. mutated_prompt_base: the prompt for rewriting the original text as the final test prompt
  3. mutated_prompt: the final test prompt

Inside each mutated_prompt there is a JSON object (as requested by the mutated_prompt_base) containing the test prompt.
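For reference, here is a minimal sketch of how I am reading each sample (the file name and the inner "prompt" key are my assumptions, not something I found documented):

```python
import json

# Hypothetical file name; point this at the actual MITRE dataset JSON.
with open("mitre_benchmark.json") as f:
    samples = json.load(f)

sample = samples[1]  # second sample, the one quoted below
print(sample["base_prompt"])          # 1. original text
print(sample["mutated_prompt_base"])  # 2. rewriting instructions
print(sample["mutated_prompt"])       # 3. final test prompt

# If the final test prompt is wrapped in a JSON object (as mutated_prompt_base
# requests), it might be unwrapped like this; the "prompt" key is a guess.
try:
    inner = json.loads(sample["mutated_prompt"])
    test_prompt = inner.get("prompt", sample["mutated_prompt"])
except json.JSONDecodeError:
    test_prompt = sample["mutated_prompt"]
```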

Questions:

  1. Is my understanding correct?
  2. Do the test prompts need to be extracted from the mutated_prompt before usage?
  3. I may be wrong, but I see many mutated_prompt entries containing references to the mutated_prompt_base that, from my understanding, should not be there. For example, the mutated_prompt of the second sample ends with "Note that your response should not include any actual code, but rather a description of the approach and philosophy behind it. Focus on creating a prompt that guides the LLM to produce correct, textbook-like code, rather than just any code. The quality of the code produced by the LLM is as important as the complexity of the prompt." which seems to be instructions carried over from the mutated_prompt_base.

Could you provide clarifications on those points?

Thanks,

Elias

mbhatt1 commented 1 day ago
  1. Yes to Q1.
  2. No to Q2, unless you'd like to generate different datasets.
  3. Utilizing mutated_prompt_base, it is possible to generate a completely new dataset that would be equally effective. The only caveat is that we would need to re-run tests and generate charts for all models (and all categories).
  4. The prompts frozen in the benchmark contain some amount of "this could cause an LLM to trigger a bug" material. You will also find stray { characters in some places, along with some extra randomly placed characters. These are included in case there were latent issues in the model training, which might cause problems such as the model regurgitating garbage, repeating content, not following instructions properly, etc.
  5. The charts are based on the frozen datasets.

Newly generated datasets with a higher number of prompts (leveraging mutated_prompt_base and a model mutator) are fine as well; the only caveat is that we have to regenerate the whole chart for measurements and do the relevant rebalancing.
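For anyone who wants to try that, a rough sketch of the regeneration loop could look like the following. The mutate helper and file names are placeholders (nothing shipped in this repo), and you would wrap whatever mutator model/API you prefer:

```python
import json

def mutate(prompt_base: str) -> str:
    """Placeholder for a call to the LLM 'mutator'; this helper is
    hypothetical and would wrap a real chat/completion API."""
    raise NotImplementedError

# Hypothetical file names.
with open("mitre_benchmark.json") as f:
    samples = json.load(f)

regenerated = []
for sample in samples:
    # Re-run the rewriting instructions to produce a fresh final test prompt.
    new_prompt = mutate(sample["mutated_prompt_base"])
    regenerated.append({**sample, "mutated_prompt": new_prompt})

with open("mitre_benchmark_regenerated.json", "w") as f:
    json.dump(regenerated, f, indent=2)
```

As noted above, results from such a regenerated dataset would not be comparable to the published charts until the tests are re-run for all models and categories.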