vinid / safety-tuned-llamas

ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.

Prompt and seed instructions for malicious instruction generation #3

Open RobertKirk opened 9 months ago

RobertKirk commented 9 months ago

In Appendix B.2 of the paper, you briefly describe how you generated the malicious instructions dataset. Could you share the prompt and seed instructions you used to generate it? Also, how did you generate the tags column in the data here: https://github.com/vinid/instruction-llms-safety-eval/blob/main/data/evaluation/I-MaliciousInstructions.json

Relatedly, where did you get the seed instructions from? Were they taken from the rephrased anthropic-HH safety instructions, or somewhere else?

Thanks in advance, very cool paper!

vinid commented 9 months ago

Hi!

I'll add these details to the repo as soon as possible (and also to the paper since they are missing, thanks for pointing this out).

As for the tags, we didn't use them in practice. We collected them through manual annotation, to better understand which kinds of instructions were generated.

The seeds were written manually. We used ~30 seeds.