Prompt and seed instructions for malicious instruction generation

vinid / safety-tuned-llamas

ICLR2024 Paper. Showing properties of safety tuning and exaggerated safety.

Appendix B.2 of the paper briefly describes how you generated the malicious instructions dataset. Could you share the prompt and seed instructions you used to generate it? Also, how did you generate the tags column in this file: https://github.com/vinid/instruction-llms-safety-eval/blob/main/data/evaluation/I-MaliciousInstructions.json

Relatedly, where did the seed instructions come from? Were they taken from the rephrased Anthropic HH safety instructions, or from somewhere else?

Thanks in advance, very cool paper!
