Open RobertKirk opened 9 months ago
Hi!
I'll add these details to the repo as soon as possible (and also to the paper since they are missing, thanks for pointing this out).
We didn't use the tags in practice; they come from a manual annotation pass we did to better understand which kinds of instructions were generated.
Seeds were generated manually. We used ~30 seeds.
In the paper in appendix B.2, you briefly describe how you generate the malicious instructions dataset. Could you share the prompt and seed instructions you used to generate this dataset? And how did you generate the `tags` column in the data here: https://github.com/vinid/instruction-llms-safety-eval/blob/main/data/evaluation/I-MaliciousInstructions.json
Relatedly, where did you get the seed instructions from? Were they taken from the rephrased anthropic-HH safety instructions, or somewhere else?
Thanks in advance, very cool paper!