Closed: SCccc21 closed this issue 6 months ago.

Hi,

I'm currently reviewing the dataset detailed in Section A.5 of your paper and was wondering whether it has been made available in an open-source format. Could you please provide a link, or guide me on how to access it if it is available?

Additionally, I'm interested in understanding the resilience of the expert model described in your study. Specifically, has the expert model been directly evaluated for its performance against jailbreaking attacks? I would appreciate any insights or results you could share regarding this.
Hi,
Sorry for the late reply (I don't know why there was no email notification for this issue). The dataset is already available at datasets/seed_reject.json. To fine-tune the model, you can run exp/finetune.py: it first generates responses to the seed questions, then builds the training dataset and uses LoRA to tune an expert model. You can also use a well-known SFT library such as Axolotl, which already has solid LoRA support.
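For anyone landing here later, the LoRA step boils down to something like the sketch below using Hugging Face transformers/peft. This is not the repo's actual code: the base model name, the JSON field names ("question"/"response"), and all hyperparameters are assumptions, so check exp/finetune.py for the real pipeline.

```python
# Minimal LoRA fine-tuning sketch (assumptions: base model name, JSON schema,
# hyperparameters). See exp/finetune.py for the actual implementation.
import json

import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.bfloat16)

# Wrap the base model with a small LoRA adapter; only these weights train.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# Load the 36 seed samples; "question"/"response" are assumed field names.
with open("datasets/seed_reject.json") as f:
    seed = json.load(f)

def tokenize(example):
    text = example["question"] + "\n" + example["response"]
    out = tokenizer(text, truncation=True, max_length=512,
                    padding="max_length")
    # Sketch-level shortcut: a real run would mask pad tokens in the labels.
    out["labels"] = out["input_ids"].copy()
    return out

ds = Dataset.from_list(seed).map(tokenize)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="expert_lora", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=ds,
).train()
```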
> Has the expert model been directly evaluated for its performance against jailbreaking attacks?

We use the attack prompts generated against the base model to test the defense performance. We showed that even when the attack prompts are transferred from the base model, the defense performance of the expert model is still not very good, especially for optimization-based attacks. The results can be found in Table 7 in the appendix.
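Concretely, the transfer test can be thought of as the sketch below: replay the base-model attack prompts against the expert model and count how often it still refuses. This is a hypothetical illustration, not the paper's evaluation script; the refusal_rate helper and the string-matching refusal markers are my own simplifications of whatever judging method the paper uses.

```python
# Hypothetical transfer evaluation: attack prompts crafted against the base
# model are replayed on the expert model, and we count how often the expert
# model still refuses. The refusal-marker heuristic is a stand-in for the
# paper's actual judging method.
def refusal_rate(model, tokenizer, attack_prompts):
    refused = 0
    for prompt in attack_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=128)
        # Decode only the newly generated tokens, not the prompt itself.
        reply = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)
        refused += any(marker in reply for marker in
                       ("I'm sorry", "I cannot", "I can't"))
    return refused / len(attack_prompts)
```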
Thanks for your reply! I've checked seed_reject.json, and there are only 36 samples in it. But in your paper, you collect "32 harmful queries spanning 16 harmful categories". Is the full dataset available online? Or, in your experiments, is the expert model fine-tuned on these 36 samples only?
Ahhhh! Sorry, I just found I had a typo... It should be 36 harmful queries spanning 18 harmful categories! Yes, you are right, the expert model is fine-tuned on these 36 samples only, as we want our defense to be as lightweight as possible. Of course, we can release the dataset for each model!
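For a quick sanity check of those counts once you have the file, something like the snippet below works; "category" is an assumed field name, so adjust it to the actual schema of seed_reject.json.

```python
import json

# "category" is an assumed field name; adjust to the actual schema.
with open("datasets/seed_reject.json") as f:
    seed = json.load(f)

print(len(seed))                           # expect 36 samples
print(len({s["category"] for s in seed}))  # expect 18 categories
```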
Thanks! Got it.