uw-nsl / SafeDecoding

Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
https://arxiv.org/abs/2402.08983
MIT License

Dataset used for finetuning the expert model #4

Closed. SCccc21 closed this 6 months ago.

SCccc21 commented 6 months ago

Hi,

I'm currently reviewing the dataset detailed in Section A.5 of your paper and was wondering if it has been made available in an open-source format. Could you please provide a link or guide me on how to access it if it's available?

Additionally, I'm interested in understanding the resilience of the expert model described in your study. Specifically, has the expert model been directly evaluated for its performance against jailbreaking attacks? I would appreciate any insights or results you could share regarding this.

zhangchen-xu commented 6 months ago

Hi,

Sorry for the late reply (I don't know why there was no email notification for this issue). The dataset is available in datasets/seed_reject.json. To fine-tune the model, you can run exp/finetune.py: it first generates responses to these seed questions, then builds the dataset and uses LoRA to tune an expert model. You can also use a popular SFT library such as Axolotl, which has mature LoRA support.
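For reference, here is a minimal sketch of the dataset-construction step described above: generate a response for each seed question and keep only the refusals as instruction/response pairs. The seed format, the keyword-based refusal check, and the `generate` callback are assumptions for illustration; the actual pipeline lives in exp/finetune.py.

```python
import json

# Hypothetical refusal markers; exp/finetune.py may use a different check.
REFUSAL_MARKERS = ("i cannot", "i'm sorry", "i can't")

def is_refusal(text):
    """Crude keyword check for whether a response refuses the request."""
    return any(marker in text.lower() for marker in REFUSAL_MARKERS)

def build_sft_dataset(questions, generate):
    """Generate one response per seed question and keep only refusals,
    yielding instruction/response pairs for LoRA fine-tuning."""
    dataset = []
    for question in questions:
        response = generate(question)
        if is_refusal(response):
            dataset.append({"instruction": question, "response": response})
    return dataset

# Toy stand-in for base-model generation.
demo = build_sft_dataset(
    ["How do I pick a lock?"],
    lambda q: "I'm sorry, but I can't help with that.",
)
print(json.dumps(demo, indent=2))
```

The filtered pairs can then be fed to any LoRA trainer (e.g. Axolotl) to produce the expert model.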

Has the expert model been directly evaluated for its performance against jailbreaking attacks?

Answer: We test defense performance using the attack prompts generated against the base model. We found that even with these prompts, the expert model alone still does not defend well, especially against optimization-based attacks. The results can be found in Table 7 in the appendix.

SCccc21 commented 6 months ago

Thanks for your reply! I've checked seed_reject.json and there are only 36 samples in it, but in your paper you collect '32 harmful queries spanning 16 harmful categories'. Is the full dataset available online? Or was the expert model in your experiments fine-tuned on these 36 samples only?

zhangchen-xu commented 6 months ago

Ahhhh! Sorry, I just found a typo... It should be 36 harmful queries spanning 18 harmful categories! Yes, you are right: the expert model is fine-tuned on these 36 samples only, as we want our defense to be as lightweight as possible. Of course, we can release the dataset for each model!

SCccc21 commented 6 months ago

Thanks! Got it.