uw-nsl / SafeDecoding

Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
https://arxiv.org/abs/2402.08983
MIT License

why is `output_expert` the output of the expert model? #3

Closed · shanpoyang654 closed this issue 5 months ago

shanpoyang654 commented 5 months ago

Thanks for your code!

I am reaching out to discuss some observations I've made while utilizing your codebase.

I apologize for my confusion, but I am having trouble understanding the purpose of the `output_expert` variable in your code. It appears to be just a copy of the same output that `output_base` comes from, i.e. the output of the original model. Could you please explain the reasoning behind this, and why `output_expert` is not simply a reference to `output_base`?

Below is the code in `defense.py` that I'm confused about.

```python
inputs_duplicated = {k: v.repeat(2, 1) for k, v in inputs.items()}  # [2, length]

outputs = self.model.generate(**inputs_duplicated,
                              adapter_names=self.adapter_names,
                              generation_config=gen_config,
                              pad_token_id=self.tokenizer.pad_token_id,
                              return_dict_in_generate=True,
                              output_scores=True,)

output_base = copy.deepcopy(outputs)
output_expert = copy.deepcopy(outputs)
output_base.sequences = output_base.sequences[0].unsqueeze(0)    # [1, length]
output_base.scores = output_base.scores[0][0].unsqueeze(0)
output_expert.sequences = output_expert.sequences[1].unsqueeze(0)
output_expert.scores = output_expert.scores[0][1].unsqueeze(0)
```

Thank you for your time and consideration. Look forward to your advice and reply.

fly-dust commented 5 months ago

Hi,

Thanks for reaching out! I designed it like that because the data structure returned from `model.generate`, when we use both the base and expert models, is quite unusual. Basically, the `outputs` variable contains the output sequences and scores from both the original model and the expert model. So I copied the entire output, but only kept the base part in `output_base` and the expert part in `output_expert`.
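
Roughly, the structure looks like this (a sketch only, assuming `adapter_names = ['base', 'expert']`, so batch row 0 comes from the base model and row 1 from the expert):

```python
# Sketch of how the single generate() call packs both models' outputs
# (row 0 = base, row 1 = expert, matching adapter_names = ['base', 'expert']).

base_sequence   = outputs.sequences[0]   # tokens generated by the base model,   shape [length]
expert_sequence = outputs.sequences[1]   # tokens generated by the expert model, shape [length]

# outputs.scores is a tuple with one tensor per generated step, each of shape [2, vocab_size]
base_first_step_logits   = outputs.scores[0][0]   # base model scores for the first new token
expert_first_step_logits = outputs.scores[0][1]   # expert model scores for the first new token
```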

Sorry for this inelegant implementation that may lead to confusion. I may need to add some comments or make this part clearer.

shanpoyang654 commented 5 months ago

> Hi,
>
> Thanks for reaching out! I designed it like that because the data structure returned from `model.generate`, when we use both the base and expert models, is quite unusual. Basically, the `outputs` variable contains the output sequences and scores from both the original model and the expert model. So I copied the entire output, but only kept the base part in `output_base` and the expert part in `output_expert`.
>
> Sorry for this inelegant implementation that may lead to confusion. I may need to add some comments or make this part clearer.

Thank you for your kind reply! But I'm still confused: the outputs seem to be generated by the same `self.model` instance, and I don't see where the original model and the expert model differ in this code: `outputs = self.model.generate(**inputs_duplicated, adapter_names=self.adapter_names, generation_config=gen_config, pad_token_id=self.tokenizer.pad_token_id, return_dict_in_generate=True, output_scores=True,)`

Thank you for your time.

shanpoyang654 commented 5 months ago


In the paper, it says that the expert model is fine-tuned from the original model using parameter-efficient fine-tuning techniques (like LoRA). But I did not see the expert model in the code. Sorry for my confusion, and thank you for your time and reply!

fly-dust commented 5 months ago

Both the original model and the expert model are loaded in `defense.py` (Lines 99-108). PEFT (via a feature that is still not merged into the main branch) allows you to specify which LoRA adapter to use during generation. For example, if `adapter_names = ['base']`, it will return the generation from the base model; if `adapter_names = ['base', 'expert']`, it will return the generation from both the base model and the expert model. The reason for this design is to save GPU memory during inference (you don't need to load two separate models). You can also try loading two separate models.
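
As a rough sketch of the setup (the model name and adapter path below are placeholders, and passing `adapter_names` to `generate` relies on the not-yet-merged PEFT feature mentioned above, not mainline PEFT):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the original (base) model once.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                                  torch_dtype=torch.float16,
                                                  device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Attach the safety expert as a LoRA adapter instead of loading a second full model.
# "path/to/expert_lora" is a placeholder for the fine-tuned expert adapter.
model = PeftModel.from_pretrained(base_model, "path/to/expert_lora", adapter_name="expert")

# Duplicate the prompt so row 0 is decoded by the base model and row 1 by the expert.
inputs = tokenizer("Example prompt", return_tensors="pt").to(model.device)
inputs_duplicated = {k: v.repeat(2, 1) for k, v in inputs.items()}

# adapter_names is the per-row adapter selection from the patched PEFT branch:
# 'base' means "no adapter" (original model), 'expert' applies the LoRA adapter.
outputs = model.generate(**inputs_duplicated,
                         adapter_names=["base", "expert"],
                         return_dict_in_generate=True,
                         output_scores=True)
```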

We will make it clearer!

shanpoyang654 commented 5 months ago

> Both the original model and the expert model are loaded in `defense.py` (Lines 99-108). PEFT (via a feature that is still not merged into the main branch) allows you to specify which LoRA adapter to use during generation. For example, if `adapter_names = ['base']`, it will return the generation from the base model; if `adapter_names = ['base', 'expert']`, it will return the generation from both the base model and the expert model. The reason for this design is to save GPU memory during inference (you don't need to load two separate models). You can also try loading two separate models.
>
> We will make it clearer!

Thank you very much! I got it! And thanks for your idea and code! Your paper gave me a lot of inspiration! Wish you all the best!