xszheng2020 closed this issue 6 months ago
Hi Xiaosen,
Thanks for your interest in our work!
This is what we describe in the paper as self-transfer:
"We leverage the adversarial suffix found by random search for a simpler harmful request as the initialization for RS on more challenging requests. We refer to this approach as self-transfer."
There is no script for this procedure: you just go to the log files of a run with the default initialization (such as https://github.com/tml-epfl/llm-adaptive-attacks/blob/main/attack_logs/exps_llama2-7b_plain_init.log), extract an adversarial suffix (ideally, optimized for as many iterations as possible) that is successful on some example, and then insert it in main.py as adv_init.
Interestingly, this approach often leads to a transferable adversarial suffix (or, at least, a good starting point for a subsequent run of random search), even though it's crafted for a single model and a single request.
I hope that helps.
Best, Maksym
Thank you so much for the timely and detailed reply!
Hi, @max-andr
Thanks for your great work.
Could you please share the script for searching for the adv_init, such as the one for Llama-2?