Add heuristics for adversarial suffixes

protectai / rebuff

LLM Prompt Injection Detector

https://playground.rebuff.ai

Apache License 2.0

1.13k stars 81 forks source link

Add heuristics for adversarial suffixes #58

Open seanpmorgan opened 1 year ago

seanpmorgan commented 1 year ago

Would be interesting to see what type of heuristics can be applied against adversarial suffixes. As background: https://arxiv.org/abs/2307.15043 https://github.com/llm-attacks/llm-attacks

To be clear, this wouldn't be a defense for all possible adversarial attacks. It does seem like we could screen some though.

ristomcgehee commented 1 year ago

What would you think about instead using a machine learning classifier? We could generate a list of several hundred or thousand adversarial suffixes, and then train a machine learning algorithm to classify text as adversarial vs non-adversarial. It would probably need to be a neural network in order to handle the complexities of language, but if it was non-transformer based, I would think it wouldn't have the same underlying weakness as the LLM. A determined attacker could still train a suffix generator that avoids our classifier, but it would significantly increase attacker costs.

It seems to me that coming up with heuristics would be quite challenging for this. I kinda think that one needs an understanding of the normal language in order to recognize a suffix as suspicious, and it would be laborious to try to encode that understanding manually.

ristomcgehee commented 1 year ago

After thinking about this more, perhaps a better approach would be to fine tune an existing LLM. The fine-tuned LLM could be trained to recognize a wide variety of prompt injection attacks, not just adversarial suffixes. I think fine tuning could help with situations where the attacker tries to prompt inject Rebuff itself. I expect that sooner or later, OpenAI will have some mechanism to share fine-tuned models. It also seems it might be possible to fine tune Llama 2 and distribute the modified weights.

seanpmorgan commented 1 year ago

So the issue with using a machine learning classifier here is that a gradient based adversarial input can be crafted to simultaneously trick the LLM and the classifier (especially since the model would be publicly available). We want to rely on more traditional "heuristics" to infer that a crafted input is suspicious. The current heuristics we have built in are pretty basic, but we can utilize more advanced grammar parsing etc.

That doesn't mean we can't use ML models as defense layers in general though, just that it's not the solution for gradient based attacks. I think it's a great idea in general and we can start working on #13 to support that and other modular defenses that we want to add.

ristomcgehee commented 1 year ago

Yeah, you're right that an adversary can trick both the LLM and the classifier. I'm just having trouble thinking of heuristics that might work against this sort of attack. Though maybe if I had more knowledge of more traditional NLP, I'd be able to think of some ideas.

I wonder if adversarial suffix attacks are similar to each other in vector space? Perhaps the vector similarity defense could help here.