protectai / rebuff

LLM Prompt Injection Detector
https://playground.rebuff.ai
Apache License 2.0
1.12k stars 80 forks source link

Add detection strategy module abstraction #13

Open woop opened 1 year ago

woop commented 1 year ago

Currently, the detection strategies are hard coded in the main application. All of the core logic in Rebuff should be contained within a single library, with detection strategies having their own abstraction. Ideally users can also contribute their own strategies.

ristomcgehee commented 1 year ago

Here's the approach I think I would take if I worked on this issue.

When initializing the SDK, the user configures zero or more strategies, where each strategy consists of one or more checks and the score threshold for each check. For example, a user could configure a "fast_and_cheap" strategy that includes the heuristic check, the vector store check, and GPT-3.5. Another configured strategy could be the "slow_and_thorough" strategy that makes multiple calls to GPT-4. One of the strategies will be marked as the default. Then at detection time, the client optionally picks one of the strategies to execute. If the client doesn't pick a strategy, the default one will be used. If the user does not configure a strategy when initializing the SDK, we'll enable a default strategy that is reasonably effective without being too slow.

I'd probably do this in 2 phases, where Phase 1 includes everything I described above and Phase 2 will add the ability to add custom checks.

I'd like to note that this would involve breaking changes to the API when invoking it for detection.

Another possible idea is to add different logic other than "trigger detection if any check fails". Perhaps a weighed voting system or a way to chain checks based on the results of other checks. But I think that can be done as a future improvement in a different issue.

Our code currently uses the term "check" which is what I've been using here, but I think a better term might be "tactic". The user would configure a collection of "tactics" to create a "strategy".

How does all that sound?