protectai / rebuff

LLM Prompt Injection Detector
https://playground.rebuff.ai
Apache License 2.0

How should I understand the canary word? #29

Closed Terrybthvi closed 1 year ago

Terrybthvi commented 1 year ago

How should I understand the canary word? Is it a check token, or is it simply a keyword? Does it mean the output of the large model must not contain this keyword? If it is a keyword, how do I verify it when I have multiple keywords?

Terrybthvi commented 1 year ago

Why is something logged when a canary word is leaked? What does a canary word leak mean?

Terrybthvi commented 1 year ago

Why should a canary word check be set up at all?

woop commented 1 year ago

Hey @Terrybthvi - The canary word is arbitrary; it is a marker, not a keyword with any meaning of its own. The idea is that the model should never leak the canary word in its output. If we see it being leaked, we are pretty certain an attack has happened, and we can then let you take corrective action in your application. When a leak happens, we log the user input that caused it in order to improve the collection of known attacks.
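
To make the idea concrete, here is a minimal sketch of a canary word check using only the Python standard library. It is not Rebuff's actual implementation: the helper names (`add_canary_word`, `is_canary_leaked`, `record_attack`), the comment-style placement of the canary in the prompt, and the in-memory `known_attacks` list are all assumptions made for illustration.

```python
import secrets
from typing import List, Tuple


def add_canary_word(prompt_template: str, n_bytes: int = 4) -> Tuple[str, str]:
    """Prepend a random canary word to the prompt (hypothetical helper)."""
    # The word is arbitrary: a fresh random hex token, not a meaningful keyword.
    canary = secrets.token_hex(n_bytes)
    # Assumption: the canary is embedded as a comment-like line before the prompt.
    guarded_prompt = f"<!-- {canary} -->\n{prompt_template}"
    return guarded_prompt, canary


def is_canary_leaked(completion: str, canary: str) -> bool:
    """Return True if the model output contains the canary word."""
    return canary in completion


def record_attack(user_input: str, known_attacks: List[str]) -> None:
    """Store the offending input so similar attacks can be recognised later."""
    # Assumption: a plain list stands in for whatever store is really used.
    known_attacks.append(user_input)


def check_completion(user_input: str, completion: str, canary: str,
                     known_attacks: List[str]) -> bool:
    """Check one model response; log the input if the canary leaked."""
    leaked = is_canary_leaked(completion, canary)
    if leaked:
        record_attack(user_input, known_attacks)
    return leaked
```

In this sketch one canary is generated per prompt, so there is only ever one value to look for in the output. If the model's response contains it, the user input most likely persuaded the model to reveal its instructions, which is why that input is worth recording.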

Terrybthvi commented 1 year ago

Thanks