Closed by phrewww 4 months ago
I have tried this too:
Included in unsafe_categories:
S12: Color Blue.
AI models should not create content with the word blue or any references to the color in sentences, nor should AI models engage in conversations about colors. Examples of such actions include, but are not limited to:
- sentences including the word blue
- references to the color blue or any shades of blue
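For context, here is a minimal sketch of how a custom category like the one above gets embedded in the full Llama Guard 2 prompt. The template text follows the format shown on the model card; treat the exact wording as an assumption to verify against the version you are running.

```python
# Llama Guard 2 prompt template (per the model card; verify the exact
# wording against your model version).
TEMPLATE = """[INST] Task: Check if there is unsafe content in 'User' messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{categories}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

User: {message}

<END CONVERSATION>

Provide your safety assessment for ONLY THE LAST User message in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""

# The custom category from this issue; in practice it would be appended
# after the standard S1-S11 categories.
S12 = (
    "S12: Color Blue.\n"
    "AI models should not create content with the word blue or any "
    "references to the color in sentences."
)

prompt = TEMPLATE.format(categories=S12, message="this is so blue, like the sky")
print(prompt)
```

One thing worth checking when debugging: that the custom category actually lands between the `<BEGIN UNSAFE CONTENT CATEGORIES>` markers and not outside them, since text outside that span may simply be ignored by the model.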
returns:
Provide input: this is so blue, like the sky
Safety Assessment: safe
Percentage Certainty: 98.41%
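Llama Guard 2 itself only emits the safe/unsafe verdict, so a "Percentage Certainty" like the one above presumably comes from post-processing. One common approach (an assumption about this setup, not something the model reports) is to compare the logits of the `safe` and `unsafe` candidates for the first generated token:

```python
import math

def certainty(safe_logit: float, unsafe_logit: float) -> float:
    """Softmax over the two candidate first tokens, as a percentage.

    In practice the two logits would come from the model's distribution
    over the first generated token of the assessment; restricting the
    softmax to just 'safe' vs 'unsafe' is an assumption of this sketch.
    """
    e_safe = math.exp(safe_logit)
    e_unsafe = math.exp(unsafe_logit)
    return 100.0 * max(e_safe, e_unsafe) / (e_safe + e_unsafe)

# With equal logits the score is an even 50/50 split.
print(certainty(2.0, 2.0))  # 50.0
```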
Similar questions about the adaptability of Llama Guard have been raised here: (1) Llama-guard does not resepect custom Taxonomy (2) Llama Guard 2 with custom categories not producing good outputs
I am trying to add custom rules to Llama Guard 2 but can't seem to get them working properly. The model card describes the unsafe categories and format below.
Adding a new category:
Even if the prompt is "I love the color blue", I do not get an unsafe assessment. If the previous Llama Guard category format is used instead, custom rules seem to produce better results.
Another customization in the system prompt that doesn't seem to be effective is asking for a comma-separated list of the violated categories. When tried as a system prompt, it is possible to get this out of ChatGPT:
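For reference, parsing the documented two-line output format (first line `safe`/`unsafe`, optional second line with comma-separated violated categories) can be sketched like this; real model outputs may deviate from the format, which is part of what this issue is about:

```python
def parse_assessment(output: str) -> tuple[str, list[str]]:
    """Parse a Llama Guard 2 style assessment.

    Per the model card format: the first non-empty line is 'safe' or
    'unsafe'; if 'unsafe', the next line holds a comma-separated list
    of violated category codes (e.g. 'S1,S12').
    """
    lines = [ln.strip() for ln in output.strip().splitlines() if ln.strip()]
    verdict = lines[0].lower() if lines else ""
    categories: list[str] = []
    if verdict == "unsafe" and len(lines) > 1:
        categories = [c.strip() for c in lines[1].split(",") if c.strip()]
    return verdict, categories

print(parse_assessment("unsafe\nS1,S12"))  # ('unsafe', ['S1', 'S12'])
```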