Hierarchy of behaviours. Top-level nodes are currently Chat, Tasks, Meta, and Safety.
Per behaviour type:
Behaviour ID: An identifier for each policy. IDs are primary keys and should remain static.
How do we structure these? I expect policies to move around a bit as the typology settles. I'm fairly happy with a single-letter code drawn from a set of four, followed by a three-digit zero-padded number; this assumes that policies won't move between categories much. We could also add a final letter or single word for subpolicies (no need to start at "a").
Behaviour name: A text name for each policy.
Behaviour description: Characterisation of the input/output covered by this behaviour.
Behaviour example prompts: At least two prompts that could test for the policy; each should get a mitigation/deflection message if the behaviour is not permitted.
Connections to policies: Probes, payloads, or even prompts should be connected to policy types; if there's a hit, the policy has been breached.
This groups policy scans: what will a model do without being attacked?
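
As a rough illustration of the ID scheme and per-behaviour fields described above, here is a minimal Python sketch. Everything named in it is an assumption made for illustration: the category letters C, T, M, S (taken from the initials of the current top-level nodes), the lowercase subpolicy suffix, the `Behaviour` record, and the `breached_policies` helper. None of these are settled parts of the typology.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed category letters, one per top-level node (hypothetical choice).
CATEGORY_CODES = {"C": "Chat", "T": "Tasks", "M": "Meta", "S": "Safety"}

# One category letter, a three-digit zero-padded number, then an optional
# subpolicy suffix (a single lowercase letter or a lowercase word).
BEHAVIOUR_ID = re.compile(r"([CTMS])(\d{3})([a-z]+)?")


def parse_behaviour_id(behaviour_id: str) -> tuple[str, int, str | None]:
    """Split an ID into (category name, number, subpolicy suffix or None)."""
    match = BEHAVIOUR_ID.fullmatch(behaviour_id)
    if match is None:
        raise ValueError(f"malformed behaviour ID: {behaviour_id!r}")
    code, number, suffix = match.groups()
    return CATEGORY_CODES[code], int(number), suffix


@dataclass(frozen=True)
class Behaviour:
    """One policy entry carrying the fields listed per behaviour type."""

    behaviour_id: str  # primary key; should remain static
    name: str
    description: str
    example_prompts: tuple[str, ...]

    def __post_init__(self) -> None:
        parse_behaviour_id(self.behaviour_id)  # raises on a malformed ID
        if len(self.example_prompts) < 2:
            raise ValueError("at least two example prompts are required")


def breached_policies(
    probe_policies: dict[str, set[str]], hits: list[str]
) -> set[str]:
    """Return the IDs of policies connected to any probe that registered a hit."""
    breached: set[str] = set()
    for probe in hits:
        breached |= probe_policies.get(probe, set())
    return breached
```

For example, under these assumptions `parse_behaviour_id("T012q")` splits into the Tasks category, number 12, and subpolicy suffix `q`. Because the suffix is free-form, subpolicies need not start at "a". Keeping the category letter inside the primary key is the part most sensitive to policies changing category, so that choice is worth revisiting if the typology keeps shifting.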