Build typology of model behaviours

How are we going to store these?

Data structures involved:

Hierarchy of behaviours. Top-level nodes are currently Chat, Tasks, Meta, and Safety
Per behaviour type:
- Behaviour ID: an identifier for each policy. These are primary keys and should remain static.
- How do we structure these? I expect policies to float around a bit as the typology settles. I'm fairly happy to take a single-letter code from a set of four, followed by a three-digit zero-padded number; this assumes that policies won't move between categories much. Maybe we could add a final letter or single word for subpolicies (no need to start at "a").
- Behaviour name: a text name for each policy.
- Behaviour description: a characterisation of the input/output covered by this behaviour.
- Behaviour example prompts: at least two prompts that could test for the policy, which should get a mitigation/deflection message if the behaviour is not permitted.
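
A minimal sketch of how one behaviour record and its ID format might look, to make the proposal concrete. Everything named here is an assumption for illustration: the category letters (`C`, `T`, `M`, `S`, taken from the first letters of the current top-level nodes), the lowercase-only subpolicy suffix, and the `Behaviour` and `parse_behaviour_id` names. None of it is existing garak code.

```python
import re
from dataclasses import dataclass

# Assumed category letters: first letter of each current top-level node.
CATEGORIES = {"C": "Chat", "T": "Tasks", "M": "Meta", "S": "Safety"}

# One category letter, three zero-padded digits, then an optional
# subpolicy suffix (assumed here to be lowercase letters only).
BEHAVIOUR_ID = re.compile(r"([CTMS])(\d{3})([a-z]+)?")


def parse_behaviour_id(behaviour_id):
    """Split an ID into (category name, number, subpolicy suffix)."""
    match = BEHAVIOUR_ID.fullmatch(behaviour_id)
    if match is None:
        raise ValueError(f"malformed behaviour ID: {behaviour_id!r}")
    letter, number, suffix = match.groups()
    return CATEGORIES[letter], int(number), suffix or ""


@dataclass(frozen=True)  # frozen, since IDs are primary keys and should stay static
class Behaviour:
    id: str
    name: str
    description: str
    example_prompts: tuple

    def __post_init__(self):
        # Reject malformed IDs and records with fewer than two example prompts.
        parse_behaviour_id(self.id)
        if len(self.example_prompts) < 2:
            raise ValueError("each behaviour needs at least two example prompts")
```

One consequence of this scheme is that the category is baked into the primary key, so moving a policy to a different top-level node would mean issuing a new ID.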
Connections to policies: probes, payloads, or even prompts should be connected to policy types; if there's a hit, the policy has been breached.
- Does this mean all were breached? Or just one?
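
To make the open question above concrete, here is a rough sketch of one possible reading, in which a hit on a probe flags every behaviour linked to that probe as a candidate breach. The probe names, the behaviour IDs, and the `candidate_breaches` function are all hypothetical, and this does not settle whether "all" or "just one" is the right interpretation.

```python
# Hypothetical mapping from probe name to the behaviour IDs it is linked to.
PROBE_POLICIES = {
    "probe_alpha": ["S001", "S002"],
    "probe_beta": ["C010"],
}


def candidate_breaches(probe_hits, probe_policies=PROBE_POLICIES):
    """Return the behaviour IDs linked to any probe that registered a hit.

    This takes the broad reading: one hit flags every linked behaviour.
    Narrowing it down to the behaviour actually breached would need more
    information than a per-probe hit/no-hit result.
    """
    breached = set()
    for probe, hit in probe_hits.items():
        if hit:
            breached.update(probe_policies.get(probe, []))
    return sorted(breached)
```

Under this reading, a probe linked to several behaviours cannot tell them apart, which suggests either keeping links one-to-one or recording the link at a finer grain (per payload or per prompt).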
