Open lefnire opened 3 years ago
re-opening, need more input on banned words to include. Currently includes r"(suicide|kill|die)"
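One caveat with a bare pattern like that: without word boundaries it matches substrings, so "die" fires inside "diet" — a bad false positive given the disordered-eating concern below. A minimal sketch of a boundary-anchored version (the function name here is just for illustration):

```python
import re

# \b word boundaries prevent substring false positives:
# "die" no longer matches inside "diet", nor "kill" inside "killjoy".
BANNED = re.compile(r"\b(suicide|kill|die)\b", re.IGNORECASE)

def contains_banned(text: str) -> bool:
    """Return True if any banned word appears as a whole word."""
    return bool(BANNED.search(text))
```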
This is more expensive, but for the use case it might be worth it. You might consider warning words that then trigger running the content through a classification model — kind of like sentiment classifiers, but for detecting harmful content. They tend to be a bit more accurate than a general-purpose model.
Warning words could also cap conversations at shorter lengths (in both character count and turn count), since more text and more turns tend to degrade safety measures.
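A rough sketch of that two-stage gate — cheap regex screen first, the expensive classifier only on hits, with tighter limits once warning words appear. The classifier is passed in as a callable, and the limits and threshold are made-up placeholders, not values from this project:

```python
import re
from typing import Callable

WARN = re.compile(r"\b(suicide|kill|die)\b", re.IGNORECASE)

# Hypothetical limits; real values would need tuning.
MAX_CHARS = 2000
MAX_TURNS = 10

def allow_message(
    text: str,
    turn_count: int,
    classify: Callable[[str], float],  # returns harm probability in [0, 1]
    threshold: float = 0.5,
) -> bool:
    """Gate a message: regex screen first, classifier only on warning hits."""
    if not WARN.search(text):
        return True  # no warning words: skip the expensive model entirely
    # Warning words present: enforce shorter conversations ...
    if len(text) > MAX_CHARS or turn_count > MAX_TURNS:
        return False
    # ... and run the harmful-content classifier on what remains.
    return classify(text) < threshold
```

Passing `classify` as a callable keeps the gate decoupled from whichever model ends up doing the classification.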
Thanks for chiming in @deilann! I've thought about this since; here are some misc thoughts. The two locations this could show up are:
From my adversarial work I'd say it's probably worst at promoting disordered eating, because of how troubled LLMs are with math. GPT-4 is slightly better than 3.5 at it. It's particularly guarded on self-harm and CSAM, far more so than on other topics.
I've gotten some really dangerous QA responses. We'll want to manually scrub these; I'll start with a few in ace9d8436535abcb7dda5e11f427eeefe77b4415