ocdevel / gnothi

Gnothi is an open-source AI journal and toolkit for self-discovery. If you're interested in getting involved, we'd love to hear from you.
https://gnothiai.com
GNU Affero General Public License v3.0

Remove sensitive / dangerous responses #63

Open lefnire opened 3 years ago

lefnire commented 3 years ago

I've gotten some really dangerous QA responses. We'll want to manually scrub these; I'll start with a few in ace9d8436535abcb7dda5e11f427eeefe77b4415

lefnire commented 3 years ago

Re-opening; need more input on banned words to include. The list currently includes r"(suicide|kill|die)"
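
For context, a minimal sketch of what that scrub could look like in Python. The helper name, the response shape, and the word-boundary note are illustrative assumptions, not the actual Gnothi code; only the three-word pattern above is confirmed:

```python
import re

# Current banned-word pattern from this issue; word boundaries (\b) might be
# worth adding so e.g. "skill" or "diet" don't trip the filter.
BANNED = re.compile(r"(suicide|kill|die)", re.IGNORECASE)

def scrub_qa_responses(responses: list[str]) -> list[str]:
    """Drop any generated QA response containing a banned word (illustrative helper)."""
    return [r for r in responses if not BANNED.search(r)]
```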

deilann commented 1 year ago

This is more expensive, but for this use case it might be worth it. You might consider warning words that then trigger running the content through a classification model, similar to the sentiment classifiers but for detecting harmful content. Dedicated classifiers tend to be a bit more accurate than a general-purpose model.

Warning words could also be used to cap conversation length (in both character count and turns), since more text and more turns tend to degrade safety measures.
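
A rough sketch of that two-stage idea, assuming a Hugging Face text-classification pipeline is available; the extra warning words, model name, and threshold are placeholders, not recommendations:

```python
import re
from transformers import pipeline  # assumes a harmful-content classifier is available

# Cheap warning-word gate; reusing the banned-word pattern from above.
WARNING_WORDS = re.compile(r"(suicide|kill|die)", re.IGNORECASE)

# Placeholder model; any harmful-content / toxicity classifier could slot in here.
harm_classifier = pipeline("text-classification", model="unitary/toxic-bert")

def is_harmful(text: str, threshold: float = 0.5) -> bool:
    """Only pay for the classifier when the cheap regex gate trips."""
    if not WARNING_WORDS.search(text):
        return False
    # Pipeline returns e.g. [{"label": "toxic", "score": 0.98}]; label names
    # and score semantics depend on the model chosen.
    result = harm_classifier(text[:512])[0]
    return result["score"] >= threshold
```

The same gate could also be the point where flagged conversations get capped on turn count or prompt length, per the note above.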

lefnire commented 1 year ago

Thanks for chiming in @deilann! I've thought about this since, and here are some misc thoughts. The two locations this could show up are:

deilann commented 1 year ago

From my adversarial work, I'd say it's probably worst about promoting disordered eating, because of how much trouble LLMs have with math. GPT-4 is slightly better than 3.5 at it. It's particularly guarded on self-harm and CSAM, far more than on other topics.