msakarvadia / llm_bias

Investigating if we can find circuits in LLMs that reinforce human-biases found in training data
MIT License

Obvious Model defenses #5

Open msakarvadia opened 2 months ago

msakarvadia commented 2 months ago

Can we tell a model "I am [insert identity here]" and get it to misclassify us?
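
A minimal sketch of one way to probe this, assuming an off-the-shelf Hugging Face classifier as a stand-in for the target model (the model name, texts, and identity prefixes here are illustrative assumptions, not anything from this repo): prepend an identity statement to otherwise identical inputs and check whether the prediction shifts.

```python
# Hypothetical probe: does stating an identity change the model's decision?
# Model and texts are placeholders chosen only to make the sketch runnable.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder model
)

base_text = "I applied for the position and I am confident I can do the job well."
identity_prefixes = ["", "I am a young professional. ", "I am an immigrant. "]

for prefix in identity_prefixes:
    result = classifier(prefix + base_text)[0]
    # If the label or score moves meaningfully with the identity prefix,
    # the model's decision is sensitive to the stated identity.
    print(f"{prefix or '(no identity)':30s} -> {result['label']} ({result['score']:.3f})")
```

If predictions flip (or scores move substantially) across prefixes, that would suggest the stated identity alone can steer the classification, which is the "obvious defense" failure mode this issue is asking about.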