msakarvadia / llm_bias

Investigating if we can find circuits in LLMs that reinforce human-biases found in training data
MIT License

Obvious Model defenses #5

Open msakarvadia opened 2 months ago

msakarvadia commented 2 months ago

Can we tell a model "I am [insert identity here]" and get it to misclassify us?
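
A minimal sketch of one way to probe this, assuming an off-the-shelf Hugging Face classifier as a stand-in for the target model (the model name, texts, and identity prefixes here are illustrative assumptions, not anything from this repo): prepend an identity statement to otherwise identical inputs and check whether the prediction shifts.

```python
# Hypothetical probe: does stating an identity change the model's decision?
# Model and texts are placeholders chosen only to make the sketch runnable.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder model
)

base_text = "I applied for the position and I am confident I can do the job well."
identity_prefixes = ["", "I am a young professional. ", "I am an immigrant. "]

for prefix in identity_prefixes:
    result = classifier(prefix + base_text)[0]
    # If the label or score moves meaningfully with the identity prefix,
    # the model's decision is sensitive to the stated identity.
    print(f"{prefix or '(no identity)':30s} -> {result['label']} ({result['score']:.3f})")
```

If predictions flip (or scores move substantially) across prefixes, that would suggest the stated identity alone can steer the classification, which is the "obvious defense" failure mode this issue is asking about.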