msakarvadia / llm_bias

Investigating if we can find circuits in LLMs that reinforce human-biases found in training data
MIT License

Check if discriminator can be trained directly on comments #3

Open msakarvadia opened 4 months ago

msakarvadia commented 4 months ago

Right now, we are training the discriminator on the model's generated prediction for a prompt -- this is causing us to exhaust memory. Can we instead train a discriminator to directly classify whether different writing samples come from different demographic groups, and then use that discriminator to optimize the prompt?
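
For reference, here is a minimal sketch of what training the discriminator directly on writing samples might look like. This is just an illustration, not the repo's actual code: the encoder choice, the placeholder `samples` / `group_labels` data, and the `GroupDiscriminator` class are all hypothetical. The point is that classifying raw comments only requires a small encoder in memory, not the full generating model.

```python
# Sketch: train a demographic-group discriminator directly on writing samples.
# Assumptions (hypothetical, not from the repo): PyTorch + HuggingFace transformers,
# a small pretrained encoder, and placeholder comment data.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel


class GroupDiscriminator(nn.Module):
    """Classify which demographic group a writing sample comes from."""

    def __init__(self, encoder_name="distilbert-base-uncased", n_groups=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, n_groups)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        pooled = hidden[:, 0]  # first-token pooling
        return self.head(pooled)


tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Placeholder data: raw comments and their group labels (no LLM generation step).
samples = ["example comment from group A", "example comment from group B"]
group_labels = torch.tensor([0, 1])

model = GroupDiscriminator()
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

batch = tokenizer(samples, padding=True, truncation=True, return_tensors="pt")
for _ in range(3):  # a few steps just to illustrate the loop
    logits = model(batch["input_ids"], batch["attention_mask"])
    loss = loss_fn(logits, group_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Once such a discriminator is trained, its loss (or logits) could in principle serve as the signal for optimizing the prompt, without ever materializing the full model's generations during discriminator training.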