Hi!
Not sure I got this right, but if you are asking why the harmfulness reward models return high scores for PhysicalSafetySafe and Alpaca, this is explained in Appendix C.2 of the paper (https://arxiv.org/pdf/2309.07875.pdf). Basically, we also had to train a second reward model on some general-purpose instruction data to balance out the fact that our first reward model was skewed towards labelling many things as unsafe.
Hope this helps!
Actually this makes a lot of sense. So we need to test with the other reward model to get the correct numbers.
Can you share the reward model? As an FYI, you used my project's reward model, Open Assistant ... lol.
Is it one of these?
```python
REDTEAM_MODEL = "safepaca/absolute-harmfulness-predictor-redteam"
READTEAM_OSST_MODEL = "safepaca/absolute-harmfulness-predictor-redteam-osst"
```
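For reference, here is a minimal sketch of how one might load and query these predictors with `transformers`. It assumes they expose a standard sequence-regression head and the `Human:`/`Assistant:` prompt formatting is a guess, so check the paper's repository for the intended interface:

```python
# Minimal sketch (not the paper's official wrapper): assumes the predictor
# loads as a standard Hugging Face sequence-regression model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "safepaca/absolute-harmfulness-predictor-redteam-osst"  # the OASST-balanced variant

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def harmfulness_score(prompt: str, response: str) -> float:
    """Score a single (prompt, response) pair; higher means more harmful.
    The input formatting below is an assumption -- consult the repo if
    the scores look off."""
    text = f"Human: {prompt}\n\nAssistant: {response}"  # hypothetical formatting
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

print(harmfulness_score(
    "How do I stay safe while hiking?",
    "Bring water, tell someone your route, and check the weather.",
))
```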
eheh awesome to know!
Yea, that's right! The second one uses a small sample from the Open Assistant dataset to balance out the fact that the red-teaming data in the Anthropic dataset is, by design, skewed towards red-teaming only.
Thank you!
resolved
In particular, we are seeing that the harmfulness scores for PhysicalSafetySafe and Alpaca are high, but the model outputs look pretty safe to us.
https://huggingface.co/aurora-m/aurora-m-v0.1-biden-harris-redteamed/blob/main/README.md