vinid / safety-tuned-llamas

ICLR 2024 paper. Showing properties of safety tuning and exaggerated safety.

Can you confirm lower is better for the evals? #4

Closed huu4ontocord closed 7 months ago

huu4ontocord commented 7 months ago

In particular, we are seeing that the scores for PhysicalSafetySafe and Alpaca are high, but the generations look pretty safe to us.

https://huggingface.co/aurora-m/aurora-m-v0.1-biden-harris-redteamed/blob/main/README.md

vinid commented 7 months ago

Hi!

Not sure I got this right, but if you are asking why the harmfulness reward models return high scores for PhysicalSafetySafe and Alpaca, this is explained in Appendix C.2 (https://arxiv.org/pdf/2309.07875.pdf). Basically, we also had to train a second reward model on some general-purpose instruction data to balance out the fact that our first reward model was skewed towards labeling many things as unsafe.

Hope this helps!

huu4ontocord commented 7 months ago

Actually, this makes a lot of sense. So we need to test with the other reward model to get the correct numbers.

huu4ontocord commented 7 months ago

Can you share the reward model? As an FYI, your reward model uses Open Assistant, which is my project ... lol.

huu4ontocord commented 7 months ago

Is it one of these?

```python
REDTEAM_MODEL = "safepaca/absolute-harmfulness-predictor-redteam"
READTEAM_OSST_MODEL = "safepaca/absolute-harmfulness-predictor-redteam-osst"
```

vinid commented 7 months ago

eheh awesome to know!

Yeah, that's right! The second one uses a small sample from the Open Assistant dataset to balance out the fact that the red-teaming data in the Anthropic dataset is, by design, skewed towards red-teaming only.
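
For reference, here is a minimal sketch of how one might score a prompt/response pair with both predictors. It is not the exact evaluation code from the paper: it assumes each model loads as a single-output regression head via `AutoModelForSequenceClassification`, and the input formatting in the paper's pipeline may differ. For these harmfulness predictors, a higher score means the response is rated as more harmful, so lower is better for safe completions.

```python
# Hedged sketch: score one prompt/response pair with both absolute-harmfulness
# predictors. Assumes each model is a single-output sequence-regression head;
# a higher score means the response is rated as more harmful.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

REDTEAM_MODEL = "safepaca/absolute-harmfulness-predictor-redteam"
READTEAM_OSST_MODEL = "safepaca/absolute-harmfulness-predictor-redteam-osst"


def harmfulness_score(model_name: str, prompt: str, response: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()
    # Assumed input format: prompt and response passed as a text pair;
    # the paper's pipeline may join conversation turns differently.
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()


prompt = "How do I dispose of old batteries?"
response = "Take them to a certified recycling drop-off point."
for name in (REDTEAM_MODEL, READTEAM_OSST_MODEL):
    print(name, round(harmfulness_score(name, prompt, response), 3))
```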

huu4ontocord commented 7 months ago

Thank you!

huu4ontocord commented 7 months ago

resolved