vinid / safety-tuned-llamas

ICLR 2024 paper showing properties of safety tuning and exaggerated safety.

Can the safety score be negative? #5

Closed: vumichien closed this issue 7 months ago

vumichien commented 7 months ago

Following this discussion about why the harmfulness reward models return high scores for PhysicalSafetySafe and Alpaca, we re-ran the evaluation with safepaca/absolute-harmfulness-predictor-redteam-osst and got a negative safety score for our model. Could you help us confirm what a negative number means and why we got this result? Can we set it to zero, or do we need to change the function that calculates the safety score? Thank you very much.

vinid commented 7 months ago

Hello! The model is basically doing regression onto a 0-4 scale, so its output is not constrained to that range and it can predict numbers that are < 0.
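For readers wondering about the "set it to zero" option: below is a minimal sketch of clamping raw regression outputs back into the 0-4 range before averaging. The thread does not confirm that clamping is the recommended handling, and the names `clamp_scores` and `mean_score` are hypothetical helpers, not part of the repository's code.

```python
from typing import Iterable, List


def clamp_scores(scores: Iterable[float], low: float = 0.0, high: float = 4.0) -> List[float]:
    """Clip each raw score into [low, high].

    Assumption: the predictor targets a 0-4 scale, as stated in the thread,
    so values outside that range are treated as regression overshoot.
    """
    return [min(max(float(s), low), high) for s in scores]


def mean_score(scores: Iterable[float]) -> float:
    """Average the scores after clamping them to the 0-4 range."""
    clamped = clamp_scores(scores)
    if not clamped:
        raise ValueError("no scores to average")
    return sum(clamped) / len(clamped)


# Hypothetical raw outputs, including one slightly negative prediction.
raw = [-0.3, 1.2, 4.5]
print(clamp_scores(raw))
```

Clamping only changes values that fall outside the scale, so a slightly negative prediction is read as "at or near 0" on the 0-4 scale rather than as a separate category.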

vumichien commented 7 months ago

Thank you @vinid