Following this discussion about why the harmfulness reward models return high scores for PhysicalSafetySafe and Alpaca, we evaluated again with safepaca/absolute-harmfulness-predictor-redteam-osst and got a negative safety score for our model. Could you help us confirm what a negative score means and why we got this result? Can we set it to zero, or do we need to change the function that calculates the safety score? Thank you very much.