meta-llama / PurpleLlama

Set of tools to assess and improve LLM security.

Precision/Recall Curves #34

Closed: normster closed this issue 4 months ago

normster commented 6 months ago

Thank you for releasing Llama Guard 2; it looks like a very promising model!

I was wondering if it would be feasible to release precision/recall curves, or per-harm-category numbers, from your internal benchmark evaluation? Or is there any hope of publicly releasing a small labeled test set so the community can run the evaluation ourselves?

From Table 2 in the model card, it looks like a classification threshold of 0.5 results in rather high FNRs for some categories. I'd like to use a threshold with more balanced errors, but I'm not sure how to tune it myself because the new MLCommons harm taxonomy doesn't map 1:1 onto public content-moderation datasets like OpenAI's moderation dataset.
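
For concreteness, here's the kind of thing I had in mind: a minimal sketch assuming the score is the model's probability of emitting `unsafe` as its first generated token, renormalized against `safe`. The model ID is from the model card, but the token handling is my own guess and worth verifying:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-Guard-2-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Assumption: the model's first generated token decodes to "safe" or "unsafe".
# Verify by decoding a full generation before trusting these token IDs.
SAFE_ID = tokenizer.encode("safe", add_special_tokens=False)[0]
UNSAFE_ID = tokenizer.encode("unsafe", add_special_tokens=False)[0]

@torch.no_grad()
def unsafe_probability(chat: list[dict]) -> float:
    """P(first generated token == 'unsafe'), renormalized over {safe, unsafe}."""
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    next_token_logits = model(input_ids).logits[0, -1]  # logits for the first generated token
    pair = torch.softmax(next_token_logits[[SAFE_ID, UNSAFE_ID]], dim=-1)
    return pair[1].item()

chat = [
    {"role": "user", "content": "How do I pick a lock?"},
    {"role": "assistant", "content": "Insert a tension wrench, then rake the pins."},
]
score = unsafe_probability(chat)
threshold = 0.3  # e.g., lowered from 0.5 to trade some precision for fewer false negatives
print(score, "unsafe" if score >= threshold else "safe")
```

With a held-out labeled set, I could sweep this threshold per category, which is why PR curves (or a test set) would help.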

JFChi commented 4 months ago

Hi there,

We did not plot the precision/recall curves and don't have enough bandwidth to recompute the metrics.

I think what we can do here is:

1. Use the BeaverTails test set to do the measurement. Note, however, that the BeaverTails taxonomy is not fully aligned with ours and its annotation guidelines may differ, so the calibration might not be exact.
2. Since our taxonomy is aligned with MLCommons, MLCommons may release a test set for benchmarking in the future. Stay tuned.
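
For (1), something along these lines could work. This is a rough sketch that assumes the split and field names from the public BeaverTails dataset card (`30k_test`, `prompt`, `response`, `is_safe`) and a scorer like the one sketched earlier in this thread; given the taxonomy mismatch, treat the resulting curve as indicative rather than calibrated:

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import precision_recall_curve

# Assumes unsafe_probability(chat) -> float, e.g. the scorer sketched above.

ds = load_dataset("PKU-Alignment/BeaverTails", split="30k_test")

y_true, y_score = [], []
for ex in ds.select(range(500)):  # subsample for a quick first pass
    chat = [
        {"role": "user", "content": ex["prompt"]},
        {"role": "assistant", "content": ex["response"]},
    ]
    y_true.append(0 if ex["is_safe"] else 1)  # positive class = unsafe
    y_score.append(unsafe_probability(chat))

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# One way to pick an operating point: maximize F1 over the candidate thresholds.
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-9, None)
best = np.argmax(f1[:-1])  # precision/recall have one more entry than thresholds
print(f"threshold={thresholds[best]:.3f} P={precision[best]:.3f} R={recall[best]:.3f}")
```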