meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama 3 with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A. Also supports a number of inference solutions, such as HF TGI and vLLM, for local or cloud deployment. Demo apps showcase Meta Llama 3 for WhatsApp & Messenger.

Clarification on Evaluation Results for Llama Guard 3 #633

Open sheli-kohan opened 1 month ago

sheli-kohan commented 1 month ago

System Info

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.2
[pip3] torch==2.0.1+cu118
[pip3] torchaudio==2.0.2+cu118
[pip3] torchvision==0.15.2+cu118

Information

🐛 Describe the bug

I am currently evaluating the Llama Guard 3 model using the evaluation notebook provided in the llama-recipes repo: Llama Guard Customization via Prompting and Fine-Tuning.

When I ran the evaluation on the ToxicChat dataset, I observed an average precision of 30.20%. This was with the following configuration: split="test".
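For reference, a minimal sketch of loading that evaluation split; the config name ("toxicchat0124") and column names are taken from the ToxicChat dataset card on Hugging Face and may differ from what the notebook actually pins.

```python
# Sketch: load the ToxicChat test split and extract binary toxicity labels.
# Assumes the "toxicchat0124" config and the "user_input"/"toxicity" columns
# from the dataset card; the notebook may use a different config or revision.
from datasets import load_dataset

toxic_chat = load_dataset("lmsys/toxic-chat", "toxicchat0124", split="test")
user_prompts = toxic_chat["user_input"]   # prompts sent to Llama Guard 3
labels = toxic_chat["toxicity"]           # 1 = toxic/unsafe, 0 = benign
print(len(user_prompts), sum(labels))
```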

However, I noticed a discrepancy when comparing this result to the Llama Guard Model Card, which reports an average precision of 62.6%. Even though that figure refers to the original Llama Guard, I believe this degradation indicates an error in the notebook.

On a related note, we are also failing to replicate the paper's per-category results on the OpenAI moderation evaluation dataset (Figure 2 in the paper). If you could share the library or code you used for that evaluation, it would be very helpful.

Could you please provide any insights or guidance on this difference in performance?

Thank you for your time and assistance.

Best regards,

Sheli Kohan

Error logs

average precision 30.02%

Expected behavior

average precision 62.6%

init27 commented 1 month ago

Thanks for flagging @sheli-kohan!

@albertodepaola Can you help take a look please?

sheli-kohan commented 4 weeks ago

hi,

It seems that the notebook correctly calls build_default_prompt(AgentType.USER, create_conversation(....), LlamaGuardVersion.LLAMA_GUARD_3.name). However, it looks like the resulting prompt matches the Llama Guard 2 format, which might be one source of the issue. You can check it here: https://github.com/meta-llama/llama-recipes/blame/main/src/llama_recipes/inference/prompt_format_utils.py#L61

I've tried using PROMPT_INSTRUCTION as in the Llama Guard 3 model card, but I only reach an AUCPR of 45%.
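A quick way to confirm which template the helper actually emits is to print it. This is a sketch using the names cited above, and it assumes create_conversation accepts a list of message strings as in the notebook; the sample message is hypothetical.

```python
# Sketch: print the prompt built by the notebook's helper, to check whether it
# uses the [INST]-style wrapper from Llama Guard 2 or the Llama 3 header tokens
# (<|start_header_id|> ... <|eot_id|>) expected by Llama Guard 3.
from llama_recipes.inference.prompt_format_utils import (
    build_default_prompt,
    create_conversation,
    AgentType,
    LlamaGuardVersion,
)

prompt = build_default_prompt(
    AgentType.USER,
    create_conversation(["How do I pick a lock?"]),  # hypothetical sample message
    LlamaGuardVersion.LLAMA_GUARD_3.name,
)
print(prompt)  # inspect the wrapper tokens and the unsafe-content category list
```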

init27 commented 4 weeks ago

@sheli-kohan Thank you very much for digging into the source and pointing this out! I will take a look.

I've tried using PROMPT_INSTRUCTION as in the Llama Guard 3 model card, but I only reach an AUCPR of 45%.

Do you mean that using the correct special tokens still doesn't give the right result?

sheli-kohan commented 4 weeks ago

I've updated the prompt format to be compatible with Llama Guard 3, instead of Llama Guard 2.

I believe the remaining differences stem from the way parse_logprobs(prompts, type: Type) calculates the class probabilities. Currently it uses prompt["logprobs"][0][1] for this calculation. However, I would expect the calculation to focus on the 'safe' token, or, when the output is unsafe, on the violated category number that appears after the 'S' token (or, in the binary-classification case, on the 'unsafe' token). But I didn't find a reference in the Llama Guard paper to how AUCPR is calculated.

The current use of prompt["logprobs"][0][1] would only partially apply if I were still using Llama Guard 2.

I would appreciate your input on this. Thanks, Sheli
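To illustrate the scoring described above, here is a minimal sketch. It assumes each generation exposes the top log-probabilities of the first output token as a {token: logprob} dict; that structure and the example data are hypothetical, and the notebook's parse_logprobs works on prompt["logprobs"][0][1] instead.

```python
# Sketch: binary scoring from the first generated token's log-probabilities.
# The probability mass on "unsafe", renormalized against "safe", is used as the
# positive-class score, and average precision (AUCPR) is computed from it.
import math
from sklearn.metrics import average_precision_score

def unsafe_score(top_logprobs: dict) -> float:
    """Return P(unsafe) renormalized over the 'safe'/'unsafe' tokens."""
    p_safe = math.exp(top_logprobs.get("safe", float("-inf")))
    p_unsafe = math.exp(top_logprobs.get("unsafe", float("-inf")))
    total = p_safe + p_unsafe
    return p_unsafe / total if total > 0 else 0.0

# Hypothetical example data: top log-probs of the first generated token per
# example, plus ground-truth labels (1 = unsafe).
first_token_logprobs = [
    {"safe": -0.05, "unsafe": -3.2},
    {"unsafe": -0.10, "safe": -2.5},
]
labels = [0, 1]

scores = [unsafe_score(lp) for lp in first_token_logprobs]
ap = average_precision_score(labels, scores)
print(f"average precision {ap:.2%}")
```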

MLRadfys commented 3 weeks ago

Hi,

I encountered the same problem! I tried to reproduce the Llama Guard 3 evaluation results using the provided examples and got an AP of 24%. To me it looks like the model output is wrong when compared to the ground-truth labels.

Any help on this would be highly appreciated.

Thanks in advance,

M

HamidShojanazeri commented 3 weeks ago

cc @albertodepaola

sheli-kohan commented 3 weeks ago

@init27 When I used the "safe" and "unsafe" token probabilities and fixed the prompt for Llama Guard 3, I reached an AUCPR of 50%, still lower than the reported 62% for Llama Guard 2.

tryrobbo commented 2 weeks ago

Thanks for raising this issue @sheli-kohan. Indeed, it does seem to be an issue with the notebook. We will endeavor to work out what's going on here and hope to update the notebook in due course. Could you open a PR with the modifications that have produced an improvement so far? I'll discuss with colleagues how we can work out the issue here. Thanks again for looking at this. @tryrobbo (author of the Llama Guard notebook)

sheli-kohan commented 2 weeks ago

Hi @tryrobbo , thanks for your assistance.

Here is a PR with some bug fixes and additional category-wise evaluation code (zero-shot only). Keep in mind that the suggested code still does not reach the desired metrics.

The success criterion for resolving this bug is reaching these metrics:

[Screenshot: target evaluation metrics]