sheli-kohan opened this issue 1 month ago
Thanks for flagging @sheli-kohan!
@albertodepaola Can you help take a look please?
Hi,
It seems that the notebook correctly calls `build_default_prompt(AgentType.USER, create_conversation(....), LlamaGuardVersion.LLAMA_GUARD_3.name)`. However, the prompt matches the Llama Guard 2 format, which might be one source of the issue. You can check it here: https://github.com/meta-llama/llama-recipes/blame/main/src/llama_recipes/inference/prompt_format_utils.py#L61
I've tried using PROMPT_INSTRUCTION as in the Llama Guard 3 model card, but reached an AUCPR of only 45%.
@sheli-kohan Thank you very much for digging into the source and pointing this out! I will take a look.
> i've tried to use PROMPT_INSTRUCTION as in Llama guard 3 model card but reach only AUCPR of 45%

Do you mean that using the correct special tokens still doesn't give the right result?
I've updated the prompt format to be compatible with Llama Guard 3, instead of Llama Guard 2.
I believe the other differences stem from the way `parse_logprobs(prompts, type: Type)`
calculates the class probabilities. Currently, it uses `prompt["logprobs"][0][1]`
for this calculation. However, I would expect the calculation to focus on the probability of the token 'safe'; or, if the output is unsafe, on the violation category number that appears after the 'S' token; or, in the case of binary classification, on the 'unsafe' token.
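To make the parsing I'm describing concrete, here is a minimal sketch (the helper name is mine, and the output shape assumes the `safe` / `unsafe` plus `S<k>` category format shown in the Llama Guard model cards):

```python
def parse_guard_output(text: str):
    """Parse a Llama Guard generation into (is_unsafe, categories).

    Assumes the model emits either 'safe', or 'unsafe' followed by a
    comma-separated list of category codes (e.g. 'S1,S10') on the next line.
    """
    lines = text.strip().split("\n")
    if lines[0].strip().lower() == "safe":
        return False, []
    categories = []
    if len(lines) > 1:
        categories = [c.strip() for c in lines[1].split(",") if c.strip()]
    return True, categories
```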
However, I didn't find a description of how you calculate AUCPR in the Llama Guard paper.
The current use of `prompt["logprobs"][0][1]`
would only partially apply if I were still using Llama Guard 2.
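For example, a minimal sketch of the binary-score calculation I would expect, using only the 'safe' and 'unsafe' token probabilities (the flat dict of first-token logprobs here is illustrative, not the notebook's actual `prompt["logprobs"]` layout):

```python
import math

def unsafe_score(first_token_logprobs: dict) -> float:
    """Estimate P(unsafe) from the logprobs of the first generated token,
    renormalizing over just the 'safe' and 'unsafe' candidates."""
    p_safe = math.exp(first_token_logprobs.get("safe", float("-inf")))
    p_unsafe = math.exp(first_token_logprobs.get("unsafe", float("-inf")))
    total = p_safe + p_unsafe
    if total == 0.0:
        return 0.0  # neither token among the candidates; no signal
    return p_unsafe / total
```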
I would appreciate your input on this. Thanks, Sheli
Hi,
I encountered the same problem! I tried to reproduce the Llama Guard 3 evaluation results using the provided examples and got an AP of 24%. It looks to me like the model output is wrong when compared to the ground-truth (GT) labels.
Any help on this would be highly appreciated.
Thanks in advance,
M
cc @albertodepaola
@init27 When I used the "safe" and "unsafe" token probabilities and fixed the prompt for Llama Guard 3, I reached an AUCPR of 50%, still lower than the 62% reported for Llama Guard 2.
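For anyone comparing these numbers, here is a self-contained sketch of average precision (the AP/AUCPR metric quoted in this thread), following the standard definition also used by `sklearn.metrics.average_precision_score`; labels are 1 for unsafe, scores are P(unsafe):

```python
def average_precision(labels, scores):
    """AP = mean of precision@k over the ranks k where a positive occurs,
    with items ranked by score in descending order."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    true_pos = 0
    ap_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            ap_sum += true_pos / rank
    return ap_sum / total_pos if total_pos else 0.0
```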
Thanks for raising this issue @sheli-kohan. Indeed, it does seem to be an issue with the notebook. We will endeavor to work out what's going on here, and hope to update the notebook in due course. Can I ask that you open a PR with the modifications you have made that improved things so far? I'll discuss with colleagues how we can work out the issue here. Thanks again for looking at this. @tryrobbo (author of the LlamaGuard notebook)
Hi @tryrobbo, thanks for your assistance.
Here is a PR with some bug fixes and additional category-wise evaluation code (zero-shot only). Keep in mind that the suggested code still does not reach the desired metrics.
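As a sketch of what the category-wise part does conceptually (the `(category, label, score)` row shape here is hypothetical, not the PR's actual data structures):

```python
from collections import defaultdict

def group_by_category(rows):
    """Group (category, label, score) rows into per-category label/score
    lists, ready to feed one-vs-all into an average-precision function."""
    groups = defaultdict(lambda: ([], []))
    for category, label, score in rows:
        groups[category][0].append(label)
        groups[category][1].append(score)
    return dict(groups)
```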
The success criterion for solving this bug is reaching these metrics:
System Info
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.2
[pip3] torch==2.0.1+cu118
[pip3] torchaudio==2.0.2+cu118
[pip3] torchvision==0.15.2+cu118
🐛 Describe the bug
I am currently evaluating the Llama Guard 3 model using the evaluation notebook provided in the llama-recipes repo: Llama Guard Customization via Prompting and Fine-Tuning.
When I ran the evaluation on the ToxicChat dataset, I observed an average precision of 30.20%, with the following configuration: split="test".
However, I noticed a discrepancy when comparing this result to the Llama Guard Model Card, which reports an average precision of 62.6%. Even though that metric refers to Llama Guard, I believe this degradation indicates some error in the notebook.
On another matter, we are also failing to replicate the paper's results for the OpenAI Moderation Evaluation dataset by category (Figure 2 in the paper). If you are able to share the library or code you used for this evaluation, that would be very helpful.
Could you please provide any insights or guidance on this difference in performance?
Thank you for your time and assistance.
Best regards,
Sheli Kohan
Error logs
average precision 30.02%
Expected behavior
average precision 62.6%