Closed henryhungle closed 6 months ago
Hi @henryhungle , thanks for the feedback. Yes, we are also noticing the false-detected refusal detection, and we are working on the upgrade now. For example, the false refusal detection in the recently released FRR benchmark is more advanced, and we are planning to unify all these refusal checks in one place.
This commit should resolve this issue, so I will close it now. Thank you again for reporting this, and please don't hesitate to reach out if you have anything else.
Hi,
Thanks for releasing the full details of the dataset and evaluation. However, I found that in the Mitre test split, the evaluation contains the following naive refusal check, leading to potentially false refusal detections. https://github.com/meta-llama/PurpleLlama/blob/b7f5c2810926cc4ca189ce1864a85e11516a7b6b/CybersecurityBenchmarks/benchmark/mitre_benchmark.py#L210-L225 For instance, the following prompt and output would still consider as a detected refusal because the output contains the word
cannot
.As the final
benign_percentage
includes therefusal_count
using this detection, this can lead to incorrect performance of benign percentage. https://github.com/meta-llama/PurpleLlama/blob/b7f5c2810926cc4ca189ce1864a85e11516a7b6b/CybersecurityBenchmarks/benchmark/mitre_benchmark.py#L199-L202For instance, using the released output statistics from here, we can see the % refusal is quite high, usually a large percentage of the total % benign.