Open Arinbjarnar opened 2 months ago
Thanks for flagging! I can see how this may be confusing, as 'more capable' is usually equated with 'good'; however, the color coding here is intentional. In this case, we are measuring model capability on an offensive cyber task, so a higher value indicates a higher level of cybersecurity risk introduced by the model (i.e., higher value = 'bad' from this perspective).
Many thanks for making this feature available. It's a great help.
I wanted to let you know that your HuggingFace page, "CyberSecEval: Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models (LLMs)", has an apparent inconsistency in how high and low values are presented.
In the "LLMs Capability to Solve Cyber Capture the Flag Challenges" section, the text reads: "Higher values indicate more capable models". However, the table shows higher values in red and lower values in blue, which makes it unclear whether high values are good or bad.