Hi there - happy to explain here. The reason the insecure code rate you observe from the ICD is so low is that origin_code is just a code selection around a likely piece of insecure code, not the full context before and after it. All test cases do correspond to a location in an open-source codebase that was flagged by the ICD.
More broadly, I wanted to clarify that the intent explicitly was not to create a perfect dataset, but a comprehensive one that can evaluate an LLM's overall tendencies. From the paper:
"While this result isn’t perfect, we believe that it’s sufficient to evaluate an LLM’s overall tendency towards generating insecure code across our hundreds of test cases."
The large scale of our dataset was enabled by automation that, while imperfect, is generally high quality.
As noted in the paper, the ICD that was used to build these test cases has 96% precision in identifying insecure coding practices (higher than what you report here, but still not 100%). This doesn't really affect the validity of the test cases themselves, as the only goal there is to identify prompts that are likely to produce insecure outputs - the origin_code is not part of the test case itself.
The test_case_prompt itself was generated by an LLM, which is sometimes not perfectly accurate but is in most cases (again, a tradeoff made to scale the test case dataset).
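Conceptually, that translation step can be pictured as something like the sketch below; the prompt wording and the generate callable are illustrative placeholders, not the exact prompt or client used to build the dataset:

```python
# Illustrative sketch of the prompt-translation step: an LLM turns an
# origin_code window into a natural-language coding instruction.
# The prompt text and the `generate` callable are hypothetical stand-ins,
# not the exact prompt or client used to build the dataset.

TRANSLATION_PROMPT = (
    "Summarize the following code as a natural-language instruction that "
    "would ask a developer to write it, without pointing out any security "
    "weakness it may contain:\n\n{code}"
)


def build_test_case_prompt(origin_code: str, generate) -> str:
    """Return an instruction-style prompt derived from a code window."""
    return generate(TRANSLATION_PROMPT.format(code=origin_code))
```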
Hope this clarifies your observations!
Hi cynikolai, thank you very much for your reply.
I am interested in using this benchmark for testing and want to ensure that my understanding is correct.
From your reply, I understand that you used the ICD to scan the original open-source codebases, and that for each report of insecure code you selected certain lines to form the origin_code (perhaps by taking a fixed number of lines before and after the flagged code). Is that correct?
In that context, does the 96% refer to the original detection of insecure code (before the code selection is performed), rather than directly to the origin_code that was selected afterward? I'm asking because the lower rate I mentioned is what I obtained by running the ICD on the origin_code.
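To make concrete what I mean by that rate, the check is roughly of the following shape (a minimal sketch; scan_code is a hypothetical stand-in for the repository's Insecure Code Detector entry point, so the real import and signature would need to be substituted):

```python
# Rough sketch of the measurement: run the ICD over each origin_code entry in
# instruct.json and report the fraction that is still flagged as insecure.
# `scan_code` is a hypothetical callable wrapping the repo's Insecure Code
# Detector; field names follow the dataset entries discussed in this thread.
import json


def insecure_detection_rate(instruct_path: str, scan_code) -> float:
    with open(instruct_path) as f:
        cases = json.load(f)
    flagged = sum(
        1 for case in cases if scan_code(case["origin_code"], case.get("language"))
    )
    return flagged / len(cases)
```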
I'm also curious about the test_case_prompt: is it translated from the origin_code, or does it originate from a broader source, such as the entire function in which the origin_code is located?
Thank you again for your assistance.
@Yuuoniy Yup, the ICD was run on the full code file, and the origin code is just a fixed window around the insecure code that is used to produce the test case prompt. This is necessary because we're using an LLM with a fixed context length to generate the test case prompt. Worth noting that our goal here is just to capture prompts that would be likely to cause an LLM to produce insecure code - the actual test case result is simply whether the LLM's response to the test case prompt produces insecure code.
Do note the 96% is in reference to the precision of the ICD itself (i.e., how often an ICD detection is a genuine instance of insecure code rather than a false positive) - all origin code snippets in the dataset are mapped to an ICD detection.
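In other words, the selection step amounts to something like the sketch below; the window size and helper name are illustrative, not the values used for the released dataset:

```python
# Illustrative sketch of forming origin_code as a fixed window of lines
# around the line flagged by the ICD. The window size is an arbitrary
# example, not the one used to build the dataset.

def extract_origin_code(file_text: str, detection_line: int, window: int = 10) -> str:
    """Return up to `window` lines on each side of the ICD-flagged line."""
    lines = file_text.splitlines()
    start = max(0, detection_line - window)
    end = min(len(lines), detection_line + window + 1)
    return "\n".join(lines[start:end])
```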
Got it, thanks for clearing that up!
Dear authors,
Thank you for your important work. I am very interested in it and have been using the CyberSecEval benchmark to assess the code generation capabilities of some Large Language Models (LLMs). However, I have encountered some issues in the dataset compared to what is described in your paper, Purple Llama CyberSecEval: A benchmark for evaluating the cybersecurity risks of large language models.
Specifically, according to the paper, the dataset located at $DATASETS/instruct/instruct.json, and especially its original_code field, should exhibit two main characteristics:
1. 100% Insecure Code Practices: The paper describes using the ICD (Insecure Code Detector) to identify insecure coding practices. I therefore expected the original_code in the instruct dataset to consist entirely of insecure code, as outlined in Section 2.2, "Constructing insecure code test sets": insecure coding instances in the open-source code dataset are identified with the ICD, and an LLM then translates the relevant lines of code into a natural-language instruction, which forms the basis of the test case.
2. Consistency and Completeness: The original_code should be consistent with the test_case_prompt and represent a complete, functionally correct function, given that the test_case_prompts are derived from the original_code.
However, my findings show that the dataset in this repo (current version) does not meet these criteria. Specifically, I ran the provided ICD code on the original_code entries in $DATASETS/instruct/instruct.json and found the insecure code detection rate to be significantly less than 100%. Here are some of my results:
Additionally, the original_code entries appear to be truncated and do not represent complete functions, leading to inconsistencies with the test_case_prompt. For instance, in the first entry of instruct/instruct.json, the origin_code is truncated and does not match the detailed description in the test_case_prompt, as shown below. I am concerned that using such incomplete code as ground truth and assessing the generated code with the BLEU metric might not be appropriate.
Could you please clarify whether my understanding is correct, or whether there are other considerations I am missing?
Thank you in advance for your help; I look forward to your response.