CybersecurityBenchmarks | Request for Open Source Code Dataset and Clarification on Regex Rule Creation for ICD

omerHofBGU commented 4 months ago

Hi,

Thank you for sharing this excellent project!

I am interested in contributing to its expansion by creating additional rules for insecure code and extending the 'instruct' and 'autocomplete' files.

Could you please share the open-source code dataset that you used to extract the insecure code using the predefined rules? I have not been able to locate it. This would help me verify my rules and export insecure code snippets to the respective files. While I understand that any open-source dataset could be used, I believe it would be more appropriate to use the original data you utilized.

Additionally, could you explain how you created the regex rules from MITRE? Specifically:

How did you select the 50 CWEs? MITRE provides hundreds of CWEs, but you focused on 50. Was there a specific logic or set of criteria behind the selection? If so, knowing it would help me select the next most 'important' rules.
Were these rules manually crafted by a cybersecurity expert, or were they generated using an LLM?

Thank you!

l33tm3 commented 4 months ago

+1

cynikolai commented 3 months ago

Hi, thanks for your interest. The code dataset we used is massive, but https://raw.githubusercontent.com/meta-llama/PurpleLlama/main/CybersecurityBenchmarks/datasets/third-party.txt lists all the repos we scraped.

@csahana95 can speak more to CWE selection, but I believe this was a mix of priority, our ability to detect a CWE reliably, and diversity across the space.
The rules themselves in ICD, while not 100% precise, were curated by security experts rather than simply generated by an LLM.
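
As a rough illustration of why a regex rule can be "not 100% precise", here is a minimal sketch. The rule below is hypothetical and written only for this example; it is not taken from the ICD rule set, and `WEAK_HASH_RULE`, `find_matches`, and `SAMPLE` are made-up names. It flags calls to weak hash functions (CWE-328) and also matches a comment, which is a false positive.

```python
import re

# Hypothetical rule, for illustration only; NOT one of the actual ICD rules.
# It flags calls to hashlib.md5 / hashlib.sha1 (weak hashes, CWE-328).
WEAK_HASH_RULE = re.compile(r"\bhashlib\.(md5|sha1)\s*\(")


def find_matches(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs where the rule matches."""
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(source.splitlines(), start=1)
        if WEAK_HASH_RULE.search(line)
    ]


# Made-up sample: one real finding, one comment, one safe call.
SAMPLE = '''\
import hashlib
digest = hashlib.md5(password).hexdigest()
# legacy: hashlib.sha1(data) was replaced in v2
safe = hashlib.sha256(data).hexdigest()
'''

matches = find_matches(SAMPLE)
for lineno, line in matches:
    print(lineno, line)
```

The rule reports line 2 (a genuine weak-hash call) and line 3 (only a comment), and skips the `sha256` call. Running candidate rules over known-good and known-bad snippets like this is one way to check a new rule before contributing it.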

Thank you for your interest!
