meta-llama / PurpleLlama

Set of tools to assess and improve LLM security.

CybersecurityBenchmarks | Request for Open Source Code Dataset and Clarification on Regex Rule Creation for ICD #44

Closed: omerHofBGU closed this issue 3 months ago

omerHofBGU commented 4 months ago

Hi,

Thank you for sharing this excellent project!

I am interested in contributing to its expansion by creating additional rules for insecure code and extending the 'instruct' and 'autocomplete' files.

Could you please share the open-source code dataset that you used to extract the insecure code using the predefined rules? I have not been able to locate it. This would help me verify my rules and export insecure code snippets to the respective files. While I understand that any open-source dataset could be used, I believe it would be more appropriate to use the original data you utilized.
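To make the verification step concrete, here is a minimal sketch of checking one regex rule against a code snippet. The `Rule` class, the rule ID, the pattern, and the sample snippet are all hypothetical and only illustrate the idea; they are not taken from the project's actual rule files or detector API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for one regex-based insecure-code rule."""
    rule_id: str
    cwe: str
    pattern: re.Pattern


# Hypothetical rule: flag calls to weak hash constructors in Python code.
WEAK_HASH_RULE = Rule(
    rule_id="hypothetical-weak-hash",
    cwe="CWE-328",
    pattern=re.compile(r"\bhashlib\.(md5|sha1)\s*\("),
)


def find_matches(rule: Rule, source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for every line the rule matches."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if rule.pattern.search(line)
    ]


# Made-up snippet standing in for a file pulled from an open-source repo.
SNIPPET = """import hashlib
digest = hashlib.md5(data).hexdigest()
safe = hashlib.sha256(data).hexdigest()
"""

hits = find_matches(WEAK_HASH_RULE, SNIPPET)
for number, line in hits:
    print(f"{WEAK_HASH_RULE.cwe} line {number}: {line}")
```

Running a new rule over known-insecure and known-safe snippets like this is one way to spot false positives before exporting matches to the 'instruct' and 'autocomplete' files.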

Additionally, could you explain how you created the regex rules from MITRE? Specifically:

Thank you!

l33tm3 commented 4 months ago

+1

cynikolai commented 3 months ago

Hi, thanks for your interest. The code dataset we used is massive, but https://raw.githubusercontent.com/meta-llama/PurpleLlama/main/CybersecurityBenchmarks/datasets/third-party.txt lists all the repos we scraped.

Thank you for your interest!
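For anyone who wants to work from the list linked above, a rough sketch of loading it might look like the following. It assumes the file is plain text with one entry per line, which has not been confirmed here; `parse_repo_list` and `fetch_repo_list` are hypothetical helper names, not part of the project.

```python
from urllib.request import urlopen

# URL quoted in the comment above.
THIRD_PARTY_URL = (
    "https://raw.githubusercontent.com/meta-llama/PurpleLlama/main/"
    "CybersecurityBenchmarks/datasets/third-party.txt"
)


def parse_repo_list(text: str) -> list[str]:
    """Split text into unique, non-blank entries, keeping first-seen order.

    Assumes one entry per line; adjust if the real file is laid out differently.
    """
    entries: list[str] = []
    seen: set[str] = set()
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line and line not in seen:
            seen.add(line)
            entries.append(line)
    return entries


def fetch_repo_list(url: str = THIRD_PARTY_URL, timeout: float = 30.0) -> list[str]:
    """Download the list over HTTPS and parse it (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_repo_list(response.read().decode("utf-8"))
```

Calling `fetch_repo_list()` returns the entries as a list of strings, which could then drive a clone-and-scan loop for testing new rules.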