Open Source Software Compromises Dataset

This is an effort to create a comprehensive dataset of open source software compromises. The intention is to help parties that want to prevent and mitigate open source software compromises.

All contributions are welcome. The initial effort will focus only on collecting data on open source software compromises that happened after November 1, 2022. This is an experimental effort housed within the OpenSSF integrity working group.

Inclusion Criteria, or What is an Open Source Software Compromise?

Compromises ought to be included in this dataset if both conditions (1) and (2) are met. “Compromise” implies that an attack has actually occurred.

Condition 1: The compromise arises from a vulnerability, introduced unintentionally or maliciously, in the open source software supply chain.

Condition 2: The compromise has a high impact. “High impact” means that “many” parties are affected, especially parties associated with “critical infrastructure,” or that the compromise results in “severe damage,” or both.

Alternatively, vulnerabilities without an associated compromise ought to be included if the potential impact is vast and there is a high likelihood of undetected compromises, e.g. Heartbleed.

Who Is Responsible for Maintaining This Dataset?

This is a strictly volunteer effort. A set of maintainers has personally volunteered in the past to maintain separate, related datasets, so it is likely that the same group will continue to devote time to maintaining this one. Others are welcome to contribute too.

What is the Recommended Timeline for Announcing Compromises?

This dataset is meant to capture only publicly reported compromises. So there must be publicly available information about the compromise. It is not intended to be a source of zero-days or otherwise undisclosed vulnerabilities.

How Do I Submit a Compromise?

Make a PR and place a new YAML file into the compromises folder. Each compromise is associated with one YAML file. This project uses a specific structure, described below, for data collection purposes. Fields marked as not required are optional. For an example YAML file, see the '1-dydx.yaml' file in the compromises folder.

Note on naming the file: Use id-name.yaml, where id is an integer one greater than the highest existing id number and name is a short string related to the attack.
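The naming rule can be sketched in a few lines of Python. This is only an illustration, not tooling that ships with the project: the helper name `next_entry_filename` is hypothetical, and starting at id 1 when no entries exist yet is an assumption.

```python
import re

# Hypothetical pattern for entry files named "<id>-<name>.yaml".
_ENTRY_NAME = re.compile(r"^(\d+)-(.+)\.yaml$")


def next_entry_filename(existing_filenames, name):
    """Build "<id>-<name>.yaml" with id one greater than the highest existing id."""
    ids = []
    for filename in existing_filenames:
        match = _ENTRY_NAME.match(filename)
        if match:  # files that do not follow the pattern are ignored
            ids.append(int(match.group(1)))
    # Assumption: numbering starts at 1 when the folder has no entries yet.
    next_id = max(ids, default=0) + 1
    return f"{next_id}-{name}.yaml"
```

In practice the list of existing names could come from something like `os.listdir("compromises")`. Ids are compared as integers, so `10-foo.yaml` sorts above `2-bar.yaml`.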

Field	Required	Type	Description
compromise-name	Yes	string	A short, descriptive name for the attack. Err on the side of a widely recognizable name.
description	Yes	string	Provide a description of the attack. Several sentences will often be adequate.
compromise-classification	Yes	string (can use a sequence)	Use attack class labels from the attack tree in this paper (https://arxiv.org/abs/2204.04008), creating a separate label for each node that applies to the attack, to the best of available knowledge. See attack-tree.md for a copy of the tree.
cwe	No	string	If applicable, add the appropriate CWE. https://cwe.mitre.org/
mitre-attack	No	string	If applicable, add the appropriate ATT&CK label. https://attack.mitre.org
ecosystem	Yes	string	The open source ecosystem associated with the attack, e.g. PyPI.
date-earliest-evidence-of-compromise	Yes	string	Appropriate formats include: YYYY-mm or YYYY-mm-dd
date-entry-was-created	Yes	string	YYYY-mm or YYYY-mm-dd
references	Yes	string (can use a sequence)	Any references, especially URLs, with information on the attack.
malicious-intent	Yes	string	"yes" or "no"
packages-affected	Yes	string list	List of packages affected
IOCs	No	TBD	Should list rule name, rule type, and rule specification.
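As a rough illustration of how the table above could be checked before opening a PR, here is a minimal Python sketch. It assumes the YAML file has already been parsed into a dict; the function name `validate_entry` and every value in `EXAMPLE_ENTRY` are hypothetical placeholders, not a real compromise or real attack-tree labels. It checks only for required fields, the date shape, and the "yes"/"no" value.

```python
import re

# Required fields, copied from the table above.
REQUIRED_FIELDS = (
    "compromise-name",
    "description",
    "compromise-classification",
    "ecosystem",
    "date-earliest-evidence-of-compromise",
    "date-entry-was-created",
    "references",
    "malicious-intent",
    "packages-affected",
)
DATE_FIELDS = ("date-earliest-evidence-of-compromise", "date-entry-was-created")
# Shape check only: YYYY-mm or YYYY-mm-dd (does not reject e.g. February 31).
DATE_PATTERN = re.compile(r"^\d{4}-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?$")


def validate_entry(entry):
    """Return a list of problems found in a parsed entry; empty means none found."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in entry
    ]
    for name in DATE_FIELDS:
        value = entry.get(name)
        # Dates are expected as strings; some YAML parsers turn unquoted
        # dates into date objects, so quoting them in the file may be needed.
        if value is not None and not (
            isinstance(value, str) and DATE_PATTERN.match(value)
        ):
            problems.append(f"{name} should look like YYYY-mm or YYYY-mm-dd")
    intent = entry.get("malicious-intent")
    if intent is not None and intent not in ("yes", "no"):
        problems.append('malicious-intent should be "yes" or "no"')
    return problems


# Hypothetical entry with placeholder values, for illustration only.
EXAMPLE_ENTRY = {
    "compromise-name": "example-compromise",
    "description": "Placeholder description of a hypothetical attack.",
    "compromise-classification": ["placeholder-attack-tree-label"],
    "ecosystem": "PyPI",
    "date-earliest-evidence-of-compromise": "2023-01",
    "date-entry-was-created": "2023-01-15",
    "references": ["https://example.com/advisory"],
    "malicious-intent": "yes",
    "packages-affected": ["example-package"],
}
```

Calling `validate_entry(EXAMPLE_ENTRY)` returns an empty list. The optional fields (cwe, mitre-attack, IOCs) are deliberately left unchecked, since the table does not yet fix their format.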

Note: This is an experimental effort. If you find conceptual or practical problems with the data fields, please raise them in an issue. Revising the data fields is a likely outcome of this initial effort.
