ossf / criticality_score

Gives criticality score for an open source project
Apache License 2.0

Doc: Criticality Score and Security Risk, Improving Criticality Score. #102

Open calebbrown opened 2 years ago

calebbrown commented 2 years ago

OSS Criticality Score and Security Risk

Last Updated: 2022-02-23 Status: Draft

Goal

  1. Evaluate the quality of the existing score produced by the criticality_score project.
  2. Propose changes to how the criticality score is determined to improve the overall quality.

Non-goals

Changes to infrastructure and the code, tools and systems used for gathering and processing signals will be covered elsewhere.

Background

Security Risk Rating

Rating the risk of a given event is usually determined by the basic formula:

$$risk = impact \times likelihood$$

Where $likelihood$ is defined as how frequently the event may occur, and $impact$ is the cost incurred when the event occurs (see Mozilla's documentation for a reasonable summary).

For security, risk ratings are usually based on predefined frameworks, such as CVSS or OWASP's Risk Rating Methodology.

Criticality Projects

The OpenSSF has a Working Group (WG) focused on Securing Critical Projects. A key part of this WG is focused on determining which Open Source projects are "critical". Critical Open Source projects are those which are broadly depended on by organizations, and present a security risk to those organizations, and their customers, if they are not supported.

Currently, inputs for determining critical projects come from domain experts, research (the Harvard Census), and the criticality_score project that this document addresses.

Criticality Score Today

At the time of writing, the criticality_score project is a Python library, based around parsing data from the GitHub API (with some GitLab support).

The output of the criticality_score project is a CSV file of input signals and a resulting "criticality score": a number between 0 and 1 based on an aggregation of various weighted signals. The score is derived using an algorithm described by Rob Pike.
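For reference, the aggregation works roughly as follows (a minimal sketch of the weighted, log-scaled form described in the project README; the function and example values are illustrative, not the project's actual API):

```python
import math

def criticality_score(signals):
    """Sketch of the weighted log-scaled aggregation.

    `signals` is a list of (value, weight, threshold) tuples: each signal
    value S is scaled as log(1 + S) / log(1 + max(S, T)) so it saturates
    at 1 once it passes its threshold T, then a weighted average is taken.
    """
    total_weight = sum(weight for _, weight, _ in signals)
    total = 0.0
    for value, weight, threshold in signals:
        total += weight * math.log(1 + value) / math.log(1 + max(value, threshold))
    return total / total_weight

# Illustrative values: (signal value, weight, threshold)
print(criticality_score([
    (120, 1.0, 1000),      # e.g. contributor count
    (30, 2.0, 1000),       # e.g. commit frequency
    (50000, 0.5, 500000),  # e.g. dependent count
]))
```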

Main problems

The current implementation unfortunately suffers from some data issues. The following are the key issues. These were found by exploring and evaluating the output data (specifically the all.csv file linked from the repo).

Other observations about the quality of the score:

Proposal: View Criticality Score as Risk

How can the criticality score be improved?

At the heart of the criticality score and Securing Critical Projects WG is finding the Open Source projects that pose the highest security risk so that limited resources can be focused on supporting those projects.

Reframing criticality around risk provides a framework for evaluating which of the various signals available may be suitable for calculating a criticality score. Currently, "impact" is effectively the only measure the criticality score represents, as each of the current signals contributes to finding the most active and popular projects. By incorporating a "likelihood" the score can be improved to surface projects that may be easier to exploit than other projects.

Impact

Impact is usually defined as the cost (financial, reputation, etc) incurred by an organization and its users/customers if a particular event occurs.

Ideally, to accurately assign the impact of a given Open Source project being compromised we would enumerate every instance where the project is used, all dependents (platform, supply chain, or code) and the systems and data affected.

For example: Xpdf's JBIG2 code is included in Apple's CoreGraphics and used on billions of iOS devices. Therefore a vulnerability in the Xpdf project would have a high impact. The NSO Group used a vulnerability in Xpdf as the basis of a zero-click RCE (see googleprojectzero.blogspot.com/2021/12/a-deep-dive-into-nso-zero-click.html).

Unfortunately, without access to an omniscient oracle, this is impossible. So other signals need to be used to infer the impact of an Open Source project being compromised. Such signals might include dependent counts, contributor counts, and project age.

In the future access to SBOMs, machine readable OSS license manifests and other sources of data more closely linked to how Open Source projects are being used may improve the accuracy of calculating impact.

Likelihood

Likelihood is usually defined as how frequently or likely a particular event occurs. In information security this covers aspects such as what preconditions are required for compromise (e.g. local/remote access, auth/unauth), and how easy or hard something is to exploit (e.g. default config).

Assigning the likelihood of a given Open Source project being compromised is difficult. Unlike CVSS, we are not scoring likelihood based on a known vulnerability. Instead, signals that measure the health and security of the project are used to determine how likely a compromise is to occur.

However, like CVSS, we do not know how every piece of software is used "in the wild", so any score provided can only be used to provide a very general indication of likelihood.

Additionally, the nature of attacks on large projects will differ from attacks on small projects. A large project (e.g. Chrome or Linux) is more likely to have accidental vulnerabilities introduced, while a small project is more susceptible to intentionally introduced vulnerabilities. This difference between accidental and intentional vulnerabilities makes comparing likelihood across projects harder.

Grouping Likelihood Signals

Beyond introducing a distinct "likelihood" component to the score, the likelihood signals can be categorized to provide a more nuanced ranking.

The two categories that might be chosen are:

  1. Maintainer Health
  2. Security Posture

Categorizing likelihood allows us to solve a key challenge of the criticality score - being able to distinguish between:

Furthermore, likelihood category values can be combined by treating each value as the probability of a compromise occurring in that category, and computing the probability of a compromise occurring in at least one of the categories (assuming the categories are independent):

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = P(A) + P(B) - P(A)P(B)$$
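A minimal sketch of this combination in Python (a hypothetical helper, assuming each category value is already a probability in [0, 1] and the categories are independent):

```python
def combine_likelihoods(category_probs):
    """Probability that a compromise occurs in at least one category,
    given independent per-category probabilities."""
    p_no_compromise = 1.0
    for p in category_probs:
        p_no_compromise *= (1.0 - p)
    return 1.0 - p_no_compromise

# e.g. maintainer-health likelihood 0.3 and security-posture likelihood 0.2
print(combine_likelihoods([0.3, 0.2]))  # 0.44 == 0.3 + 0.2 - 0.3 * 0.2
```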

Signals

Choosing Good Signals

The quality of the criticality score depends on the quality of the signals that go into producing the score. An ideal signal is clear and comparable across all projects.

Clear

Good signals should be clear and unambiguous in how they contribute to the security risk of the project. There should be a high signal to noise ratio.

Signals should apply clearly to either impact or one of the likelihood categories. For example, a high number of reported bugs is ambiguous: it could mean the project has many bugs, or simply that it has many contributors reporting them.

In some cases the signal may be improved by eliminating a confounding variable. In the example above, if we assume that reported_bugs = contributors * actual_bugs then dividing by contributors may return a more accurate count.
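A minimal sketch of that adjustment (hypothetical names; the model reported_bugs = contributors * actual_bugs is the simplification above):

```python
def adjusted_bug_count(reported_bugs, contributors):
    """Divide out the confounding contributor count, under the simplified
    model that reported bugs scale linearly with contributors."""
    return reported_bugs / max(contributors, 1)

print(adjusted_bug_count(reported_bugs=250, contributors=50))  # 5.0
```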

Additionally, a high value and a low value for a given signal shouldn't have the same meaning. For example, a high commit_frequency could indicate lots of new code, increasing likelihood, but a low commit_frequency might indicate the project is unmaintained, also increasing likelihood.

Comparable

For a global ranking to be produced, each signal should be comparable to the same signal from another project. Factors like age and release frequency can cause a project to be over- or under-represented in the resulting score.

Project Activity

How a project operates can cause large differences in the signals used to produce the criticality score.

For example, a small library may have many frequent small releases each month, whereas a large framework may have a few large releases once or twice a year.

Ecosystem Differences

The same signal may be systematically larger or smaller in one ecosystem than in another.

For example, projects with packages on NPM have far higher dependent counts than projects in other ecosystems. This is due to how NPM handles dependencies and the prevalence of tiny, single-feature packages in that ecosystem. Conversely, C and C++ lack a standard package management system and have no good way to determine dependent counts at all.

There are two ways to address this issue:

  1. Consider each ecosystem in isolation
  2. Normalize the impact and/or likelihood based on the ecosystems the project belongs to

Considering ecosystems in isolation helps ensure that one does not dominate all others; however, at some point either the per-ecosystem criticality scores need to be merged, or resources divided amongst the ecosystems.

Normalizing impact and/or likelihood is plausible, but hard to get correct. This likely involves analyzing the distribution of values for each ecosystem and finding a scaling factor that makes them comparable.
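For example, one plausible (hypothetical) normalization is to replace each raw value with its percentile rank within its own ecosystem:

```python
from collections import defaultdict

def percentile_rank_by_ecosystem(projects):
    """Map {project: (ecosystem, raw_value)} to {project: rank in [0, 1]},
    where the rank is computed only against projects in the same ecosystem.
    Illustrative sketch; a real pipeline would run over the full dataset.
    """
    values_by_eco = defaultdict(list)
    for eco, value in projects.values():
        values_by_eco[eco].append(value)

    ranks = {}
    for name, (eco, value) in projects.items():
        peers = values_by_eco[eco]
        ranks[name] = sum(1 for v in peers if v <= value) / len(peers)
    return ranks

print(percentile_rank_by_ecosystem({
    "express": ("npm", 2_000_000),
    "left-pad": ("npm", 500_000),
    "requests": ("pypi", 300_000),
}))
```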

Example: Comparability of Dependent Count by Ecosystem

Below is a cumulative density plot of dependent counts for projects, by language. The distributions are all similar, suggesting that each ecosystem could be compared with some normalization.

[Figure: cumulative distribution of project dependent counts by language]

Coverage

High coverage of a signal across projects is also important for being able to compare projects to each other. If only some projects have a particular signal, then the score may be biased away or towards those projects.

Candidate Signals

| Signal | Usage | Why | Availability | Signal Quality |
| --- | --- | --- | --- | --- |
| GitHub commit mentions | Impact | Mentioning a project in a commit indicates impact. Covers all ecosystems. | Yes | Very noisy signal, promotes irrelevant projects. Old commits should be ignored to prevent old projects being over-represented. An alternative to GitHub search may help here. |
| deps.dev dependent count | Impact | Each dependent is another project that a security issue impacts. | Yes | Coverage is a key limitation. Only a fraction of projects are mapped to GitHub repos. Normalization between ecosystems is necessary. Old package versions should be ignored to prevent old projects being over-represented. |
| Contributor count | Impact | More contributors indicates more interest, and more potential places where a security issue will have an impact. | Yes | High. Old contributors that have not contributed within a fixed time period should be ignored. |
| Project age | Impact | An older project has a higher chance of being more broadly deployed. | Yes | High. Max age should have an upper bound when using this signal. |
| Recently updated | Impact | Provides a balance to "age" by ensuring old but dead/deprecated projects are not promoted. | Yes | Care should be taken when setting upper and lower bounds on this signal so that it doesn't overly favor projects updated very recently. |
| Maintainer count (i.e. merging PRs, "write" access, etc) | Maintainer Health | A low "bus factor" and a lack of code review increase the likelihood of a security issue. | Unknown | High. |
| Corporate ownership | Maintainer Health | Corporate ownership lowers the chance of a maintainer-health-related security issue. | Unknown | Unknown. It is plausible that a corporate owner may neglect projects as much as any other owner. |
| Recent maintainer activity | Maintainer Health | PRs merged, issues closed or updated, etc. are all signs that a maintainer is actively working on their project. | Unknown | This signal needs to be normalized to account for the different behavior of each maintainer. |
| Last release age | Maintainer Health | A lack of any recent release may be a sign that a maintainer is not working on their project. | Maybe (GitHub releases, deps.dev data) | This signal needs to be normalized to account for the different behavior of each project. E.g. it may be worth calculating a "mean time between releases" and checking whether the current age of the last release is much older. |
| Lines of code | Security Posture | More code => more bugs. Bugs raise the chance of a security issue. | Yes | Must exclude non-code files (e.g. documentation). May need to normalize the signal based on language (e.g. C and C++ lack memory safety). |
| Issues reported | Security Posture | More issues may mean more bugs. Bugs raise the chance of a security issue. | Yes | Coverage is a limitation. Some projects host their issue tracker separately to their source repository. Needs to be normalized to eliminate the impact the number of contributors has. Issue age needs to be limited. |
| Scorecards score | Security Posture | Covers many useful signals, including GitHub configuration, fuzzing, etc. | Yes | We may want to use different components of the Scorecard score rather than the aggregate score to increase quality. |

Proposal: Establish a process for improving the score

Rather than using a predefined algorithm and set of weights to produce a "final" criticality score that matches one person's or organization's perception of criticality, a process should be established for finding and reviewing different alternatives.

Improving the criticality score requires iteration and collaboration in the following areas:

| Area | Description |
| --- | --- |
| Raw signal collection | Updating the criticality_score code to pull new signal data, or improve the quality of existing signal data (e.g. adding the deps.dev project dependent count, or adding GitLab support). |
| Algorithm | Tweaking the existing algorithm, finding flaws, or experimenting with new approaches for combining signals to generate a criticality score. |
| Weights | Given an algorithm, the weights for each signal can be tuned to adjust their influence on the final score. Tuning weights is hard, and should reflect the strength of a signal towards impact/likelihood. ML approaches may help here. |
| Score evaluation and comparison | Taking the final score and comparing it to past or alternative scores, expert opinion and other research into critical projects. Consensus needs to be built here on any final score that is used. |

Public Signal Dataset

To facilitate iteration, the signal dataset should be publicly available and easy to query. Once collected, signal data should be populated into a public BigQuery (or equivalent) database that anyone can query.

Automated infrastructure to generate this dataset does not currently exist. Work will be started to build this infrastructure and enable automated, continuous generation of the dataset.
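Once such a dataset exists, exploring it could be as simple as the sketch below (the dataset, table, and column names are hypothetical placeholders; only the google-cloud-bigquery client usage is real):

```python
from google.cloud import bigquery

# Hypothetical table and columns; the real public dataset does not exist yet.
QUERY = """
SELECT repo_url, dependent_count, contributor_count, commit_frequency
FROM `openssf.criticality.signals`
ORDER BY dependent_count DESC
LIMIT 10
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(row.repo_url, row.dependent_count)
```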

Web Frontend

To facilitate exploring the dataset and comparing alternative approaches to calculating the criticality score, a web-based frontend could be built.

Some capabilities that could be useful include:

Evaluation

Determining whether or not a given criticality score is better or worse than another version of the score is difficult. Individual reviewers are naturally going to be looking for the output to match their own expectations. Care needs to be taken not to overfit.

Some approaches that may be taken include:

It is worth noting that many of the data sources used for evaluation could also be incorporated as signals.

Finally, any ML-based approach to scoring criticality from raw signals will depend on having a clear set of training data, which relates closely to this problem of evaluation.

Appendix: Experimentation

To evaluate the previous criticality_score, and to determine whether a risk-based criticality score has merit, some experimentation was done using the existing all.csv aggregate data.

The code to calculate the criticality scores was re-implemented so it could be calculated from the signals in the CSV file.

The two Google Sheets (access on request) show the output based on these experiments.

4 is likely the "best so far", although more work needs to be put into comparing it to the results in 5.

Appendix: Alternative Aggregations

Multiplicative

An alternative to Rob Pike's algorithm is:

$$(1 - score) ^ n = \prod_{i=1}^{n}(1 - x_i)$$

Re-written to calculate $score$:

$$score = 1 - \left(\prod_{i=1}^{n}(1 - x_i)\right)^\frac{1}{n}$$

Where:

$$x_i = \left(\frac{w_i}{max(W)}\right)\left(\frac{f(min(max(s_i - l_i, 0), u_i - l_i))}{f(u_i - l_i)}\right)$$

$s_i$ = signal value, where bigger is better

$l_i$ = signal lower bound

$u_i$ = signal upper bound

$w_i$ = signal weight, where $w_i \in W$

$W$ = set of weights for all signals

$f(x)$ = a function applied to the signal, examples are $f(x) = \log{(1 + x)}$ and $f(x) = x$. Different signals may suit different functions.
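A minimal sketch of this aggregation in Python (hypothetical function names; $f$ defaults to $\log(1 + x)$):

```python
import math

def multiplicative_score(signals, f=lambda x: math.log(1 + x)):
    """Sketch of the multiplicative aggregation above.

    `signals` is a list of (value, lower, upper, weight) tuples. Each value
    is clamped to [lower, upper], transformed by f, scaled by its relative
    weight, and the per-signal terms are combined via the product form.
    """
    max_weight = max(w for _, _, _, w in signals)
    product = 1.0
    for s, lower, upper, w in signals:
        clamped = min(max(s - lower, 0), upper - lower)
        x = (w / max_weight) * (f(clamped) / f(upper - lower))
        product *= (1.0 - x)
    return 1.0 - product ** (1.0 / len(signals))

# Illustrative values: (signal value, lower bound, upper bound, weight)
print(multiplicative_score([(120, 0, 1000, 1.0), (30000, 0, 500000, 2.0)]))
```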

Vector Magnitude

Score could be considered as the magnitude of a vector in n-dimensional space, where each dimension is a weighted signal. Ideally the vector would be normalized to the unit n-sphere.
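A minimal sketch of this idea, assuming each weighted signal has already been normalized into [0, 1] (names are hypothetical):

```python
import math

def vector_magnitude_score(weighted_signals):
    """Treat each weighted, normalized signal as one dimension and take the
    Euclidean norm, divided by sqrt(n) so the result stays within [0, 1]."""
    n = len(weighted_signals)
    return math.sqrt(sum(x * x for x in weighted_signals)) / math.sqrt(n)

print(vector_magnitude_score([0.8, 0.3, 0.5]))  # ~0.57
```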

david-a-wheeler commented 2 years ago

I suggest calling this a different name.

There's previous work (scorecard, etc.), so you ought to look at those. That said, it's not a solved problem, so trying to work on it is not insane :-).

david-a-wheeler commented 2 years ago

You probably ought to talk with the Security Threats WG, which is interested in creating dashboards to help make decisions about using a package.

j--- commented 2 years ago

It makes sense to want to provide guidance for how projects can improve. I'm having some trouble with likelihood as an actual probability here. How would the output of this proposal be used? Since it's multiplicative, would that mean that something that scores a 0.5 should get half as much funding as something that scores a 1?

rhit-swartwba commented 1 year ago

@calebbrown Hi,

I am a student currently researching this criticality algorithm for a summer research opportunity program. I am performing statistical analysis on the signals and have found your alternative algorithm to calculate the criticality score intriguing. I would like to compare the new calculated criticality scores from this algorithm to the previous ones. Could you please give me access to the google sheets mentioned under the 'Experiment' sections so I can easily compare the algorithms?

Thanks,

Blaise

calebbrown commented 1 year ago

Hi @rhit-swartwba, I'd be happy to help! Please reach out on calebbrown@google.com and we can discuss this further.