ossf / criticality_score

Gives criticality score for an open source project
Apache License 2.0

Doc: Criticality Score and Security Risk, Improving Criticality Score. #102

Open calebbrown opened 2 years ago

calebbrown commented 2 years ago

OSS Criticality Score and Security Risk

Last Updated: 2022-02-23 Status: Draft

Goal

  1. Evaluate the quality of the existing score produced by the criticality_score project.
  2. Propose changes to how the criticality score is determined to improve the overall quality.

Non-goals

Changes to infrastructure and the code, tools and systems used for gathering and processing signals will be covered elsewhere.

Background

Security Risk Rating

Rating the risk of a given event is usually determined by the basic formula:

$$risk = impact \times likelihood$$

Where $likelihood$ is defined as how frequently the event may occur, and $impact$ is the cost incurred when the event occurs (see Mozilla's documentation for a reasonable summary).

For security, risk ratings are usually based on predefined frameworks, such as CVSS or OWASP's Risk Rating Methodology.

Criticality Projects

The OpenSSF has a Working Group (WG) focused on Securing Critical Projects. A key part of this WG is focused on determining which Open Source projects are "critical". Critical Open Source projects are those which are broadly depended on by organizations, and present a security risk to those organizations, and their customers, if they are not supported.

Currently, inputs for determining critical projects come from domain experts, research (the Harvard Census), and the criticality_score project that this document addresses.

Criticality Score Today

At the time of writing, the criticality_score project is a Python library, based around parsing data from the GitHub API (with some GitLab support).

The output of the criticality_score project is a CSV file of input signals and a resulting "criticality score": a number between 0 and 1 based on an aggregation of various weighted signals. The score is derived using an algorithm described by Rob Pike.
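For reference, the aggregation works roughly as follows (a minimal sketch of the weighted, log-scaled form described in the project README; the function and example values are illustrative, not the project's actual API):

```python
import math

def criticality_score(signals):
    """Sketch of the weighted log-scaled aggregation.

    `signals` is a list of (value, weight, threshold) tuples: each signal
    value S is scaled as log(1 + S) / log(1 + max(S, T)) so it saturates
    at 1 once it passes its threshold T, then a weighted average is taken.
    """
    total_weight = sum(weight for _, weight, _ in signals)
    total = 0.0
    for value, weight, threshold in signals:
        total += weight * math.log(1 + value) / math.log(1 + max(value, threshold))
    return total / total_weight

# Illustrative values: (signal value, weight, threshold)
print(criticality_score([
    (120, 1.0, 1000),      # e.g. contributor count
    (30, 2.0, 1000),       # e.g. commit frequency
    (50000, 0.5, 500000),  # e.g. dependent count
]))
```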

Main problems

The current implementation unfortunately suffers from some data issues. The following are the key issues. These were found by exploring and evaluating the output data (specifically the all.csv file linked from the repo).

Other observations about the quality of the score:

Proposal: View Criticality Score as Risk

How can the criticality score be improved?

At the heart of the criticality score and Securing Critical Projects WG is finding the Open Source projects that pose the highest security risk so that limited resources can be focused on supporting those projects.

Reframing criticality around risk provides a framework for evaluating which of the various signals available may be suitable for calculating a criticality score. Currently, "impact" is effectively the only measure the criticality score represents, as each of the current signals contributes to finding the most active and popular projects. By incorporating a "likelihood" the score can be improved to surface projects that may be easier to exploit than other projects.

Impact

Impact is usually defined as the cost (financial, reputation, etc) incurred by an organization and its users/customers if a particular event occurs.

Ideally, to accurately assign the impact of a given Open Source project being compromised we would enumerate every instance where the project is used, all dependents (platform, supply chain, or code) and the systems and data affected.

For example: Xpdf's JBIG2 code is included in Apple's CoreGraphics and used on billions of iOS devices. Therefore a vulnerability in the Xpdf project would have a high impact. The NSO Group used a vulnerability in Xpdf as the basis of a zero-click RCE (see googleprojectzero.blogspot.com/2021/12/a-deep-dive-into-nso-zero-click.html).

Unfortunately, without access to an omniscient oracle, this is impossible. So other signals need to be used to infer the impact of an Open Source project being compromised. Such signals might include dependent counts, contributor counts, and project age.

In the future access to SBOMs, machine readable OSS license manifests and other sources of data more closely linked to how Open Source projects are being used may improve the accuracy of calculating impact.

Likelihood

Likelihood is usually defined as how frequently or likely a particular event occurs. In information security this covers aspects such as what preconditions are required for compromise (e.g. local/remote access, auth/unauth), and how easy or hard something is to exploit (e.g. default config).

Assigning the likelihood of a given Open Source project being compromised is difficult. Unlike CVSS, we are not scoring likelihood based on a known vulnerability. Instead, signals that measure the health and security of the project are used to determine how likely a compromise is to occur.

However, like CVSS, we do not know how every piece of software is used "in the wild", so any score provided can only be used to provide a very general indication of likelihood.

Additionally, the nature of attacks on large projects will differ from attacks on small projects. A large project (e.g. Chrome or Linux) is more likely to have accidental vulnerabilities introduced, while a small project is more susceptible to intentionally introduced vulnerabilities. This difference between accidental and intentional vulnerabilities makes comparing likelihood across projects harder.

Grouping Likelihood Signals

Beyond introducing a distinct "likelihood" component to the score, the likelihood signals can be categorized to provide a more nuanced ranking.

The two categories that might be chosen are:

  1. Maintainer Health
  2. Security Posture

Categorizing likelihood allows us to solve a key challenge of the criticality score - being able to distinguish between:

Furthermore, likelihood category values can be combined by treating each value as the probability of a compromise occurring in that category, and computing the probability of a compromise occurring in at least one of the categories (assuming the categories are independent):

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = P(A) + P(B) - P(A)P(B)$$
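A minimal sketch of this combination in Python (a hypothetical helper, assuming each category value is already a probability in [0, 1] and the categories are independent):

```python
def combine_likelihoods(category_probs):
    """Probability that a compromise occurs in at least one category,
    given independent per-category probabilities."""
    p_no_compromise = 1.0
    for p in category_probs:
        p_no_compromise *= (1.0 - p)
    return 1.0 - p_no_compromise

# e.g. maintainer-health likelihood 0.3 and security-posture likelihood 0.2
print(combine_likelihoods([0.3, 0.2]))  # 0.44 == 0.3 + 0.2 - 0.3 * 0.2
```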

Signals

Choosing Good Signals

The quality of the criticality score depends on the quality of the signals that go into producing the score. An ideal signal is clear and comparable across all projects.

Clear

Good signals should be clear and unambiguous in how they contribute to the security risk of the project. There should be a high signal to noise ratio.

Signals should apply clearly to either impact or one of the likelihood categories. For example, a high number of reported bugs is ambiguous: it could mean the project has many bugs, or simply that it has many contributors reporting them.

In some cases the signal may be improved by eliminating a confounding variable. In the example above, if we assume that reported_bugs = contributors * actual_bugs then dividing by contributors may return a more accurate count.
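A minimal sketch of that adjustment (hypothetical names; the model reported_bugs = contributors * actual_bugs is the simplification above):

```python
def adjusted_bug_count(reported_bugs, contributors):
    """Divide out the confounding contributor count, under the simplified
    model that reported bugs scale linearly with contributors."""
    return reported_bugs / max(contributors, 1)

print(adjusted_bug_count(reported_bugs=250, contributors=50))  # 5.0
```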

Additionally, a high value and a low value for a given signal shouldn't have the same meaning. For example, a high commit_frequency could indicate lots of new code, increasing likelihood, but a low commit_frequency might indicate the project is unmaintained, also increasing likelihood.

Comparable

For a global ranking to be produced, each signal should be comparable to the same signal from another project. Factors like age and release frequency can cause a project to be over- or under-represented in the resulting score.

Project Activity

How a project operates can cause large differences in the signals used to produce the criticality score.

For example, a small library may have many frequent small releases each month, whereas a large framework may have a few large releases once or twice a year.

Ecosystem Differences

The same signal may be systematically larger or smaller in one ecosystem than in another.

For example, projects with packages on NPM have far higher dependent counts than projects in other ecosystems. This is due to how NPM handles dependencies and the prevalence of tiny, single-feature packages in that ecosystem. Conversely, C and C++ lack a standard package management system and have no good way to determine dependent counts at all.

There are two ways to address this issue:

  1. Consider each ecosystem in isolation
  2. Normalize the impact and/or likelihood based on the ecosystems the project belongs to

Considering ecosystems in isolation helps ensure that one does not dominate all others; however, at some point either the per-ecosystem criticality scores need to be merged, or resources divided amongst the ecosystems.

Normalizing impact and/or likelihood is plausible, but hard to get correct. This likely involves analyzing the distribution of values for each ecosystem and finding a scaling factor that makes them comparable.
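For example, one plausible (hypothetical) normalization is to replace each raw value with its percentile rank within its own ecosystem:

```python
from collections import defaultdict

def percentile_rank_by_ecosystem(projects):
    """Map {project: (ecosystem, raw_value)} to {project: rank in [0, 1]},
    where the rank is computed only against projects in the same ecosystem.
    Illustrative sketch; a real pipeline would run over the full dataset.
    """
    values_by_eco = defaultdict(list)
    for eco, value in projects.values():
        values_by_eco[eco].append(value)

    ranks = {}
    for name, (eco, value) in projects.items():
        peers = values_by_eco[eco]
        ranks[name] = sum(1 for v in peers if v <= value) / len(peers)
    return ranks

print(percentile_rank_by_ecosystem({
    "express": ("npm", 2_000_000),
    "left-pad": ("npm", 500_000),
    "requests": ("pypi", 300_000),
}))
```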

Example: Comparability of Dependent Count by Ecosystem

Below is a cumulative density plot of dependent counts for projects, by language. The distributions are all similar, suggesting that each ecosystem could be compared with some normalization.

[Figure: cumulative distribution of project dependent counts by language]

Coverage

High coverage of a signal across projects is also important for being able to compare projects to each other. If only some projects have a particular signal, then the score may be biased away or towards those projects.

Candidate Signals

| Signal | Usage | Why | Availability | Signal Quality |
| --- | --- | --- | --- | --- |
| GitHub commit mentions | Impact | Mentioning a project in a commit indicates impact. Covers all ecosystems. | Yes | Very noisy signal, promotes irrelevant projects. Old commits should be ignored to prevent old projects being over-represented. An alternative to GitHub search may help here. |
| deps.dev dependent count | Impact | Each dependent is another project that a security issue impacts. | Yes | Coverage is a key limitation. Only a fraction of projects are mapped to GitHub repos. Normalization between ecosystems is necessary. Old package versions should be ignored to prevent old projects being over-represented. |
| Contributor count | Impact | More contributors indicates more interest, and more potential places where a security issue will have an impact. | Yes | High. Old contributors that have not contributed within a fixed time period should be ignored. |
| Project age | Impact | An older project has a higher chance of being more broadly deployed. | Yes | High. Max age should have an upper bound when using this signal. |
| Recently updated | Impact | Provides a balance to "age" by ensuring old but dead/deprecated projects are not promoted. | Yes | Care should be taken when setting upper and lower bounds on this signal so that it doesn't overly favor projects updated very recently. |
| Maintainer count (i.e. merging PRs, "write" access, etc) | Maintainer Health | A low "bus factor" and a lack of code review increase the likelihood of a security issue. | Unknown | High. |
| Corporate ownership | Maintainer Health | Corporate ownership lowers the chance of a maintainer-health-related security issue. | Unknown | Unknown. It is plausible that a corporate owner may neglect projects as much as any other owner. |
| Recent maintainer activity | Maintainer Health | PRs merged, issues closed or updated, etc. are all signs that a maintainer is actively working on their project. | Unknown | This signal needs to be normalized to account for the different behavior of each maintainer. |
| Last release age | Maintainer Health | A lack of any recent release may be a sign that a maintainer is not working on their project. | Maybe (GitHub releases, deps.dev data) | This signal needs to be normalized to account for the different behavior of each project. E.g. it may be worth calculating a "mean time between releases" and checking whether the current age of the last release is much older. |
| Lines of code | Security Posture | More code => more bugs. Bugs raise the chance of a security issue. | Yes | Must exclude non-code files (e.g. documentation). May need to normalize the signal based on language (e.g. C and C++ lack memory safety). |
| Issues reported | Security Posture | More issues may mean more bugs. Bugs raise the chance of a security issue. | Yes | Coverage is a limitation. Some projects host their issue tracker separately to their source repository. Needs to be normalized to eliminate the impact the number of contributors has. Issue age needs to be limited. |
| Scorecards score | Security Posture | Covers many useful signals, including GitHub configuration, fuzzing, etc. | Yes | We may want to use different components of the Scorecard score rather than the aggregate score to increase quality. |

Proposal: Establish a process for improving the score

Rather than using a predefined algorithm and set of weights to produce a "final" criticality score that matches one person's or organization's perception of criticality, a process should be established for finding and reviewing different alternatives.

Improving the criticality score requires iteration and collaboration in the following areas:

| Area | Description |
| --- | --- |
| Raw signal collection | Updating the criticality_score code to pull new signal data, or improve the quality of existing signal data (e.g. adding the deps.dev project dependent count, or adding GitLab support). |
| Algorithm | Tweaking the existing algorithm, finding flaws, or experimenting with new approaches for combining signals to generate a criticality score. |
| Weights | Given an algorithm, the weights for each signal can be tuned to adjust their influence on the final score. Tuning weights is hard, and should reflect the strength of a signal towards impact/likelihood. ML approaches may help here. |
| Score evaluation and comparison | Taking the final score and comparing it to past or alternative scores, expert opinion and other research into critical projects. Consensus needs to be built here on any final score that is used. |

Public Signal Dataset

To facilitate iteration, the signal dataset should be publicly available and easy to query. Once collected, signal data should be populated into a public BigQuery (or equivalent) database that anyone can query.

Automated infrastructure to generate this dataset does not currently exist. Work will be started to build this infrastructure and enable automated, continuous generation of the dataset.
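Once such a dataset exists, exploring it could be as simple as the sketch below (the dataset, table, and column names are hypothetical placeholders; only the google-cloud-bigquery client usage is real):

```python
from google.cloud import bigquery

# Hypothetical table and columns; the real public dataset does not exist yet.
QUERY = """
SELECT repo_url, dependent_count, contributor_count, commit_frequency
FROM `openssf.criticality.signals`
ORDER BY dependent_count DESC
LIMIT 10
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(row.repo_url, row.dependent_count)
```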

Web Frontend

To facilitate exploring the dataset and comparing alternative approaches to calculating the criticality score, a web-based frontend could be built.

Some capabilities that could be useful include:

Evaluation

Determining whether or not a given criticality score is better or worse than another version of the score is difficult. Individual reviewers are naturally going to be looking for the output to match their own expectations. Care needs to be taken not to overfit.

Some approaches that may be taken include:

It is worth noting that many of the data sources used for evaluation could also be incorporated as signals.

Finally, any ML-based approach to scoring criticality from raw signals will depend on having a clear set of training data, which relates closely to this problem of evaluation.

Appendix: Experimentation

To evaluate the previous criticality_score, and to determine whether a risk-based criticality score has merit, some experimentation was done using the existing all.csv aggregate data.

The code to calculate the criticality scores was re-implemented so it could be calculated from the signals in the CSV file.

The two Google Sheets (access on request) show the output based on these experiments.

4 is likely the "best so far", although more work needs to be put into comparing it to the results in 5.

Appendix: Alternative Aggregations

Multiplicative

An alternative to Rob Pike's algorithm is:

$$(1 - score) ^ n = \prod_{i=1}^{n}(1 - x_i)$$

Re-written to calculate $score$:

$$score = 1 - \left(\prod_{i=1}^{n}(1 - x_i)\right)^\frac{1}{n}$$

Where:

$$x_i = \left(\frac{w_i}{max(W)}\right)\left(\frac{f(min(max(s_i - l_i, 0), u_i - l_i))}{f(u_i - l_i)}\right)$$

$s_i$ = signal value, where bigger is better

$l_i$ = signal lower bound

$u_i$ = signal upper bound

$w_i$ = signal weight, where $w_i \in W$

$W$ = set of weights for all signals

$f(x)$ = a function applied to the signal, examples are $f(x) = \log{(1 + x)}$ and $f(x) = x$. Different signals may suit different functions.
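A minimal sketch of this aggregation in Python (hypothetical function names; $f$ defaults to $\log(1 + x)$):

```python
import math

def multiplicative_score(signals, f=lambda x: math.log(1 + x)):
    """Sketch of the multiplicative aggregation above.

    `signals` is a list of (value, lower, upper, weight) tuples. Each value
    is clamped to [lower, upper], transformed by f, scaled by its relative
    weight, and the per-signal terms are combined via the product form.
    """
    max_weight = max(w for _, _, _, w in signals)
    product = 1.0
    for s, lower, upper, w in signals:
        clamped = min(max(s - lower, 0), upper - lower)
        x = (w / max_weight) * (f(clamped) / f(upper - lower))
        product *= (1.0 - x)
    return 1.0 - product ** (1.0 / len(signals))

# Illustrative values: (signal value, lower bound, upper bound, weight)
print(multiplicative_score([(120, 0, 1000, 1.0), (30000, 0, 500000, 2.0)]))
```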

Vector Magnitude

Score could be considered as the magnitude of a vector in n-dimensional space, where each dimension is a weighted signal. Ideally the vector would be normalized to the unit n-sphere.
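A minimal sketch of this idea, assuming each weighted signal has already been normalized into [0, 1] (names are hypothetical):

```python
import math

def vector_magnitude_score(weighted_signals):
    """Treat each weighted, normalized signal as one dimension and take the
    Euclidean norm, divided by sqrt(n) so the result stays within [0, 1]."""
    n = len(weighted_signals)
    return math.sqrt(sum(x * x for x in weighted_signals)) / math.sqrt(n)

print(vector_magnitude_score([0.8, 0.3, 0.5]))  # ~0.57
```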

david-a-wheeler commented 2 years ago

I suggest calling this a different name.

There's previous work (scorecard, etc.), so you ought to look at those. That said, it's not a solved problem, so trying to work on it is not insane :-).

david-a-wheeler commented 2 years ago

You probably ought to talk with the Security Threats WG, which is interested in creating dashboards to help make decisions about using a package.

j--- commented 2 years ago

It makes sense to want to provide guidance for how projects can improve. I'm having some trouble with likelihood as an actual probability here. How would the output of this proposal be used? Since it's multiplicative, would that mean that something that scores a 0.5 should get half as much funding as something that scores a 1?

rhit-swartwba commented 1 year ago

@calebbrown Hi,

I am a student currently researching this criticality algorithm for a summer research opportunity program. I am performing statistical analysis on the signals and have found your alternative algorithm to calculate the criticality score intriguing. I would like to compare the new calculated criticality scores from this algorithm to the previous ones. Could you please give me access to the google sheets mentioned under the 'Experiment' sections so I can easily compare the algorithms?

Thanks,

Blaise

calebbrown commented 1 year ago

Hi @rhit-swartwba, I'd be happy to help! Please reach out on calebbrown@google.com and we can discuss this further.