Open calebbrown opened 2 years ago
I suggest calling this a different name.
There's previous work (scorecard, etc.), so you ought to look at those. That said, it's not a solved problem, so trying to work on it is not insane :-).
You probably ought to talk with the Security Threats WG, which is interested in creating dashboards to help make decisions about using a package.
It makes sense to want to provide guidance for how projects can improve. I'm having some trouble with likelihood as an actual probability here. How would the output of this proposal be used? Since it's multiplicative, would that mean that something that scores a 0.5 should get half as much funding as something that scores a 1?
@calebbrown Hi,
I am a student currently researching this criticality algorithm for a summer research opportunity program. I am performing statistical analysis on the signals and have found your alternative algorithm to calculate the criticality score intriguing. I would like to compare the new calculated criticality scores from this algorithm to the previous ones. Could you please give me access to the google sheets mentioned under the 'Experiment' sections so I can easily compare the algorithms?
Thanks,
Blaise
Hi @rhit-swartwba, I'd be happy to help! Please reach out on calebbrown@google.com and we can discuss this further.
OSS Criticality Score and Security Risk
Last Updated: 2022-02-23
Status: Draft
Goal
Non-goals
Changes to infrastructure and the code, tools and systems used for gathering and processing signals will be covered elsewhere.
Background
Security Risk Rating
Rating the risk of a given event is usually determined by the basic formula:
$$risk = impact \times likelihood$$
Where $likelihood$ is defined as how frequently the event may occur, and $impact$ is the cost incurred when the event occurs (see Mozilla's documentation for a reasonable summary).
For security, risk ratings are usually based on predefined frameworks, such as CVSS or OWASP's Risk Rating Methodology.
Criticality Projects
The OpenSSF has a Working Group (WG) focused on Securing Critical Projects. A key part of this WG is focused on determining which Open Source projects are "critical". Critical Open Source projects are those which are broadly depended on by organizations, and present a security risk to those organizations, and their customers, if they are not supported.
Currently, inputs for determining critical projects come from domain experts, research (Harvard Census), and the criticality_score project this document is addressing.
Criticality Score Today
At the time of writing, the criticality_score project is a Python library, based around parsing data from the GitHub API (with some GitLab support).
The output of the criticality_score project is a CSV file of input signals and an output "criticality score": a number between 0 and 1 based on the aggregation of various weighted signals. The score is derived using an algorithm described by Rob Pike.
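For reference, a minimal sketch of that weighted-log aggregation, in the shape documented by the project; the signal names, weights and thresholds below are illustrative only, not the project's actual parameters:

```python
import math

def criticality(signals):
    """Aggregate signals into a score between 0 and 1.

    `signals` maps a signal name to (value, weight, threshold), mirroring
    the alpha_i, S_i and T_i parameters of the published formula.
    """
    total_weight = sum(weight for _, weight, _ in signals.values())
    score = 0.0
    for value, weight, threshold in signals.values():
        # log(1 + S_i) / log(1 + max(S_i, T_i)) caps each signal's
        # contribution at 1 once the value reaches its threshold.
        score += weight * math.log(1 + value) / math.log(1 + max(value, threshold))
    return score / total_weight

# Illustrative values only.
print(criticality({
    "commit_frequency": (120, 1.0, 1000),
    "contributor_count": (40, 2.0, 5000),
}))
```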
Main problems
The current implementation unfortunately suffers from some data issues. These were found by exploring and evaluating the output data (specifically the all.csv file linked from the repo). The key issues include:
- The `updated_since` field (and all "smaller is more critical" signals) is handled poorly by the current algorithm; applying `value = threshold - min(value, threshold)` fixes this issue.

Other observations about the quality of the score:

- The algorithm applies `f_i(x_i) = log(1 + x_i)` to adjust signals based on the assumption that the data generally follows a "Zipfian distribution"; this is not necessarily the case for all signals.
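A minimal sketch of that fix, mapping a "smaller is more critical" value onto the same "bigger is better" scale used by the other signals (the threshold below is illustrative):

```python
def invert_smaller_is_better(value, threshold):
    # A value of 0 (most critical) maps to `threshold`, and anything at or
    # beyond `threshold` maps to 0, so the usual scaling can then be applied.
    return threshold - min(value, threshold)

# e.g. an updated_since of 0 months scores highest, 200 months scores 0.
assert invert_smaller_is_better(0, 120) == 120
assert invert_smaller_is_better(200, 120) == 0
```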
Proposal: View Criticality Score as Risk
How can the criticality score be improved?
At the heart of the criticality score and Securing Critical Projects WG is finding the Open Source projects that pose the highest security risk so that limited resources can be focused on supporting those projects.
Reframing criticality around risk provides a framework for evaluating which of the various signals available may be suitable for calculating a criticality score. Currently, "impact" is effectively the only measure the criticality score represents, as each of the current signals contributes to finding the most active and popular projects. By incorporating a "likelihood" the score can be improved to surface projects that may be easier to exploit than other projects.
Impact
Impact is usually defined as the cost (financial, reputation, etc) incurred by an organization and its users/customers if a particular event occurs.
Ideally, to accurately assign the impact of a given Open Source project being compromised we would enumerate every instance where the project is used, all dependents (platform, supply chain, or code) and the systems and data affected.
For example: Xpdf's JBIG2 code is included in Apple's CoreGraphics and used on billions of iOS devices. Therefore a vulnerability in the Xpdf project would have a high impact. The NSO Group used a vulnerability in Xpdf as the basis of a zero-click RCE (see googleprojectzero.blogspot.com/2021/12/a-deep-dive-into-nso-zero-click.html).
Unfortunately, without access to an omniscient oracle, this is impossible. So other signals need to be used to infer the impact of an Open Source project being compromised. Such signals might include dependent counts, contributor counts, and project age.
In the future access to SBOMs, machine readable OSS license manifests and other sources of data more closely linked to how Open Source projects are being used may improve the accuracy of calculating impact.
Likelihood
Likelihood is usually defined as how frequently or likely a particular event occurs. In information security this covers aspects such as what preconditions are required for compromise (e.g. local/remote access, auth/unauth), and how easy or hard something is to exploit (e.g. default config).
Assigning the likelihood of a given Open Source project being compromised is difficult. Unlike CVSS, we are not scoring likelihood based on a known vulnerability. Instead signals that measure the health and security of the project go into determining how likely a compromise may occur.
However, like CVSS, we do not know how every piece of software is used "in the wild", so any score provided can only be used to provide a very general indication of likelihood.
Additionally, the nature of attacks on large projects will differ from attacks on small projects. A large project (e.g. Chrome or Linux) is more likely to have accidental vulnerabilities introduced, while small projects are more susceptible to intentional exploitation. The likelihood of accidental vs intentional vulnerabilities makes comparison harder.
Grouping Likelihood Signals
Beyond introducing a distinct "likelihood" into the score, the likelihood signals can be categorized to provide a more nuanced ranking.
The two categories that might be chosen are:
Categorizing likelihood allows us to solve a key challenge of the criticality score - being able to distinguish between:
Furthermore, likelihood category values can be combined by treating the values as the probability of the event occurring in at least one of the given categories:
i.e. (assuming the categories are independent):

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = P(A) + P(B) - P(A)P(B)$$
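A minimal sketch of that combination, treating each category value as an independent probability (the category names and values below are illustrative):

```python
def combine_likelihoods(categories):
    """Probability that a compromise occurs in at least one category.

    Computed as 1 - product(1 - p), which reduces to
    P(A) + P(B) - P(A)P(B) for two independent categories.
    """
    p_none = 1.0
    for p in categories.values():
        p_none *= 1.0 - p
    return 1.0 - p_none

print(combine_likelihoods({"vulnerability": 0.3, "maintainer": 0.2}))  # ≈ 0.44
```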
Signals
Choosing Good Signals
The quality of the criticality score depends on the quality of the signals that go into producing the score. An ideal signal is clear and comparable across all projects.
Clear
Good signals should be clear and unambiguous in how they contribute to the security risk of the project. There should be a high signal to noise ratio.
Signals should apply clearly to either impact or a likelihood category. For example, a high number of reported bugs could mean the project has many bugs, or simply that it has many contributors reporting them.
In some cases the signal may be improved by eliminating a confounding variable. In the example above, if we assume that `reported_bugs = contributors * actual_bugs`, then dividing by `contributors` may return a more accurate count.
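A minimal sketch of that adjustment, under the stated (assumed) linear relationship between reported bugs and contributor count:

```python
def bugs_per_contributor(reported_bugs, contributors):
    # If reported_bugs ≈ contributors * actual_bugs, dividing out the
    # contributor count removes the confounder, leaving an estimate
    # proportional to the actual bug count.
    return reported_bugs / max(contributors, 1)

print(bugs_per_contributor(reported_bugs=500, contributors=250))  # 2.0
```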
Additionally, a high value and a low value for a given signal shouldn't have the same meaning. For example, a high commit_frequency could indicate lots of new code, increasing likelihood, but a low commit_frequency might indicate the project is unmaintained, also increasing likelihood.
Comparable
For a global ranking to be produced, each signal should be comparable to the same signal from another project. Factors like age and release frequency can cause a project to be over- or under-represented in the resulting score.
Project Activity
How a project operates can cause large differences in the signals used to produce the criticality score.
For example, a small library may have many frequent small releases each month, however a large framework may have a few large releases once or twice a year.
Ecosystem Differences
Different ecosystems may cause different signals to be larger or smaller than those from other ecosystems.
For example, projects with packages on NPM have far higher dependent counts than other ecosystems. This is due to how NPM handles dependencies and the prevalence of tiny, single-feature packages in that ecosystem. Conversely, C and C++ have no central package management system, so there is no good way to determine dependent counts at all.
The two ways to solve this issue are to:

- consider each ecosystem in isolation, or
- normalize impact and/or likelihood values so that ecosystems can be compared directly.
Considering ecosystems in isolation helps ensure that one does not dominate the others; however, at some point either the criticality scores need to be merged, or resources divided amongst the ecosystems.
Normalizing impact and/or likelihood is plausible, but hard to get correct. This likely involves analyzing the distribution of values for each ecosystem and finding a scaling factor that makes them comparable.
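One possible shape for such a normalization, sketched under the assumption that a rank-based (percentile) transform within each ecosystem is an acceptable way to make dependent counts comparable:

```python
from collections import defaultdict

def percentile_normalize(projects):
    """Replace each project's raw dependent count with its percentile within
    its own ecosystem, so that NPM's inflated counts no longer dominate.

    `projects` is a list of (name, ecosystem, dependents) tuples; the result
    maps each name to a value in (0, 1].
    """
    by_ecosystem = defaultdict(list)
    for _, ecosystem, dependents in projects:
        by_ecosystem[ecosystem].append(dependents)
    for counts in by_ecosystem.values():
        counts.sort()
    normalized = {}
    for name, ecosystem, dependents in projects:
        counts = by_ecosystem[ecosystem]
        rank = sum(1 for c in counts if c <= dependents)
        normalized[name] = rank / len(counts)
    return normalized

# Illustrative data only.
print(percentile_normalize([
    ("tiny-npm-package", "npm", 900_000),
    ("big-npm-package", "npm", 12_000_000),
    ("some-pypi-package", "pypi", 400_000),
]))
```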
Example: Comparability of Dependent Count by Ecosystem
Below is a cumulative density plot of dependents for projects by language. The distributions are similar across languages, suggesting that the ecosystems could be compared after some normalization.
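A sketch of how such a plot could be reproduced from the aggregate CSV, assuming it exposes `language` and `dependents_count` columns (the file path and column names are assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.read_csv("all.csv")  # the aggregate signal file referenced earlier

for language, group in df.groupby("language"):
    # Empirical CDF: sort dependent counts and plot the cumulative share.
    counts = np.sort(group["dependents_count"].to_numpy())
    cumulative = np.arange(1, len(counts) + 1) / len(counts)
    plt.plot(counts, cumulative, label=language)

plt.xscale("log")
plt.xlabel("dependents")
plt.ylabel("cumulative share of projects")
plt.legend()
plt.show()
```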
Coverage
High coverage of a signal across projects is also important for being able to compare projects to each other. If only some projects have a particular signal, then the score may be biased towards or away from those projects.
Candidate Signals
- Covers all ecosystems.
- Old commits should be ignored to prevent old projects being over-represented (see the sketch after this list).
- An alternative to GitHub search may help here.
- Normalization between ecosystems is necessary.
- Old package versions should be ignored to prevent old projects being over-represented.
- Contributors that have not contributed within a fixed time period should be ignored.
- May need to normalize the signal based on language (e.g. C and C++ lack memory safety).
- Needs to be normalized to eliminate the impact the number of contributors has.
- Issue age needs to be limited.
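A minimal sketch of the kind of time-windowing several of these notes call for, assuming raw event timestamps (commits, releases, contributions, issue updates) are available; the window length is illustrative:

```python
from datetime import datetime, timedelta, timezone

def recent_count(timestamps, window_days):
    """Count only the events inside a fixed lookback window, so that
    long-lived projects are not over-represented by their history."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=window_days)
    return sum(1 for ts in timestamps if ts >= cutoff)

# e.g. count only commits from the last two years.
commits = [
    datetime.now(timezone.utc) - timedelta(days=30),
    datetime.now(timezone.utc) - timedelta(days=3000),
]
print(recent_count(commits, window_days=730))  # 1
```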
Proposal: Establish a process for improving the score
Rather than using a predefined algorithm and set of weights to produce a "final" criticality score designed to match one person's or organization's perception of criticality, a process should be established for finding and reviewing different alternatives.
Improving the criticality score requires iteration and collaboration in the following areas:

- Tuning weights is hard, and should reflect the strength of a signal towards impact/likelihood. ML approaches may help here.
- Consensus needs to be built on any final score that is used.
Public Signal Dataset
To facilitate iteration the signal dataset should be publicly available and easy to query. Once collected, signal data should be populated into a public BigQuery (or equivalent) database that anyone from the public can query.
Automated infrastructure to generate this dataset does not currently exist. Work will be started to build this infrastructure and enable automated, continuous generation of the dataset.
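For example, once the data is in BigQuery, anyone could query it with the standard client libraries. The table and column names below are hypothetical placeholders, since the real schema would be defined by the infrastructure described above:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and columns, for illustration only.
query = """
    SELECT project_name, contributor_count, dependents_count, commit_frequency
    FROM `example-project.criticality.signals`
    ORDER BY dependents_count DESC
    LIMIT 100
"""

for row in client.query(query).result():
    print(row.project_name, row.dependents_count)
```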
Web Frontend
To facilitate exploring the dataset and comparing alternative approaches to calculating the criticality score, a web-based frontend could be built.
Some capabilities that could be useful include:
Evaluation
Determining whether or not a given criticality score is better or worse than another version of the score is difficult. Individual reviewers are naturally going to be looking for the output to match their own expectations. Care needs to be taken not to overfit.
Some approaches that may be taken include:
It is worth noting that many data sources for evaluation could be incorporated as a signal as well.
Finally, any ML based approach to scoring criticality from raw signals will depend on having a clear set of training data, which relates closely to this problem of evaluation.
Appendix: Experimentation
To evaluate the previous criticality_score, and to determine whether a risk based criticality score has merit, some experimentation was done using the existing all.csv aggregate data.
The code to calculate the criticality scores was re-implemented so that the scores could be recalculated from the signals in the CSV file.
The two Google Sheets (access on request) show the output based on these experiments.

One experiment uses the `risk = impact * likelihood` approach described above. Impact is calculated using contributor_count, dependents_count ("GitHub commit mentions"), created_since and updated_since (to filter stale projects). The impact signals are aggregated using Rob Pike's algorithm. Likelihood uses `issues / contributor_count` for a security likelihood, and commit_frequency (lower is more critical) as a signal for maintainer likelihood. The likelihoods are treated as probabilities and combined using `P(A ∪ B)`. logging-log4j2 is ranked 76, and this is likely the "best so far", although more work needs to be put into comparing it to the results in 5.
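A minimal sketch of how the pieces of that experiment fit together; the weights, thresholds and squashing functions below are illustrative, not the values used in the sheets:

```python
import math

def risk_score(s):
    """risk = impact * likelihood, using the signal mix described above.

    `s` maps signal names (contributor_count, dependents_count, created_since,
    issues, commit_frequency) to raw values. All constants are illustrative.
    """
    # Impact: weighted log aggregation of the popularity/usage signals.
    impact_signals = [
        (s["contributor_count"], 2.0, 5000),
        (s["dependents_count"], 2.0, 500_000),
        (s["created_since"], 1.0, 120),
    ]
    total_weight = sum(w for _, w, _ in impact_signals)
    impact = sum(
        w * math.log(1 + v) / math.log(1 + max(v, t)) for v, w, t in impact_signals
    ) / total_weight

    # Likelihoods squashed into [0, 1) so they can be treated as probabilities.
    security = s["issues"] / (s["issues"] + s["contributor_count"] + 1)
    maintainer = 1.0 / (1.0 + s["commit_frequency"])  # low activity -> higher likelihood
    likelihood = 1 - (1 - security) * (1 - maintainer)  # P(A ∪ B) under independence

    return impact * likelihood

# Illustrative values only.
print(risk_score({
    "contributor_count": 40, "dependents_count": 12_000,
    "created_since": 60, "issues": 200, "commit_frequency": 3,
}))
```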
Appendix: Alternative Aggregations
Multiplicative
An alternative to Rob Pike's algorithm is:
$$(1 - score) ^ n = \prod_{i=1}^{n}(1 - x_i)$$
Re-written to calculate $score$:
$$score = 1 - \left(\prod_{i=1}^{n}(1 - x_i)\right)^\frac{1}{n}$$
Where:
$$x_i = \left(\frac{w_i}{max(W)}\right)\left(\frac{f(min(max(s_i - l_i, 0), u_i - l_i))}{f(u_i - l_i)}\right)$$
$s_i$ = signal value, where bigger is better
$l_i$ = signal lower bound
$u_i$ = signal upper bound
$w_i$ = signal weight, where $w_i \in W$
$W$ = set of weights for all signals
$f(x)$ = a function applied to the signal, examples are $f(x) = \log{(1 + x)}$ and $f(x) = x$. Different signals may suit different functions.
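A minimal sketch of this aggregation, with the signal bounds, weights and transform supplied by the caller (the example values are illustrative):

```python
import math

def multiplicative_score(signals, f=lambda x: math.log(1 + x)):
    """Implements score = 1 - (prod(1 - x_i)) ** (1 / n) from above.

    `signals` is a list of (value, lower, upper, weight) tuples and `f` is
    the per-signal transform (log(1 + x) here; f(x) = x also works).
    """
    max_weight = max(weight for _, _, _, weight in signals)
    product = 1.0
    for value, lower, upper, weight in signals:
        clamped = min(max(value - lower, 0), upper - lower)
        x = (weight / max_weight) * (f(clamped) / f(upper - lower))
        product *= 1.0 - x
    return 1.0 - product ** (1.0 / len(signals))

# Illustrative (value, lower bound, upper bound, weight) tuples.
print(multiplicative_score([(120, 0, 1000, 1.0), (40, 0, 5000, 2.0)]))
```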
Vector Magnitude
The score could be considered as the magnitude of a vector in n-dimensional space, where each dimension is a weighted signal. Ideally the vector would be normalized to the unit n-sphere.
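A minimal sketch of that idea, assuming each signal has already been scaled into [0, 1]; rescaling by the maximum possible magnitude stands in for the normalization to the unit n-sphere mentioned above:

```python
import math

def vector_magnitude_score(weighted_signals):
    """Treat each weighted, pre-scaled signal as one dimension and use the
    Euclidean norm as the score, rescaled so that a project with every
    signal at 1.0 scores exactly 1."""
    magnitude = math.sqrt(sum((w * x) ** 2 for x, w in weighted_signals))
    max_magnitude = math.sqrt(sum(w ** 2 for _, w in weighted_signals))
    return magnitude / max_magnitude

# Illustrative (signal in [0, 1], weight) pairs.
print(vector_magnitude_score([(0.8, 2.0), (0.3, 1.0)]))
```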