ossf / scorecard

OpenSSF Scorecard - Security health metrics for Open Source
https://scorecard.dev
Apache License 2.0

Feature: Change how scores are displayed #2466

pnacht commented 1 year ago

Is your feature request related to a problem? Please describe.
There's a discrepancy between how good a given score is and how it feels. A 7/10 feels like a passing grade at best, but it actually means, for example, that a project is in the top ~10% of the most relevant projects (or the top ~1% of all projects).

A few maintainers have been surprised to hear that they're actually doing a good job when they receive a good score.

https://github.com/twbs/bootstrap/pull/37402#issuecomment-1312663601:

That being said, 7.2 is not good enough either

https://github.com/numpy/numpy/pull/22482#issuecomment-1296207355:

The badge gives a number, 6.2 in our case. I'm not sure many people know how to interpret that number - it feels like a low score

Describe the solution you'd like
A score that feels as good as it actually is. My proposal is to either replace or supplement the current final score (7/10) with the corresponding quantile (top x%). The badge should also display the result as a quantile instead of (or alongside) the final score.

This would help everyone (maintainers and users alike) more accurately understand how solid a project's security posture is.

Even the top projects would have a better experience: I wager some users currently see urllib3's 9.3 and think "wow, that's pretty good, but it still clearly needs to improve something!", when their actual takeaway should be "wow, this is the most secure open-source project out there!"

Personally, I'd be in favor of the quantile simply supplementing the final score, precisely because (for example) urllib3 might be the most secure open-source project out there, but that missing 0.7 also points out there's room for improvement. In simple terms: show the quantile alongside the score, not instead of it.

Additional context

A first issue may be that the histogram of project scores isn't very nuanced: the chart below makes clear that GitHub's defaults give projects a score of around 4.5/10 (charts obtained via the public BigQuery data), so the ~1 million projects analyzed by Scorecard can basically be split into those that "did something to improve their security" (and are therefore in the "top ~1%") and those that "did something to weaken their security" (the "bottom ~1%").

[Figure: quantile plot for all projects analyzed by Scorecard]

However, if we focus on "important" projects, the chart becomes much more useful:

[Figure: quantile plot for the most relevant projects analyzed by Scorecard]

Naturally, this chart is heavily influenced by how we define "important". For the chart above, I defined it as projects with a criticality_score > 0.5. That cutoff was completely arbitrary and just so happens to include ~10,000 projects; whether it's appropriate, and whether criticality_score is the right tool, can (and should!) be discussed as well.

It's also worth mentioning that this curve is an almost perfect sigmoid, so calculating the quantile would be quite straightforward, though the equation's parameters may need to be updated over time (hopefully due to improving scores across the open-source ecosystem!):

[Figure: comparison of the relevant projects' quantile plot and an estimated sigmoid]

(the vertical axis goes from -25 to 125 because the estimated curve goes slightly above 100 and below 0, but that should be easy to clamp)
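
To make this concrete, here's a minimal sketch in Go of how a fitted sigmoid could map a final score to a "top X%" figure. The k (steepness) and x0 (midpoint) parameters below are made-up placeholders, not real fit results; actual values would be fitted from the BigQuery data and re-fitted over time:

```go
package main

import (
	"fmt"
	"math"
)

// scoreToPercentile maps a final Scorecard score (0-10) to an estimated
// percentile (the percentage of relevant projects scoring at or below it)
// using a logistic curve. k (steepness) and x0 (midpoint) are hypothetical
// placeholders standing in for parameters fitted to the real distribution.
func scoreToPercentile(score, k, x0 float64) float64 {
	p := 100 / (1 + math.Exp(-k*(score-x0)))
	// Clamp, since the fitted curve dips slightly below 0 and above 100.
	return math.Min(100, math.Max(0, p))
}

func main() {
	const k, x0 = 1.1, 4.5 // illustrative parameters only, not a real fit
	for _, s := range []float64{4.5, 6.2, 7.2, 9.3} {
		fmt.Printf("score %.1f -> top %.0f%%\n", s, 100-scoreToPercentile(s, k, x0))
	}
}
```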

laurentsimon commented 1 year ago

I like the idea. @spencerschrock @azeemsgoogle @naveensrinivasan wdyt?

di commented 1 year ago

Since completely replacing the X/10 score might be disruptive, we might want to explore supplementing these scores with a percentile, like:

  • 7/10 (90th percentile for this check)

spencerschrock commented 1 year ago

Since completely replacing the X/10 score might be disruptive, we might want to explore supplementing these scores with a percentile, like:

  • 7/10 (90th percentile for this check)

Is this for the badge, the results viewer, or the results themselves?

di commented 1 year ago

I'd say anywhere we display an X/10 score, we should do this as well -- we should file separate issues for the results viewer/badge as necessary.

pnacht commented 1 year ago

I'm not sure how valuable quantiles are for individual checks, especially given how many checks are "binary" (0 or 10). I also suspect (without looking at any data) that the distributions will be heavily skewed/distorted, which might lead to less nuanced quantiles (i.e. only having "top 1%" or "top 99%" quantiles).

In my initial proposal, I was actually only thinking of having quantiles for the final score, where we have a pretty reasonable ("normal-ish") distribution.

But yes, I'd then show these quantiles everywhere: the CLI output, the viewer, the badge.
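
For illustration, the combined display could look like the sketch below, assuming a percentile estimate like the sigmoid above; the exact label wording here is just a strawman:

```go
package main

import "fmt"

// formatScore renders one possible combined display, e.g. "7.2/10 (top 5%)".
// topPercent would come from a quantile estimate such as the sigmoid sketch
// above; the label wording is illustrative, not settled Scorecard output.
func formatScore(score, topPercent float64) string {
	return fmt.Sprintf("%.1f/10 (top %.0f%%)", score, topPercent)
}

func main() {
	fmt.Println(formatScore(7.2, 5)) // 7.2/10 (top 5%)
	fmt.Println(formatScore(9.3, 1)) // 9.3/10 (top 1%)
}
```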

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 10 months ago

This issue is stale because it has been open for 60 days with no activity.

raghavkaul commented 7 months ago

The OpenSSF Best Practices badge uses "Passing", "Silver", and "Gold", which is easy to read at a glance; projects must pass all criteria at one level before moving on to the next. A similar scheme for Scorecard might be: pass X probes for Silver, X + Y for Gold, etc. A minimal sketch of what that could look like follows (the thresholds are hypothetical placeholders for X and X + Y, not proposed cutoffs):
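
```go
package main

import "fmt"

// badgeLevel maps the number of probes a project passes to a tier,
// mirroring the Best Practices badge's Passing/Silver/Gold scheme.
// The cutoffs are hypothetical placeholders for the "X probes for
// Silver, X + Y for Gold" idea; real thresholds would need discussion.
func badgeLevel(probesPassed int) string {
	const passing, silver, gold = 5, 15, 25 // hypothetical cutoffs
	switch {
	case probesPassed >= gold:
		return "Gold"
	case probesPassed >= silver:
		return "Silver"
	case probesPassed >= passing:
		return "Passing"
	default:
		return "In progress"
	}
}

func main() {
	for _, n := range []int{3, 10, 18, 30} {
		fmt.Printf("%d probes passed -> %s\n", n, badgeLevel(n))
	}
}
```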

github-actions[bot] commented 5 months ago

This issue has been marked stale because it has been open for 60 days with no activity.