Add EPSS as type to severity

kurtseifried commented 1 year ago

I'd like to add EPSS (https://www.first.org/epss/) to the severity field, which is a form of severity (how likely it is to be exploited).

One wrinkle: EPSS scores include:

epss: the EPSS score representing the probability [0-1] of exploitation in the wild in the next 30 days (following score publication)
percentile: the percentile of the current score, the proportion of all scored vulnerabilities with the same or a lower EPSS score

The EPSS score should be included, and I think the percentile should be included too: like an Olympic score, if everything is 9.x then 9.9 and 9.8 are vastly different. So the format would be:

type: EPSS (it doesn't have a version currently AFAIK, but it might in future, so no version is specified for now)
score field: EPSS/0.00043/0.06996

That is, the EPSS score followed by the percentile where that specific result currently lies.
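A minimal sketch of how a tool might build and read the proposed score string. The `EpssScore`, `format_epss` and `parse_epss` names are hypothetical, and rendering both values to five decimal places is an assumption taken from the example above rather than anything the schema defines.

```python
from typing import NamedTuple


class EpssScore(NamedTuple):
    probability: float  # EPSS probability of exploitation, in [0, 1]
    percentile: float   # share of scored CVEs at or below this probability


def format_epss(score: EpssScore) -> str:
    """Render the proposed 'EPSS/<probability>/<percentile>' string."""
    # Five decimal places is an assumption based on the example in the proposal.
    return f"EPSS/{score.probability:.5f}/{score.percentile:.5f}"


def parse_epss(text: str) -> EpssScore:
    """Parse the proposed score string, rejecting malformed or out-of-range values."""
    prefix, probability, percentile = text.split("/")  # ValueError on wrong field count
    if prefix != "EPSS":
        raise ValueError(f"not an EPSS score string: {text!r}")
    score = EpssScore(float(probability), float(percentile))
    if not (0.0 <= score.probability <= 1.0 and 0.0 <= score.percentile <= 1.0):
        raise ValueError(f"values out of range: {text!r}")
    return score


# A severity entry as it might appear under the proposal.
severity_entry = {"type": "EPSS", "score": format_epss(EpssScore(0.00043, 0.06996))}
```

Keeping both values in one string mirrors the proposal; a consumer that only wants the probability can take the first field of the parsed result.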

This change is simple and I've submitted a PR.

kurtseifried commented 1 year ago

PR: https://github.com/ossf/osv-schema/pull/145

oliverchang commented 1 year ago

Do we have examples of users who are producing EPSS today?

jbmaillet commented 1 year ago

I'd like to add EPSS (https://www.first.org/epss/) to the severity field, which is a form of severity (how likely it is to be exploited). [...] This change is simple and I've submitted a PR.

A step further, I would suggest, like first.org does, to also compute and use the localized percentile, for the subset of vulnerabilities considered. Quoting https://www.first.org/epss/articles/prob_percentile_bins:

Another consideration when working with percentiles is that they are based on every published CVE, and it is unlikely that any organization is dealing with every CVE. Therefore, percentile values may change for a given subset of vulnerabilities. For example, when a user considers only those vulnerabilities relevant to her network environment, the percentile values will change -- because the sample of total vulnerabilities will change. The EPSS probability will not change, but the relative position (ranking) of one vulnerability to another will very likely change.

This is more complicated, since it requires grabbing the global EPSS/percentile first, and then, from the locally relevant EPSS scores, recomputing local percentiles per project / use case. Maybe this does not have its place in the schema since it would be computed by a tool, but in the end this information should have its place in an OSV document.
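As a rough illustration of that recomputation, here is a sketch that takes the EPSS probabilities of only the locally relevant CVEs and ranks them against each other. It assumes the same "at or below" definition of percentile quoted earlier in the thread; the function name and the placeholder IDs are hypothetical.

```python
from bisect import bisect_right


def local_percentiles(epss_by_cve: dict[str, float]) -> dict[str, float]:
    """For each CVE, the share of the given CVEs whose EPSS probability is <= its own."""
    ordered = sorted(epss_by_cve.values())
    total = len(ordered)
    return {
        # bisect_right counts how many probabilities are at or below this one.
        cve: bisect_right(ordered, probability) / total
        for cve, probability in epss_by_cve.items()
    }


# Placeholder IDs and made-up probabilities for a four-vulnerability subset.
subset = {"CVE-A": 0.00043, "CVE-B": 0.05, "CVE-C": 0.9, "CVE-D": 0.05}
print(local_percentiles(subset))
```

The probabilities themselves are untouched; only the relative ranking is recalculated for the subset, which matches the point made in the FIRST article.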

kurtseifried commented 1 year ago

Yes FIRST is currently producing data:

https://www.first.org/epss/data_stats

which I would like to include in the machine readable data provided by GSD. We're also looking at EPSS for non CVE data.

jbmaillet commented 1 year ago

We're also looking at EPSS for non CVE data.

I am very interested to see how it goes. Unless I misunderstood the basis of EPSS, and/or there has been a fundamental change in the recently released version 3 of the model (without many details about the changes):

EPSS is open data, but not open source. This may be its main drawback. The data are published by first.org, but produced by the Cyentia Institute (https://www.first.org/epss/faq).
The machine learning model that EPSS is constructed on is mostly, if not only, based on CVE data and CVE correlated sources (https://www.first.org/epss/model, "Data Architecture and Sources").

Thus EPSS for non CVE data would be both an exciting and unexpected development! Do you have any insights and/or shareable info?

kurtseifried commented 1 year ago

One comment: Chicken and Egg. Why do people use CVE? It exists. Why don't they use X? Apart from GSD/OSV efforts there isn't another source. I suspect once we support EPSS and have it in all the CVE data, for example, people may begin to ask a) can you do this for other public data (like GSD), and b) can we do it, e.g. open up the model, and/or c) let's make an open model and tweak it and see if we can do this...

So a good step forwards would be having EPSS available and machine readable in OSV.

andrewpollock commented 1 year ago

Related, as I work on adding the CVE CVSS data from records I'm converting to OSV for https://github.com/google/osv.dev/issues/783, I've wondered how many native OSV records are including this (I haven't done any research, just mentally flagged that I'd like to)

marco-silva0000 commented 3 months ago

epss : the EPSS score representing the probability [0-1] of exploitation in the wild in the next 30 days (following score publication)

Would this not mean it changes every day? Or if not, is there somewhere that reports how often it usually changes?

andrewpollock commented 3 months ago

/cc @jayjacobs

jayjacobs commented 3 months ago

(one of the creators and co-chair of EPSS SIG here)

Non-CVE: EPSS is 100% data driven and trained on vulnerabilities with a CVE. While we are always looking for ways to score non-CVE vulnerabilities it is nearly impossible to uniquely identify and correlate non-CVE vulnerabilities across multiple data sources. So scoring non-CVE (at least with an EPSS-like approach) is not happening soon.
Percentiles: the EPSS percentile is the proportion of CVEs at or below the current CVE. We added that because EPSS scores are skewed and generally have a low probability, so while a 5% probability of exploitation in the next 30 days seems low, 93% of CVEs are scored lower (a CVE scored at 5% is in the top 7% of scored CVEs). It was a way to convert the distribution of EPSS scores in a more uniform distribution for easier ranking and reference (probabilities are not intuitive).
open data/model: most of the data behind EPSS is pulled from open sources; the exploitation activity itself is generally from commercial sources and is not public. The model is "closed", but that is mainly because EPSS has been volunteer driven, and opening a complex model like this would require a lot of up-front work to describe/document all of the data preparation and would most likely generate a lot of questions and support requests. It's also using xgboost, which has all sorts of complexities, and there isn't a typical "this feature has a weight of X" that one may expect. Each feature ends up with different weights depending on the presence of other features. The entire algorithm is trained on the daily exploitation activity EPSS has been collecting, and nothing in the model is decided by people.
Score Changes: The EPSS score does have the ability to change daily. I looked at this about a year ago, and at that time about 1,000 CVEs would have their EPSS score shift day to day. But since the percentile is calculated against all other CVEs, the majority of the percentiles would shift daily (~70% of CVEs shifted slightly).
EPSS Version: The EPSS model does have a version, and at the top of the CSV we put the date the model was trained as a version (we hope to be updating the model more frequently starting later this fall). Currently we talk about EPSS version 3 being the current version, but the top of the CSV says "v2023.03.01".
Data ingestion: EPSS scores are published once daily, generally by 1500GMT (depends on data processing time, this will be getting better "soon"), I would suggest pulling the CSV by date explicitly to know which day you are grabbing (there is a default URL that will redirect to the latest CSV file). This is discussed at the top of https://www.first.org/epss/data_stats. There is an API, but that is generally for grabbing individual CVE records and it generates a lot of traffic to check every CVE daily.

Did I miss anything? Happy to answer any other questions and apologies for the wall of text.

Quick edit: One of the concerns about EPSS is that it is volunteer driven and at least in theory could disappear. I am working on some EPSS things on the backend and some changes are coming, but I am 100% committed to keeping EPSS exactly what it is currently - EPSS scores are freely available and open for commercial use. I expect it will only be getting better and more reliable (and hopefully funded) over the next few years.

andrewpollock commented 2 months ago

@jayjacobs thanks, the wall of text seemed like a pretty nice summary to me...

So is there any value in OSV records having an EPSS severity type?

It seems to me that in order to have an EPSS score for an OSV record, first EPSS would have to support calculating them for OSV records?

I could see a scenario where OSV.dev could, for CVEs converted from the NVD and vulnerability records otherwise aliasing a CVE ID, incorporate EPSS data as published?

jayjacobs commented 2 months ago

I think the value of adding EPSS into the OSV record is that it removes a second step for the consumer. They wouldn't have to go look it up on their own by hitting the EPSS API or downloading the CSV. It's like adding CISA KEV information for convenience, IMO.

EPSS automatically scores all of the CVEs every day and it only works for vulnerabilities with CVEs since that's how disparate data sources are aggregated (currently). It isn't set up to calculate them on non-CVE data since it's nearly impossible to correlate other data sources for non-CVE data, and it's run off of a rather specific set of features that would be difficult to accurately duplicate outside of the existing data collection efforts.

If you wanted to incorporate EPSS scores, I would suggest the following generic set of steps:

Using today's date, check whether a file has been published (around 2pm GMT daily; this varies with the amount of data that needs processing, errors, etc.) using this URL: https://epss.cyentia.com/epss_scores-YYYY-mm-dd.csv.gz. You will get a 404 until the file publishes, and it will prevent you from getting the same data twice. Or you could hit the URL https://epss.cyentia.com/epss_scores-current.csv.gz, which will return a 302 to the current file, and just make sure it's a date (in the filename) you should process.
The CSV has three columns, the CVE ID, the EPSS probability and the percentile ranking (percent of CVEs with this or lower probability assigned). Most implementations will drop that third column and just show the probability. We store it to 5 decimal places, but most will round to 3 places and show as a percent (e.g. 0.57292 becomes 57.3%)
Update the OSV data locally with the new data.
I would suggest that you also include a date the EPSS score was pulled in case it goes stale or something, but that's optional and I haven't seen it in practice yet.
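The steps above can be sketched roughly as follows. This is not an official client: the function names are hypothetical, the download itself is left out (the text passed to `parse_scores` is assumed to be the already decompressed CSV), and treating lines that start with `#` as metadata and non-numeric rows as a header is an assumption about the file layout.

```python
import csv
import datetime as dt
import io

# URL pattern quoted in the steps above.
URL_TEMPLATE = "https://epss.cyentia.com/epss_scores-{date}.csv.gz"


def dated_url(day: dt.date) -> str:
    """Build the per-day URL; per the steps above, expect a 404 until the file is published."""
    return URL_TEMPLATE.format(date=day.isoformat())


def parse_scores(csv_text: str) -> dict[str, tuple[float, float]]:
    """Map CVE ID -> (probability, percentile) from decompressed CSV text."""
    scores: dict[str, tuple[float, float]] = {}
    # Assumption: lines starting with '#' carry metadata such as the model version.
    data_lines = (line for line in io.StringIO(csv_text) if not line.startswith("#"))
    for row in csv.reader(data_lines):
        if len(row) != 3:
            continue  # not a CVE/probability/percentile row
        cve, probability, percentile = row
        try:
            scores[cve] = (float(probability), float(percentile))
        except ValueError:
            continue  # assumption: a non-numeric row is the column header
    return scores


def as_percent(probability: float) -> str:
    """Round for display the way the comment describes, e.g. 0.57292 -> '57.3%'."""
    return f"{probability * 100:.1f}%"
```

Requesting the dated URL rather than the redirecting one makes it explicit which day's data is being processed, which also gives a natural value for the optional "date pulled" field.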

I think that'd be it. Like I said in the previous post, most scores do not change day to day, so you could add some logic not to modify the score if it didn't change.
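One possible shape for that "only touch it if it changed" logic, as a sketch. The `apply_epss` helper is hypothetical, the score string follows the format proposed at the top of this issue, and the `epss_date` key is an invented name for the optional pull date; it is not part of the OSV schema.

```python
def apply_epss(record: dict, score: str, pulled_on: str) -> bool:
    """Set or replace the EPSS severity entry; return True only if the record changed."""
    severity = record.setdefault("severity", [])
    for entry in severity:
        if entry.get("type") == "EPSS":
            if entry.get("score") == score:
                return False  # unchanged: leave the record (and its modified date) alone
            entry["score"] = score
            entry["epss_date"] = pulled_on  # hypothetical field, not in the schema
            return True
    # No EPSS entry yet, so add one.
    severity.append({"type": "EPSS", "score": score, "epss_date": pulled_on})
    return True
```

A caller would bump the record's modified timestamp only when this returns True, which limits churn to the records whose score actually moved.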

Hope that's helpful.

oliverchang commented 2 months ago

We had a discussion with some GitHub folks (@darakian @taladrane) earlier, and we came to the conclusion that it may not make sense to include EPSS in OSV.

This is because EPSS is keyed on CVE and produced by a single entity, while OSV takes a more federated approach and enables database owners to publish their own interpretation of vulnerabilities (which may or may not link back to a CVE). There is also going to be overlap between different databases for the same CVE (e.g. a Linux distro DB vs a language package DB), which adds to potential confusion here and to potentially mismatching EPSS scores between different sources.

A more minor point is that it may introduce a bit of churn for OSV records (with the modified date changing frequently).

It seems like the best way for users to consume EPSS while using OSV is to look up the relevant aliased CVE against the source of truth (https://www.first.org/epss/data_stats)? This does still have that second step for consumers, but I wonder if we can make it easier via an aggregator like https://osv.dev somehow.
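On the consumer side, that lookup could be as small as the sketch below. It relies only on the `id` and `aliases` fields of an OSV record; the function name is hypothetical, and `epss_table` is assumed to be a mapping from CVE ID to (probability, percentile) that the consumer has already loaded from the published EPSS data.

```python
def epss_for_record(
    record: dict, epss_table: dict[str, tuple[float, float]]
) -> dict[str, tuple[float, float]]:
    """Return EPSS (probability, percentile) for every CVE ID the record is or aliases."""
    candidates = [record.get("id", ""), *record.get("aliases", [])]
    return {
        cve: epss_table[cve]
        for cve in candidates
        if cve.startswith("CVE-") and cve in epss_table
    }


# Made-up record and table, for illustration only.
record = {"id": "EXAMPLE-0001", "aliases": ["CVE-0000-0001", "OTHER-42"]}
epss_table = {"CVE-0000-0001": (0.00043, 0.06996)}
print(epss_for_record(record, epss_table))
```

Because the result is keyed by CVE ID, a record that aliases more than one CVE keeps each score separate instead of collapsing them into a single value.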

andrewpollock commented 2 months ago

@jayjacobs do you have any record churn statistics you can share?

I guess we can decouple support in the schema for the severity type from OSV.dev doing anything with EPSS (either at import time or aggregation time) and that would still allow downstream consumers to merge the values themselves if they wanted to, @oliverchang ?

ossf / osv-schema

Add EPSS as type to severity #144