Aliases, and how they are supposed to be used

nscuro commented 1 year ago

Hey OSV team, thanks for your great work!

We're currently looking at how we can correlate vulnerabilities that describe the same thing.

As per specification, OSV has the aliases field for this:

> The aliases field gives a list of IDs of the same vulnerability in other databases, in the form of the id field.

At least in my interpretation, aliasing is a bidirectional relationship that also applies transitively.
If X aliases Y and Z, Y should also alias X, and Y should also alias Z. If they all describe the same thing, that should be a valid assumption.
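
To make that interpretation concrete, here is a minimal sketch of a consistency check. It assumes alias data has already been loaded into a plain dict mapping each ID to the set of IDs it lists; the function name `missing_alias_links` and the placeholder IDs X, Y, Z are hypothetical and not part of OSV.

```python
def missing_alias_links(aliases: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return (a, b) pairs where a should list b as an alias but does not.

    Every entry together with the IDs it lists is treated as one group,
    and each member of the group is expected to list every other member.
    """
    missing = set()
    for entry_id, targets in aliases.items():
        group = targets | {entry_id}
        for a in group:
            for b in group:
                if a != b and b not in aliases.get(a, set()):
                    missing.add((a, b))
    return sorted(missing)


# X lists Y and Z, Y lists only X, Z lists nothing.
example = {"X": {"Y", "Z"}, "Y": {"X"}, "Z": set()}
for a, b in missing_alias_links(example):
    print(f"{a} does not list {b}")
```

An empty result would mean the data is symmetric and transitive in the sense described above.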

However, in reality, we see that many vulnerability databases (ab-)use the OSV schema to publish advisories. In my understanding, a vulnerability describes one defect, and that one defect only, whereas an advisory can refer to multiple vulnerabilities (as in "we patched all these vulnerabilities in version 1.2.3 of our package"). This appears to be common for at least the Go, Rust, and (especially) Debian ecosystems in the OSV database. There are most likely more, but these have been the most obvious candidates to us.

For example, GO-2022-0586 presumably aliases four CVEs and four GHSAs:

These are four different vulnerabilities, with different CWEs, descriptions and severities. CVEs and GHSAs actually alias each other in pairs of two (GHSA-28r2-q6m8-9hpx aliases CVE-2022-30323, but not CVE-2022-26945 etc.):

[Figure: Aliases of GO-2022-0586]

In cases of advisories like this, the "aliases" are neither bidirectional (GHSA-28r2-q6m8-9hpx isn't really the same as GO-2022-0586), nor are they fully transitive (CVE-2022-26945 is not the same as CVE-2022-30323). If one were to attempt to find all aliases for GHSA-28r2-q6m8-9hpx here, traversing this graph would yield wrong results.
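
The traversal problem can be sketched as follows. The alias data is a partial, illustrative subset containing only the IDs named in this issue (the real GO-2022-0586 entry lists more), and `naive_alias_closure` is a hypothetical helper that treats every alias edge as bidirectional and transitive.

```python
from collections import deque

# Partial, illustrative data: only the IDs mentioned in this issue.
ALIASES = {
    "GO-2022-0586": {"CVE-2022-26945", "CVE-2022-30323", "GHSA-28r2-q6m8-9hpx"},
    "GHSA-28r2-q6m8-9hpx": {"CVE-2022-30323"},
    "CVE-2022-30323": {"GHSA-28r2-q6m8-9hpx"},
    "CVE-2022-26945": set(),
}


def naive_alias_closure(start: str, aliases: dict[str, set[str]]) -> set[str]:
    """Collect every ID reachable from start over undirected alias edges."""
    undirected: dict[str, set[str]] = {}
    for src, targets in aliases.items():
        for dst in targets:
            undirected.setdefault(src, set()).add(dst)
            undirected.setdefault(dst, set()).add(src)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in undirected.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}


reached = naive_alias_closure("GHSA-28r2-q6m8-9hpx", ALIASES)
print(sorted(reached))
```

With this data the walk passes through GO-2022-0586 and ends up including CVE-2022-26945, which is a different vulnerability.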

The Debian ecosystem especially has many of these scenarios, where one DLA can refer to loads of CVEs:

I have the feeling that OSV entries of type "advisory" (maybe such a distinction would be good to have?) should instead use the related field. Although I imagine this will be hard to enforce, and even harder to apply in an automated fashion.
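
As a rough illustration of what an automated approach might look like, the sketch below uses a crude heuristic: an entry that lists more than one CVE in `aliases` is assumed to be an advisory, and its aliases are moved to `related`. The heuristic and the helper names are assumptions for illustration only; nothing in the OSV schema defines them, and real data would likely need per-source rules.

```python
def looks_like_advisory(entry: dict) -> bool:
    """Heuristic (assumption): more than one CVE alias suggests an advisory."""
    cves = [a for a in entry.get("aliases", []) if a.startswith("CVE-")]
    return len(cves) > 1


def demote_aliases(entry: dict) -> dict:
    """Return a copy with aliases moved to related if the entry looks like an advisory."""
    if not looks_like_advisory(entry):
        return entry
    updated = dict(entry)
    updated["related"] = sorted(set(entry.get("related", [])) | set(entry["aliases"]))
    updated["aliases"] = []
    return updated


# Partial record using only IDs named in this issue.
record = {
    "id": "GO-2022-0586",
    "aliases": ["CVE-2022-26945", "CVE-2022-30323", "GHSA-28r2-q6m8-9hpx"],
}
print(demote_aliases(record))
```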

Am I understanding aliasing in OSV correctly? Is this a data quality issue with the databases that use the OSV schema? Is there anything we can do about it?

oliverchang commented 1 year ago

Hey @nscuro !

Thanks for the very detailed issue!

Can you explain a bit more about the exact use case you're trying to achieve here with the OSV data? Are you trying to build your own graph representation?

For the OSV schema, we actively avoided trying to make an explicit distinction between a group of vulnerabilities (advisory) vs a single vulnerability to keep things simple. In terms of the end result we want to enable, it's the same -- the ability to identify which package versions are affected and which versions to update to.

How we envision a vulnerability scanner working with our data would be this:

1. Extract a list of packages and versions to query. Say this is just Package "Foo" at Version "1.0.0".
2. Query OSV and get the list of vulnerability entries that say "Foo" at "1.0.0" is vulnerable.
3. Use "aliases"/"related" to group them together for presentation, e.g. in a bug filed.
4. Suggest a fix/resolution such that all the entries in a single group agree.
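
The grouping step can be sketched roughly as below. This is not OSV's own implementation; it assumes the query results are already available as a list of dicts with `id`, `aliases`, and `related` keys, and uses a small union-find to merge linked entries. The `EX-*` IDs and the function name `group_entries` are made up for illustration.

```python
def group_entries(entries: list[dict]) -> list[list[str]]:
    """Group returned entry IDs that are linked through aliases or related."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    for entry in entries:
        find(entry["id"])
        for other in entry.get("aliases", []) + entry.get("related", []):
            union(entry["id"], other)

    # Only IDs that were actually returned by the query end up in a group.
    groups: dict[str, list[str]] = {}
    for entry in entries:
        groups.setdefault(find(entry["id"]), []).append(entry["id"])
    return sorted(sorted(group) for group in groups.values())


results = [
    {"id": "EX-1", "aliases": ["EX-2"]},
    {"id": "EX-2"},
    {"id": "EX-3", "related": ["EX-1"]},
    {"id": "EX-4"},
]
print(group_entries(results))
```

Note that this treats `aliases` and `related` identically, which suits presentation and remediation but does not say which entries describe the same defect.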

Under this workflow, it seems to make sense to group all of the related vulnerabilities together, so users have the full context on what all the vulnerability sources say, and updates/remediation steps account for all relevant entries in the same group. The fact that some of these are "advisories" should not matter -- having them be split up would have the same effect.

If this is an issue of semantics and representation, we can certainly ask our data sources to use related instead in the cases where an entry they're exporting consists of multiple other vulnerabilities from a different source, and this is likely more correct. I think this would be a relatively easy ask for our current sources to adopt.

nscuro commented 1 year ago

> Can you explain a bit more about the exact use case you're trying to achieve here with the OSV data?

Our use case is not primarily about recommending a fixed version to an end user ("updating to version X will resolve all these issues"), it's more about tracking risk, and making it transparent. So knowing which vulnerabilities are the same and which are not does matter to us.

We also have a VEX-like use case, where users (or machines) evaluate whether a project is actually affected by a vulnerability, and record their decision. Obviously we want to avoid redundant work being done: a decision should not have to be recorded for GHSA-28r2-q6m8-9hpx and CVE-2022-30323 separately, as they describe the same thing. On the other hand, we don't want the same decision being applied to different vulnerabilities (CVE-2022-30323 vs. CVE-2022-26945), because the exposure, attack vector, impact etc. may differ.

Approaching this use case the other way around, if a vendor provided a VEX document stating that their product is not affected by CVE-2022-30323, this should also be applicable to actual aliases like GHSA-28r2-q6m8-9hpx, but not CVE-2022-26945.
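
A minimal sketch of that propagation rule follows, assuming an ID counts as an "actual alias" only when both entries list each other. That mutual-listing test is an assumption made for illustration, as are the helper names, the status string, and the partial alias data (limited to IDs named in this issue).

```python
# Partial, illustrative data: only the IDs mentioned in this issue.
ALIASES = {
    "GO-2022-0586": {"CVE-2022-26945", "CVE-2022-30323", "GHSA-28r2-q6m8-9hpx"},
    "GHSA-28r2-q6m8-9hpx": {"CVE-2022-30323"},
    "CVE-2022-30323": {"GHSA-28r2-q6m8-9hpx"},
    "CVE-2022-26945": set(),
}


def true_aliases(vuln_id: str, aliases: dict[str, set[str]]) -> set[str]:
    """IDs that vuln_id lists as aliases and that list vuln_id back."""
    return {
        other
        for other in aliases.get(vuln_id, set())
        if vuln_id in aliases.get(other, set())
    }


def apply_decision(vuln_id: str, decision: str, aliases: dict[str, set[str]]) -> dict[str, str]:
    """Record one decision for vuln_id and its mutual aliases only."""
    return {vid: decision for vid in {vuln_id} | true_aliases(vuln_id, aliases)}


decisions = apply_decision("CVE-2022-30323", "not_affected", ALIASES)
print(sorted(decisions))
```

Here the decision reaches GHSA-28r2-q6m8-9hpx but neither GO-2022-0586 nor CVE-2022-26945, because those links are one-directional or absent in the sample data.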

> If this is an issue of semantics and representation, we can certainly ask our data sources to use related instead in the cases where an entry they're exporting consists of multiple other vulnerabilities from a different source, and this is likely more correct.

That'd be great!

oliverchang commented 1 year ago

> > Can you explain a bit more about the exact use case you're trying to achieve here with the OSV data?

> Our use case is not primarily about recommending a fixed version to an end user ("updating to version X will resolve all these issues"), it's more about tracking risk, and making it transparent. So knowing which vulnerabilities are the same and which are not does matter to us.

> We also have a VEX-like use case, where users (or machines) evaluate whether a project is actually affected by a vulnerability, and record their decision. Obviously we want to avoid redundant work being done: a decision should not have to be recorded for GHSA-28r2-q6m8-9hpx and CVE-2022-30323 separately, as they describe the same thing. On the other hand, we don't want the same decision being applied to different vulnerabilities (CVE-2022-30323 vs. CVE-2022-26945), because the exposure, attack vector, impact etc. may differ.

Got it, thanks for explaining! Are you thinking of recording VEX on a per-package basis, such that users can transitively determine from the entire dependency graph if they're actually indirectly affected by a vulnerability?

> Approaching this use case the other way around, if a vendor provided a VEX document stating that their product is not affected by CVE-2022-30323, this should also be applicable to actual aliases like GHSA-28r2-q6m8-9hpx, but not CVE-2022-26945.

> > If this is an issue of semantics and representation, we can certainly ask our data sources to use related instead in the cases where an entry they're exporting consists of multiple other vulnerabilities from a different source, and this is likely more correct.

> That'd be great!

We'll start conversations here with Go, and fix up the Debian ones.

github-actions[bot] commented 2 months ago

This issue has not had any activity for 60 days and will be automatically closed in two weeks

nscuro commented 2 months ago

Commenting to signal that this issue is still relevant.

I am enlightened however to see there is a continuous effort to improve the situation :)

oliverchang commented 2 months ago

> Commenting to signal that this issue is still relevant.

> I am enlightened however to see there is a continuous effort to improve the situation :)

Thanks! Removed the stale tags.

google / osv.dev

Aliases, and how they are supposed to be used #888