kurtseifried closed this issue 2 years ago
Here's part of the problem, according to https://ossf.github.io/osv-schema/#id-modified-fields
The id field is a unique identifier for the vulnerability entry.
Obviously, bare integers are not unique across databases.
I think this should be the role of the aggregator (osv.dev, GSD etc) to disambiguate/link entries by aliases.
e.g. say I have a vulnerability entry:
id: BLAH-123
aliases: [CVE-2022-123, GHSA-foo]
The aggregator should be able to find all other instances of CVE-2022-123 and GHSA-foo. These may not all exist at the time the BLAH-123 entry is created, i.e. at some point in the future, some other database may publish info about CVE-2022-123.
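To make the linking idea concrete, here is a minimal sketch of how an aggregator might group entries that share an id or alias, including entries that show up later. This is an illustration only, not how osv.dev or GSD actually implement it; `group_by_alias`, `OTHER-9`, and `UNRELATED-1` are hypothetical names made up for the example.

```python
from collections import defaultdict


def group_by_alias(entries):
    """Group identifiers that are linked, directly or transitively, by aliases."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for entry in entries:
        find(entry["id"])  # register entries that have no aliases at all
        for alias in entry.get("aliases", []):
            union(entry["id"], alias)

    groups = defaultdict(set)
    for ident in list(parent):
        groups[find(ident)].add(ident)
    return sorted(sorted(group) for group in groups.values())


entries = [
    {"id": "BLAH-123", "aliases": ["CVE-2022-123", "GHSA-foo"]},
    # Hypothetical entry published later by some other database.
    {"id": "OTHER-9", "aliases": ["CVE-2022-123"]},
    # Hypothetical entry with no aliases.
    {"id": "UNRELATED-1", "aliases": []},
]
linked = group_by_alias(entries)
print(linked)
```

Note that this only works if alias strings are globally unique: two unrelated entries that both list the same bare number would be merged into one group.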
Note that the spec acknowledges that some databases may re-use existing IDs as their primary identifier:
In addition to those prefixes, other databases may serve information about non-database-specific prefixes. For example a language ecosystem might decide to use CVE identifiers to index its database rather than a custom prefix. The known databases operating without custom identifier prefixes are:
Why are we making the aggregators guess instead of being explicit, especially when the alias is just an integer? It doesn't appear to come from a public-facing system either: I cannot reliably find these integers using Google search.
A second clarifying question: can aliases be published in public OSV entries when the alias is internal data that isn't searchable or viewable by anyone? If so, we risk a real mess in aliases.
I'm not sure I would call it guessing -- an identifier can be "resolved" dynamically by an aggregator, which should know about every single OSV-compatible database out there and generate up-to-date links to them.
Also, I think the reality is only CVEs are overloaded right now -- most non-CVE identifiers have a clear place they come from. The OSV schema also attempts to list known prefixes and their databases per https://ossf.github.io/osv-schema/#id-modified-fields.
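As a rough sketch of what "resolving" by prefix could look like: the table below is illustrative only and is not copied from the schema's list; it uses the prefixes that come up in this thread, and the descriptions are my own labels. `KNOWN_PREFIXES` and `source_of` are hypothetical names.

```python
# Illustrative prefix table (an assumption, not the schema's official list).
KNOWN_PREFIXES = {
    "CVE": "CVE list",
    "GHSA": "GitHub Security Advisories",
    "GSD": "Global Security Database",
    "RHSA": "Red Hat Security Advisories",
    "ASB-A": "Android Security Bulletins",
}


def source_of(identifier):
    """Return the database a prefixed identifier comes from, or None if unknown."""
    # Try longer prefixes first so multi-part prefixes such as "ASB-A" match whole.
    for prefix in sorted(KNOWN_PREFIXES, key=len, reverse=True):
        if identifier.startswith(prefix + "-"):
            return KNOWN_PREFIXES[prefix]
    return None


resolved = {
    ident: source_of(ident)
    for ident in ["ASB-A-202025316", "CVE-2021-30318", "2799342"]
}
print(resolved)
```

An identifier without a recognizable prefix resolves to nothing, which is the case the bare-integer aliases fall into.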
A second clarifying question: can aliases be published in public OSV entries when the alias is internal data that isn't searchable or viewable by anyone? If so, we risk a real mess in aliases.
While we can't stop this, it is certainly not the intended use case for aliases.
OK, so for example, take the file ASB-A-202025316.json from osv-data:
{"id": "ASB-A-202025316", "aliases": [ "CVE-2021-30318", "2799342", "3042414"],
What are "2799342" and "3042414"? They're not the "references", which are in the form A-204905255 and QC-CR#2998013, for example (and the numbers don't match up).
Those are clearly wrong :) I'll ask the folks generating this data about it.
ok thanks
Speaking of which, in the OSV data there are a lot of aliases that are just a number:
["CVE-2020-3671","2576091"]
which makes it hard to track or to know where it came from. Unless an alias is unique (so not just a number) and identifiable, e.g. CVE-, GSD-, RHSA-, it's basically just confusing. To quote the docs:
The aliases field gives a list of IDs of the same vulnerability in other databases, in the form of the id field. This allows one database to claim that its own entry describes the same vulnerability as one or more entries in other databases. Or if one database entry has been deduplicated into another in the same database, the duplicate entry could be written using only the id, modified, and aliases field, to point to the canonical one.
but there's no data about which database that alias belongs to...
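For what it's worth, a check for this could be very small. The sketch below flags aliases that are nothing but digits, using the two alias lists quoted in this thread; `bare_number_aliases` is a hypothetical helper, not something the OSV tooling provides as far as I know.

```python
def bare_number_aliases(aliases):
    """Return the aliases that are only digits and so carry no database prefix."""
    return [alias for alias in aliases if alias.isdigit()]


# Alias lists quoted above in this thread.
flagged_asb = bare_number_aliases(["CVE-2021-30318", "2799342", "3042414"])
flagged_other = bare_number_aliases(["CVE-2020-3671", "2576091"])
print(flagged_asb, flagged_other)
```

A check like this could run when data is generated or imported, so that unattributable aliases get caught before they are published.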
I feel like I should open another ticket to discuss this, as a pile of random numbers that end up overlapping between different entries is going to be problematic.
Moved here as per https://github.com/cloudsecurityalliance/gsd-database/issues/2389