Closed: terriko closed this issue 6 months ago.
I'll also note that we talked a bunch about this issue with respect to the gsoc project we'd hoped to have on metadata:
This would be a lesser form of what we envisioned there, just laying the groundwork to allow for adding manual metadata rather than trying to make use of other data sources.
(If GSoC happens again next year that project idea may be offered again; we didn't have enough mentors available to take on someone to do it this year.)
Per discussion in #3193 and in today's team meeting:
PURL apparently has the ability to let us indicate that something came from python to help us de-dupe. @anthonyharrison is looking into parsing PURL data out of SBOMs, but that won't directly solve this problem because these scan results are from our language parsers and not from an SBOM scan.
But if we're going to support PURL well, it seems like we're going to want a de-dupe table that would look something like this:

- `purl` (which includes the language parser name and the product name)
- `{vendor, product}` mappings

And that looks pretty similar to what we had above, only we had `{product, language}` columns. I think maybe we could convert those to PURL internally right now?
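For illustration, converting an existing `{product, language}` pair into a version-less purl string might look like the sketch below. The language-to-purl-type mapping is my assumption, following the known type names in purl-spec (`pypi`, `cargo`, `npm`); the function name is hypothetical.

```python
# Hypothetical sketch: convert our {product, language} pairs into purl strings.
# The language -> purl-type names below follow purl-spec's known types list;
# anything not in the map falls back to the language name itself.
LANGUAGE_TO_PURL_TYPE = {
    "python": "pypi",
    "rust": "cargo",
    "javascript": "npm",
}


def product_to_purl(product: str, language: str) -> str:
    """Build a version-less purl like 'pkg:pypi/docutils'."""
    purl_type = LANGUAGE_TO_PURL_TYPE.get(language, language)
    return f"pkg:{purl_type}/{product}"


print(product_to_purl("docutils", "python"))  # pkg:pypi/docutils
```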
So I think if we make a table that could handle both (I'm going to call it `explicit_product_mapping` for now), it would look something like this...
```sql
CREATE TABLE IF NOT EXISTS explicit_product_mapping (
    purl TEXT,
    valid_cpe_list TEXT,
    invalid_cpe_list TEXT,
    PRIMARY KEY (purl)
)
```
Note that I made the cpe mappings into lists, because that allows us to use PURL as an index for easier lookups. I could likely be convinced that there's something better we could be doing.
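As a quick sketch of how that table might behave, here's the schema exercised with Python's `sqlite3`, storing the CPE lists as JSON-encoded TEXT so `purl` can stay the primary key. The JSON encoding is one assumption among several possible; only the table and column names come from the proposal above.

```python
import json
import sqlite3

# Sketch only: exercise the proposed explicit_product_mapping schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS explicit_product_mapping (
        purl TEXT,
        valid_cpe_list TEXT,
        invalid_cpe_list TEXT,
        PRIMARY KEY (purl)
    )"""
)

# Store the CPE lists as JSON text (an assumption; any encoding would do).
conn.execute(
    "INSERT INTO explicit_product_mapping VALUES (?, ?, ?)",
    (
        "pkg:pypi/docutils",
        json.dumps([]),
        json.dumps(["cpe:2.3:a:nim-lang:docutils"]),
    ),
)

# Look up by purl, which the primary key indexes for us.
row = conn.execute(
    "SELECT invalid_cpe_list FROM explicit_product_mapping WHERE purl = ?",
    ("pkg:pypi/docutils",),
).fetchone()
print(json.loads(row[0]))  # ['cpe:2.3:a:nim-lang:docutils']
```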
Looking at https://github.com/package-url/purl-spec right now, our python de-dupe for the existing open bugs would look like...

- purl: `pkg:pypi/docutils` (not sure if that should be `pypi` or `python`, but we'd look it up)
- valid_cpe_list: `{}`
- invalid_cpe_list: `{"cpe:2.3:a:nim-lang:docutils"}`
Note that I'm truncating all the version data for now because I don't feel like typing out extra stuff. We might want to use `*` instead of truncating in practice. (Adding `*` might make it easier to parse and make the things we use more correct and extensible.)
So when we then go scan something named docutils from a requirements.txt file, it will drop all the nim-lang
results. For now, we probably want to err on the side of matching, so if someone adds a new docutils we should display it until it's added to the de-dupe list. That might not be the case if we actually do know what the valid CPE should be; we can tweak that if we start getting some of those in future.
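The "err on the side of matching" behaviour above could sketch out like this: results whose CPE is on the invalid list for a purl get dropped, while anything unknown (including a brand-new docutils) is kept and shown to the user. The function name and in-memory table shape here are hypothetical, not the real cve-bin-tool API.

```python
# Hypothetical sketch of the de-dupe filter described above.
def filter_results(purl, results, dedupe_table):
    """Drop known-invalid CPEs for this purl; keep everything unknown."""
    entry = dedupe_table.get(purl)
    if entry is None:
        # purl not in the de-dupe table at all: err on the side of matching.
        return results
    invalid = set(entry.get("invalid_cpe_list", []))
    return [cpe for cpe in results if cpe not in invalid]


table = {
    "pkg:pypi/docutils": {
        "valid_cpe_list": [],
        "invalid_cpe_list": ["cpe:2.3:a:nim-lang:docutils"],
    }
}
results = ["cpe:2.3:a:nim-lang:docutils", "cpe:2.3:a:newvendor:docutils"]
print(filter_results("pkg:pypi/docutils", results, table))
# ['cpe:2.3:a:newvendor:docutils']  (the new vendor is shown until triaged)
```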
Then the other side of this is how we want to store, update, and load the data into the database where we'll actually use it.
Adding some notes on my current architecture:
- `purl2cpe` won't solve our false positive problem completely, because the problem often occurs with things that just don't have a purl2cpe mapping (and we'd likely fall back on our current heuristic then). So we need a purl2NotCPE mapping, and we should probably maintain it ourselves here.
- `purl2cpe` uses yml files in their source tree to solve the "humans need to be able to do pull requests against this" problem, and having played around with some json schemas, I think I agree that .yml is a little more human-friendly.

So my current thinking of how I'm going to break this up:
Number 2 is still kind of a doozy, so I might need to break it down some more.
I think at this point this has been replaced by the planned work in #3771 so I'll close this as a duplicate.
I've now hit two cases where `find_vendor` is finding a product with the same name but different version numbers.
I think it's time for us to build in some de-duplication in cases like this where we're clearly generating false positives for folk.
Since these are currently all coming from the language parsers, I think the logical place to start is in extending the language parsers' `find_version()` function, found here: https://github.com/intel/cve-bin-tool/blob/main/cve_bin_tool/parsers/__init__.py
Right now, it uses cvedb's `get_vendor_product_pairs` to search for a product name and return all matches. That works pretty well in a lot of cases, and people can always mark the ones that aren't correct as false positives using triage. But that's a pain, and in both these cases we know that we're not finding the right thing, because we know we're looking for a python package. So it would be really nice if we could have `find_version()` say "look, here's a list of known duplicate product names, let's discard them before the user even sees them."

I'm imagining a file per language, so you'd have a set of files like `python-dedupe.json` and `rust-dedupe.json`, each with different entries. (I don't love those filenames, but something that included the language and was in a human-editable format would be good. JSON is probably the best balance of human-editable and machine-readable for our user base.) An entry is going to need the following data:
- a NOT list: matches that should be discarded when `get_vendor_product_pairs` finds them, as in the issues linked at the top of this post
- an ARE list: the known-correct `{vendor, product}` pairs associated with them

Presumably we'd have some entries with only NOT lists and some would have only ARE lists, so you wouldn't require both.
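A per-language file could then look something like the sketch below. This is only an illustration: the `"not"`/`"are"` field names and the product-keyed layout are my assumptions, not a settled format.

```python
import json

# Hypothetical shape for a python-dedupe.json entry: a "not" list of
# {vendor, product} pairs to discard and an "are" list of known-good ones.
# Either list may be omitted for a given product.
python_dedupe = json.loads(
    """
    {
        "docutils": {
            "not": [{"vendor": "nim-lang", "product": "docutils"}],
            "are": []
        }
    }
    """
)

entry = python_dedupe["docutils"]
# Build a set of (vendor, product) pairs that should be dropped.
bad = {(p["vendor"], p["product"]) for p in entry.get("not", [])}
print(("nim-lang", "docutils") in bad)  # True
```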
You'd load this structure into somewhere (right into the db for easy lookup? I don't think we want to load/parse on every `find_version()` call) and use it to streamline what `find_version()` then returns.

Thoughts? Better ideas? I'm going to tag @XDRAGON2002 specifically since he laid the groundwork for our current parser API, but everyone's thoughts are welcome.