intel / cve-bin-tool

The CVE Binary Tool helps you determine if your system includes known vulnerabilities. You can scan binaries for over 200 common, vulnerable components (openssl, libpng, libxml2, expat and others), or if you know the components used, you can get a list of known vulnerabilities associated with an SBOM or a list of components and versions.
https://cve-bin-tool.readthedocs.io/en/latest/
GNU General Public License v3.0

feat: Improve matching for language parsers (avoid name collisions, use purl) #3180

Closed by terriko 6 months ago

terriko commented 1 year ago

I've now hit two cases where find_vendor is finding a product with the same name but different version numbers:

I think it's time for us to build in some de-duplication in cases like this where we're clearly generating false positives for folks.

Since these are currently all coming from the language parsers, I think the logical place to start is in extending the language parsers' find_version() function, found here:

https://github.com/intel/cve-bin-tool/blob/main/cve_bin_tool/parsers/__init__.py

Right now, it uses cvedb's get_vendor_product_pairs to search for a product name and return all matches. That works pretty well in a lot of cases, and people can always mark the ones that aren't correct as false positives using triage. But that's a pain, and in both these cases we know we're not finding the right thing because we know we're looking for a python package. So it would be really nice if we could have find_version() say "look, here's a list of known duplicate product names, let's discard them before the user even sees them."
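
To make that concrete, here's a minimal sketch of the kind of filtering I mean. KNOWN_DUPLICATES is a hypothetical hand-maintained structure (the docutils/nim-lang collision discussed below is one example); none of this is real code from the repo:

# Hypothetical sketch: drop (vendor, product) pairs we already know are
# name collisions before the user ever sees them.
KNOWN_DUPLICATES = {
    # product name -> vendors that are NOT the python package of that name
    "docutils": {"nim-lang"},
}

def filter_vendor_product_pairs(product, pairs):
    """pairs: (vendor, product) tuples from get_vendor_product_pairs."""
    not_vendors = KNOWN_DUPLICATES.get(product, set())
    return [(vendor, prod) for (vendor, prod) in pairs if vendor not in not_vendors]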

I'm imagining a file per language, so you'd have a set of files like python-dedupe.json and rust-dedupe.json, each with different entries. (I don't love those filenames, but something that included the language and was in a human-editable format would be good. JSON is probably the best balance of human-editable and machine-readable for our user base.)

An entry is going to need the following data:

  1. the product name as the language parser reports it
  2. the language (or parser) it came from
  3. a NOT list: {vendor, product} pairs we know are name collisions and should be discarded
  4. an ARE list: {vendor, product} pairs we know are the correct match

Presumably we'd have some entries with only NOT lists and some would have only ARE lists, so you wouldn't require both.
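
For example, one entry in a hypothetical python-dedupe.json could look like this, written out as the equivalent Python structure (the field names are placeholders, just to show the shape):

# Hypothetical python-dedupe.json entry, shown as the equivalent Python
# structure; the "not"/"are" field names are placeholders.
entry = {
    "product": "docutils",  # name as the python parser sees it
    "not": [
        # {vendor, product} pairs to discard as known collisions
        {"vendor": "nim-lang", "product": "docutils"},
    ],
    "are": [
        # {vendor, product} pairs known to be correct (optional)
    ],
}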

You'd load this structure somewhere (right into the db for easy lookup? I don't think we want to load/parse on every find_version() call) and use it to streamline what find_version() then returns.
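
One way to do the "load once" part, assuming the per-language JSON files above (the paths and names here are placeholders):

import json
from functools import lru_cache
from pathlib import Path

# Placeholder location for the de-dupe files; not a real path in the repo.
DEDUPE_DIR = Path(__file__).parent / "dedupe"

@lru_cache(maxsize=None)  # parse each file once per run, not per find_version() call
def load_dedupe(language):
    """Load e.g. python-dedupe.json into a {product: entry} mapping."""
    path = DEDUPE_DIR / f"{language}-dedupe.json"
    if not path.exists():
        return {}
    with open(path) as f:
        return {entry["product"]: entry for entry in json.load(f)}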

Thoughts? Better ideas? I'm going to tag @XDRAGON2002 specifically since he laid the groundwork for our current parser API, but everyone's thoughts are welcome.

terriko commented 1 year ago

I'll also note that we talked a bunch about this issue with respect to the GSoC project we'd hoped to have on metadata:

This would be a lesser form of what we envisioned there, just laying the groundwork to allow for adding manual metadata rather than trying to make use of other data sources.

(If GSoC happens again next year that project idea may be offered again; we didn't have enough mentors available to take on someone to do it this year.)

terriko commented 1 year ago

Per discussion in #3193 and in today's team meeting:

PURL apparently has the ability to let us indicate that something came from python to help us de-dupe. @anthonyharrison is looking into parsing PURL data out of SBOMs, but that won't directly solve this problem because these scan results are from our language parsers and not from an SBOM scan.
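
For reference, the ecosystem lives in the purl "type" field. A minimal example using the packageurl-python library (the package and version here are just for illustration):

from packageurl import PackageURL

# The "type" field ("pypi") is the "this came from python" signal.
purl = PackageURL(type="pypi", name="docutils", version="0.19")
print(purl.to_string())  # pkg:pypi/docutils@0.19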

But if we're going to support PURL well, it seems like we're going to want a de-dupe table for it that would look something like this:

  1. PURL mapping (this would be where we'd store the language parser name and the product name)
  2. valid NVD CPE ({vendor, product}) mappings
  3. invalid NVD CPE mappings that should not be returned

And that looks pretty similar to what we had above, only we had {product, language} columns. I think maybe we could convert those to PURL internally right now?

So I think if we make a table that could handle both (I'm going to call it explicit_product_mapping for now), it would look something like this...


CREATE TABLE IF NOT EXISTS explicit_product_mapping (
    purl TEXT,
    valid_cpe_list TEXT,
    invalid_cpe_list TEXT,
    PRIMARY KEY (purl)
)

Note that I made the CPE mappings into lists, because that allows us to use PURL as an index for easier lookups. I could likely be convinced that there's something better we could be doing.
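
If the lists are stored as JSON text in those TEXT columns, a lookup might look like this (a sketch, not shipped code):

import json
import sqlite3

def get_explicit_mapping(conn, purl):
    """Return the valid/invalid CPE lists for a purl, or None if unmapped."""
    row = conn.execute(
        "SELECT valid_cpe_list, invalid_cpe_list "
        "FROM explicit_product_mapping WHERE purl = ?",
        (purl,),
    ).fetchone()
    if row is None:
        return None
    valid, invalid = row
    return {
        "valid": json.loads(valid) if valid else [],
        "invalid": json.loads(invalid) if invalid else [],
    }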

Looking at https://github.com/package-url/purl-spec right now, our python de-dupes for the existing open bugs would look like...

Note that I'm truncating all the version data for now because I don't feel like typing out extra stuff. We might want to use a wildcard instead of truncating in practice. (Adding a wildcard might make it easier to parse and make the things we use more correct and extensible.)
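
As a rough reconstruction using the docutils example below (the exact CPE vendor/product values are assumptions), a row might look like:

# Hypothetical row for the table above; versions truncated as noted.
docutils_row = {
    "purl": "pkg:pypi/docutils",
    "valid_cpe_list": [],  # no known-good CPE mapping yet
    "invalid_cpe_list": [
        {"vendor": "nim-lang", "product": "docutils"},
    ],
}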

So when we go scan something named docutils from a requirements.txt file, it will drop all the nim-lang results. For now, we probably want to err on the side of matching, so if someone adds a new docutils we should display it until it's added to the de-dupe list. That might not be the case if we actually do know what the valid CPE should be; we can tweak that if we start getting some of those in the future.
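
In code, that err-on-the-side-of-matching policy could be as simple as this, reusing the mapping shape from the lookup sketch above:

def apply_explicit_mapping(pairs, mapping):
    """pairs: (vendor, product) tuples from the CVE lookup."""
    if mapping is None:
        return list(pairs)  # no de-dupe entry yet: show everything
    invalid = {(m["vendor"], m["product"]) for m in mapping["invalid"]}
    return [p for p in pairs if p not in invalid]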

terriko commented 1 year ago

Then the other side of this is how we want to store, update and load the data into the database where we'll actually use it:

terriko commented 1 year ago

Adding some notes on my current architecture:

So my current thinking on how I'm going to break this up:

  1. Add purl generation to the language parsers
  2. Set up some sort of purl2Not directory and load-in code, with instructions about how to add to it
  3. Set up purl2Not as a data source so cve-bin-tool can grab new data from us
  4. Make sure the language parsers use our purl2Not data
  5. Include purl2cpe and use it for language parsers
  6. Add purl2cpe usage for SBOM

Number 2 is still kind of a doozy, so I might need to break it down some more.
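
For number 2, one possible shape (the directory layout and file contents here are assumptions, just to have something concrete to poke at):

import json
from pathlib import Path

# A purl2Not directory could be a set of JSON files, one per purl type,
# each mapping a purl to the CPE {vendor, product} pairs that should NOT
# be returned for it.
def load_purl2not(directory):
    table = {}
    for path in Path(directory).glob("*.json"):
        with open(path) as f:
            table.update(json.load(f))
    return table

# e.g. purl2not/pypi.json might contain:
# {"pkg:pypi/docutils": [{"vendor": "nim-lang", "product": "docutils"}]}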

terriko commented 6 months ago

I think at this point this has been replaced by the planned work in #3771 so I'll close this as a duplicate.