intel / cve-bin-tool

The CVE Binary Tool helps you determine if your system includes known vulnerabilities. You can scan binaries for over 200 common, vulnerable components (openssl, libpng, libxml2, expat and others), or if you know the components used, you can get a list of known vulnerabilities associated with an SBOM or a list of components and versions.
https://cve-bin-tool.readthedocs.io/en/latest/
GNU General Public License v3.0

GSoC 2023 Project idea: Improved product representation & meta-info about products. #2633

Closed: terriko closed this issue 9 months ago

terriko commented 1 year ago

cve-bin-tool: Improved product representation & meta-info about products.

Project description

An example of where this gets messy:

The text-parsing library Beautiful Soup is available in pip as beautifulsoup4, on Debian-based systems as python3-bs4, and on Fedora-based systems as python-beautifulsoup4. If we were detecting this package in all those formats, it would be useful to have metadata that told us they all refer to the same upstream component.
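As a purely illustrative sketch, that naming information could be as simple as a mapping keyed by packaging ecosystem (the key names here are assumptions, not an agreed format):

```python
# Illustrative only: one upstream project, three different package names.
BEAUTIFULSOUP_NAMES = {
    "pypi": "beautifulsoup4",
    "debian": "python3-bs4",
    "fedora": "python-beautifulsoup4",
}
```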

Skills

Difficulty level

Project Length

GSoC Participants Only

This issue is a potential project idea for GSoC 2023, and is reserved for completion by a selected GSoC contributor. Please do not work on it outside of that program. If you'd like to apply to do it through GSoC, please start by reading #2230.

terriko commented 1 year ago

Some examples of existing metadata:

If anyone finds more while doing research for this project, please mention them in the comments below. I'd expect to have a similar link for all the language parsers we use as well as all the Linux package formats we support.

terriko commented 1 year ago

Compiling some thoughts from other discussions...

How should we store this data?

This is an open question. Some options:

  1. Directly in checker files (e.g. adding fields like SOURCE_URL and SPDX_LICENSE to the same Python file used for checkers; see the sketch after this list).
    • This has both advantages (easy for humans to edit, sits alongside other information we store) and disadvantages (how do we handle distinguishing binary checkers? Is this going to make adding a checker seem incredibly complicated?)
  2. In a separate data structure (either another Python file or something like JSON).
  3. In the database (not my favourite, since it's not as easy to do pull requests, but you might find it viable to load easier-to-edit JSON files into the database).
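To make option 1 concrete, here's a rough sketch of a checker with metadata fields bolted on. The existing fields follow the current checker layout; the proposed ones (SOURCE_URL, SPDX_LICENSE, PACKAGE_NAMES) are illustrative names, not a settled design:

```python
from cve_bin_tool.checkers import Checker


class CurlChecker(Checker):
    # Existing checker fields, as in today's binary checkers
    CONTAINS_PATTERNS = [r"An unknown option was passed in to libcurl"]
    FILENAME_PATTERNS = [r"curl"]
    VERSION_PATTERNS = [r"curl[ /-](\d+\.\d+\.\d+)"]
    VENDOR_PRODUCT = [("haxx", "curl"), ("haxx", "libcurl")]

    # Hypothetical metadata additions (names invented for this sketch)
    SOURCE_URL = "https://github.com/curl/curl"
    SPDX_LICENSE = "curl"
    PACKAGE_NAMES = {"debian": "curl", "fedora": "curl"}
```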

A GSoC proposal could suggest any of these (or even something I haven't listed), but you will need to think through your choice and what the data structure would look like, and put that into your proposal (it's one of the ways we'll gauge whether you understand the problem well enough to break it down and solve it).

If we can get this data from various meta-data sources, why do we need to store anything?

The short answer is that most of the metadata sources we're considering are inconsistent -- licenses don't always match, source URLs will be different, etc. So we're expecting to need to store some of it to provide consistent output. You might think of it as a sort of data triage, similar to the way we might add vulnerability triage information before producing a final report.

Some of this metadata may also be used like signatures to identify or de-duplicate components (imagine, for example, combining information from SBOMs made with different tools).
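As an invented example of what such a triage record might capture (the sources and license strings here are made up to show the shape):

```python
# Invented example: each source claims something slightly different,
# and we store a curated value so our output stays consistent.
LIBFOO_LICENSE = {
    "observed": {
        "pypi": "MIT",
        "debian": "Expat",  # Debian's name for the MIT license
        "fedora": "MIT",
    },
    "curated": "MIT",  # the value we'd emit in SBOM output
}
```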

How do we decide which meta data sources are most valuable/worth keeping?

Ah, now that's a big open question. (In fact, it is the big open question for most Big Data projects.) But it may also be the wrong question, since we may need to keep even lower-quality sources of info in order to help us de-duplicate and identify components.

I expect we'll find that some data sources are very useful and complete, and some are incomplete but useful when they're there, and some may just not be useful enough to include. It's also going to depend on what data output is most useful to users, and also what people have as input. Figuring that out will likely happen during GSoC as you see the data and try to figure out how best to use it.

We're expecting some amount of exploration of the data and making comparison/mapping tables and whatnot to inform the code, rather than just mapping everything straight into a data structure. So you should expect to keep a lot of different data, and to have a structure with something like five different "source_url"-style fields that we might match on in future if we see any one of them in, say, an SPDX input. But we might decide to use only one of those as our preferred source_url in SPDX output, even though we know and recognize them all as referring to the same product.
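A rough sketch of what that structure might look like (all field names and the vendor/product values below are assumptions for discussion, not an agreed schema):

```python
from dataclasses import dataclass, field


@dataclass
class ProductMetadata:
    vendor: str
    product: str
    preferred_source_url: str
    # Alternate URLs seen in other metadata sources (PyPI, Debian, Fedora...)
    known_source_urls: set[str] = field(default_factory=set)

    def matches_url(self, url: str) -> bool:
        """True if a URL from, say, an SPDX input refers to this product."""
        return url == self.preferred_source_url or url in self.known_source_urls


bs4 = ProductMetadata(
    vendor="crummy",
    product="beautifulsoup4",
    preferred_source_url="https://www.crummy.com/software/BeautifulSoup/",
    known_source_urls={"https://pypi.org/project/beautifulsoup4/"},
)
assert bs4.matches_url("https://pypi.org/project/beautifulsoup4/")
```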

terriko commented 1 year ago

More notes from private queries. PLEASE ASK QUESTIONS HERE IF YOU CAN, it makes my life so much easier if I don't have to keep paraphrasing.

Another source of metadata: https://github.com/intel/cve-bin-tool/issues/2819 mentions a site that might be useful for getting the names of packages across different Linux distros. There will probably be more sources out in the wide world that we want to look at. Starting with the ones you know and leaving the code open so that more can be added later is sufficient for GSoC.

What to do with the data once you have it: figure out what's appropriate to store (or cache and verify/update), and figure out how to use it. There are two big use cases I'd expect a completed GSoC project to support:

  1. SBOM export (see PR https://github.com/intel/cve-bin-tool/pull/2817 ). An SBOM would usually have license data, a source URL, etc., and we could use the metadata to produce that. Potentially people will want that in HTML/PDF reports as well.
  2. For some of the metadata sources, you'll be able to use the data to improve NVD lookups. Right now, for language parsers, we just search for "product" in NVD and hope we got the right result (or results) -- the metadata would be able to say "if requirements.txt wants foobarbaz, I need to look up {foobarbaz, foobarbaz} and {mikethedev, foobarbaz}" (see the sketch after this list). We do this already with the binary checkers, but we have no way to do the same with the language parsers, because we don't have a lookup table of data about language packages the way we do for binary packages.
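A sketch of what that lookup table might look like for the pip ecosystem, reusing the foobarbaz example above (names and shape are assumptions; the real design is part of the project):

```python
# Hypothetical: map a requirements.txt name to the NVD {vendor, product}
# pairs worth querying, mirroring what VENDOR_PRODUCT gives binary checkers.
PYPI_VENDOR_PRODUCT = {
    "foobarbaz": [("foobarbaz", "foobarbaz"), ("mikethedev", "foobarbaz")],
}


def nvd_queries(package_name):
    """Return every (vendor, product) pair worth searching in NVD."""
    # Fall back to today's behaviour: search the bare product name.
    return PYPI_VENDOR_PRODUCT.get(package_name, [(None, package_name)])
```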

There's also some SBOM/scan management tooling that I think we might want to build:

For questions about "live" sources, you probably want to ask @anthonyharrison, because he's got a few that he thinks would be useful. There are a bunch of data sources where we don't necessarily have the licenses/rights to make copies, but which let us look things up via an API or website or whatever. (Note that if a data source doesn't have a license, we should not assume we're allowed to cache it. NVD's data is licensed for public use, but not every other source is the same.)

anthonyharrison commented 1 year ago

Re: license detection for SBOM inclusion. For source files this is relatively straightforward, but extracting license information from a binary would be an interesting challenge. I was wondering if we could create a new checker that looks for SPDX-License-Identifier in a binary. Thoughts @terriko ?
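A first cut might just scan the binary's bytes for the tag; a rough sketch for discussion, not a real checker:

```python
import re

# Matches single SPDX IDs like "MIT" or "Apache-2.0"; compound
# expressions (e.g. "MIT OR Apache-2.0") would need a richer pattern.
SPDX_TAG = re.compile(rb"SPDX-License-Identifier:\s*([A-Za-z0-9.+-]+)")


def find_spdx_ids(path):
    with open(path, "rb") as f:
        data = f.read()
    return [m.group(1).decode("ascii", "replace") for m in SPDX_TAG.finditer(data)]
```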

terriko commented 1 year ago

@anthonyharrison I'm not sure how often that would even show up, since the SPDX license identifier is often a comment at the top of a source file rather than a string that survives into the binary, but we won't know until we check. Maybe a viable stretch goal for this GSoC project, or we could move it to a separate issue that someone could investigate for fun outside the scope of this project.

anthonyharrison commented 1 year ago

@terriko I agree, but we also scan source files, so we might be able to find something - maybe in metadata? Agreed it's worth investigating - I already do something similar when building an SBOM from a source file.

terriko commented 1 year ago

On a related note, this recently filed bug is exactly the sort of thing I want to use metadata for:

metabiswadeep commented 1 year ago

@anthonyharrison Could you please list which 'live sources' I need to search to get the required metadata for this?

metabiswadeep commented 1 year ago

Moreover, in the NVD advanced search at https://nvd.nist.gov/vuln/search I only find CVE identifier, product, vendor, CVSS metrics, and published date range. So should the N:N search being discussed include only these parameters? And if so, where can I find metadata for things like CVE identifiers and CVSS metrics? I searched for them in SBOMs but could not find them.

anthonyharrison commented 1 year ago

@metabiswadeep Have a look at https://release-monitoring.org/ to help with the issues with package names. This should help with component synonyms.
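For example, Anitya (the software behind release-monitoring.org) exposes a JSON API; the sketch below is from memory of the v2 projects endpoint and should be verified against the current API docs:

```python
import requests

# Endpoint and response field names should be checked against the Anitya docs.
resp = requests.get(
    "https://release-monitoring.org/api/v2/projects/",
    params={"name": "beautifulsoup4"},
    timeout=30,
)
for project in resp.json().get("items", []):
    print(project.get("name"), project.get("homepage"))
```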

SBOMs may have PURL references (they won't have CVE or CVSS data). The PURL spec indicates the namespace to search for the product, e.g. pkg:pypi/cve-bin-tool@3.2 indicates that the component can be found on PyPI.
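For example, the packageurl-python library can decompose a PURL into its parts (a quick sketch):

```python
from packageurl import PackageURL

purl = PackageURL.from_string("pkg:pypi/cve-bin-tool@3.2")
print(purl.type, purl.name, purl.version)  # pypi cve-bin-tool 3.2
```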

The CVE database is being updated to a new JSON schema that will include additional information shortly (although the data hasn't fully transitioned yet).

metabiswadeep commented 1 year ago

@anthonyharrison So should I not fetch any live sources to extract data from, and instead just add the data manually, like checkers, into separate files?

metabiswadeep commented 1 year ago

> SBOMs may have PURL references (they won't have CVE or CVSS data). The PURL spec indicates the namespace to search for the product, e.g. pkg:pypi/cve-bin-tool@3.2 indicates that the component can be found on PyPI.

So how do I add additional metadata to improve NVD search results (currently cve-bin-tool only uses vendor, product, and version) if I don't have metadata corresponding to the parameters of the NVD API?

metabiswadeep commented 1 year ago

@anthonyharrison Also, where can I get a list of all the potential parameters (like vendor, product, and version) that can be used when conducting NVD API lookups?

terriko commented 1 year ago

@metabiswadeep The {vendor, product} pair used by NVD is part of the CPE ID. You can grab the whole list of them here, I think: https://nvd.nist.gov/products/cpe
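For a quick look at how vendor and product sit inside a CPE 2.3 string (naive split for illustration; real CPE parsing has to handle escaped colons):

```python
# Layout: cpe:2.3:part:vendor:product:version:update:edition:...
cpe = "cpe:2.3:a:haxx:curl:7.80.0:*:*:*:*:*:*:*"
_, _, part, vendor, product, version, *rest = cpe.split(":")
print(vendor, product, version)  # haxx curl 7.80.0
```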

Some notes:

terriko commented 9 months ago

Closing this (and all the other leftover GSoC ideas from previous years) in order to help folks focus on the new project idea descriptions.