intel / cve-bin-tool

The CVE Binary Tool helps you determine if your system includes known vulnerabilities. You can scan binaries for over 200 common, vulnerable components (openssl, libpng, libxml2, expat and others), or if you know the components used, you can get a list of known vulnerabilities associated with an SBOM or a list of components and versions.
https://cve-bin-tool.readthedocs.io/en/latest/
GNU General Public License v3.0

GSoC 2023 Project idea: Improved product representation & meta-info about products. #2633

Closed: terriko closed this issue 9 months ago

terriko commented 1 year ago

cve-bin-tool: Improved product representation & meta-info about products.

Project description

An example of where this gets messy:

The text-parsing library Beautiful Soup is available in pip as beautifulsoup4, on Debian-based systems as python3-bs4, and on Fedora-based systems as python-beautifulsoup4. If we were detecting this package in all those formats, it would be useful to have metadata that told us they all refer to the same upstream component.
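As a purely illustrative sketch, that naming information could be as simple as a mapping keyed by packaging ecosystem (the key names here are assumptions, not an agreed format):

```python
# Illustrative only: one upstream project, three different package names.
BEAUTIFULSOUP_NAMES = {
    "pypi": "beautifulsoup4",
    "debian": "python3-bs4",
    "fedora": "python-beautifulsoup4",
}
```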

Skills

Difficulty level

Project Length

GSoC Participants Only

This issue is a potential project idea for GSoC 2023, and is reserved for completion by a selected GSoC contributor. Please do not work on it outside of that program. If you'd like to apply to do it through GSoC, please start by reading #2230.

terriko commented 1 year ago

Some examples of existing metadata:

If anyone finds more while doing research for this project, please mention them in the comments below. I'd expect to have a similar link for all the language parsers we use as well as all the Linux package formats we support.

terriko commented 1 year ago

Compiling some thoughts from other discussions...

How should we store this data?

This is an open question. Some options:

  1. Directly in checker files (e.g. adding fields like SOURCE_URL and SPDX_LICENSE to the same Python file used for checkers; see the sketch after this list).
    • This has both advantages (easy for humans to edit, sits alongside other information we store) and disadvantages (how do we handle distinguishing binary checkers? Is this going to make adding a checker seem incredibly complicated?)
  2. In a separate data structure (either another Python file or something like JSON).
  3. In the database (not my favourite, since it's not as easy to do pull requests, but you might find it viable to load easier-to-edit JSON files into the database).
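To make option 1 concrete, here's a rough sketch of a checker with metadata fields bolted on. The existing fields follow the current checker layout; the proposed ones (SOURCE_URL, SPDX_LICENSE, PACKAGE_NAMES) are illustrative names, not a settled design:

```python
from cve_bin_tool.checkers import Checker


class CurlChecker(Checker):
    # Existing checker fields, as in today's binary checkers
    CONTAINS_PATTERNS = [r"An unknown option was passed in to libcurl"]
    FILENAME_PATTERNS = [r"curl"]
    VERSION_PATTERNS = [r"curl[ /-](\d+\.\d+\.\d+)"]
    VENDOR_PRODUCT = [("haxx", "curl"), ("haxx", "libcurl")]

    # Hypothetical metadata additions (names invented for this sketch)
    SOURCE_URL = "https://github.com/curl/curl"
    SPDX_LICENSE = "curl"
    PACKAGE_NAMES = {"debian": "curl", "fedora": "curl"}
```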

A GSoC proposal could suggest any of these (or even something I haven't listed), but you will need to think through your choice and what the data structure would look like, and put that into your proposal (it's one of the ways we'll gauge whether you understand the problem well enough to break it down and solve it).

If we can get this data from various meta-data sources, why do we need to store anything?

The short answer is that most of the metadata sources we're considering are inconsistent -- licenses don't always match, source URLs will be different, etc. So we're expecting to need to store some of it to provide consistent output. You might think of it as a sort of data triage, similar to the way we might add vulnerability triage information before producing a final report.

Some of this metadata may also be used like signatures to identify or de-duplicate components (imagine, for example, combining information from SBOMs made with different tools).
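As an invented example of what such a triage record might capture (the sources and license strings here are made up to show the shape):

```python
# Invented example: each source claims something slightly different,
# and we store a curated value so our output stays consistent.
LIBFOO_LICENSE = {
    "observed": {
        "pypi": "MIT",
        "debian": "Expat",  # Debian's name for the MIT license
        "fedora": "MIT",
    },
    "curated": "MIT",  # the value we'd emit in SBOM output
}
```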

How do we decide which meta data sources are most valuable/worth keeping?

Ah, now that's a big open question. (In fact, it is the big open question for most Big Data projects.) But it may also be the wrong question, since we may need to keep even lower-quality sources of info in order to help us de-duplicate and identify components.

I expect we'll find that some data sources are very useful and complete, and some are incomplete but useful when they're there, and some may just not be useful enough to include. It's also going to depend on what data output is most useful to users, and also what people have as input. Figuring that out will likely happen during GSoC as you see the data and try to figure out how best to use it.

We're expecting some amount of exploration of the data and making comparison/mapping tables and whatnot to inform the code, rather than just mapping everything straight into a data structure. So you should expect to keep a lot of different data, and to have a structure with something like five different "source_url"-style fields that we might match on in future if we see any one of them in, say, an SPDX input. But we might decide to use only one of those as our preferred source_url in SPDX output, even though we know and recognize them all as referring to the same product.
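A rough sketch of what that structure might look like (all field names and the vendor/product values below are assumptions for discussion, not an agreed schema):

```python
from dataclasses import dataclass, field


@dataclass
class ProductMetadata:
    vendor: str
    product: str
    preferred_source_url: str
    # Alternate URLs seen in other metadata sources (PyPI, Debian, Fedora...)
    known_source_urls: set[str] = field(default_factory=set)

    def matches_url(self, url: str) -> bool:
        """True if a URL from, say, an SPDX input refers to this product."""
        return url == self.preferred_source_url or url in self.known_source_urls


bs4 = ProductMetadata(
    vendor="crummy",
    product="beautifulsoup4",
    preferred_source_url="https://www.crummy.com/software/BeautifulSoup/",
    known_source_urls={"https://pypi.org/project/beautifulsoup4/"},
)
assert bs4.matches_url("https://pypi.org/project/beautifulsoup4/")
```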

terriko commented 1 year ago

More notes from private queries. PLEASE ASK QUESTIONS HERE IF YOU CAN, it makes my life so much easier if I don't have to keep paraphrasing.

Another source of metadata: https://github.com/intel/cve-bin-tool/issues/2819 mentions a site that might be useful for getting the names of packages across different Linux distros. There will probably be more sources out in the wide world that we want to look at. Starting with the ones you know and leaving the code open so that more can be added later is sufficient for GSoC.

What to do with the data once you have it: figure out what's appropriate to store (or cache and verify/update), and figure out how to use it. There are two big use cases I'd expect a completed GSoC project to support:

  1. SBOM export (see PR https://github.com/intel/cve-bin-tool/pull/2817 ). An SBOM would usually have license data, a source URL, etc., and we could use the metadata to produce that. Potentially people will want that in HTML/PDF reports as well.
  2. For some of the metadata sources, you'll be able to use the data to improve NVD lookups. Right now, for language parsers, we just search for "product" in NVD and hope we got the right result (or results) -- the metadata would be able to say "if requirements.txt wants foobarbaz, I need to look up {foobarbaz, foobarbaz} and {mikethedev, foobarbaz}" (see the sketch after this list). We do this already with the binary checkers, but we have no way to do the same with the language parsers, because we don't have a lookup table of data about language packages the way we do for binary packages.
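A sketch of what that lookup table might look like for the pip ecosystem, reusing the foobarbaz example above (names and shape are assumptions; the real design is part of the project):

```python
# Hypothetical: map a requirements.txt name to the NVD {vendor, product}
# pairs worth querying, mirroring what VENDOR_PRODUCT gives binary checkers.
PYPI_VENDOR_PRODUCT = {
    "foobarbaz": [("foobarbaz", "foobarbaz"), ("mikethedev", "foobarbaz")],
}


def nvd_queries(package_name):
    """Return every (vendor, product) pair worth searching in NVD."""
    # Fall back to today's behaviour: search the bare product name.
    return PYPI_VENDOR_PRODUCT.get(package_name, [(None, package_name)])
```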

There's also some SBOM/scan management tooling that I think we might want to build:

For questions about "live" sources, you probably want to ask @anthonyharrison, because he's got a few that he thinks would be useful. There are a bunch of data sources where we don't necessarily have the licenses/rights to make copies, but which let us look things up via an API or website or whatever. (Note that if a data source doesn't have a license, we should not assume we're allowed to cache it. NVD's data is licensed for public use, but not every other source is the same.)

anthonyharrison commented 1 year ago

Re: license detection for SBOM inclusion. For source files this is relatively straightforward, but extracting license information from a binary would be an interesting challenge. I was wondering if we could create a new checker that looks for SPDX-License-Identifier in a binary. Thoughts @terriko ?
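A first cut might just scan the binary's bytes for the tag; a rough sketch for discussion, not a real checker:

```python
import re

# Matches single SPDX IDs like "MIT" or "Apache-2.0"; compound
# expressions (e.g. "MIT OR Apache-2.0") would need a richer pattern.
SPDX_TAG = re.compile(rb"SPDX-License-Identifier:\s*([A-Za-z0-9.+-]+)")


def find_spdx_ids(path):
    with open(path, "rb") as f:
        data = f.read()
    return [m.group(1).decode("ascii", "replace") for m in SPDX_TAG.finditer(data)]
```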

terriko commented 1 year ago

@anthonyharrison I'm not sure how often that would even show up, since the SPDX license identifier is often a comment at the top of a source file rather than a string that survives into the binary, but we won't know until we check. Maybe a viable stretch goal for this GSoC project, or we could move it to a separate issue that someone could investigate for fun outside the scope of this project.

anthonyharrison commented 1 year ago

@terriko I agree, but we also scan source files, so we might be able to find something - maybe in metadata? Agreed it's worth investigating - I already do something similar when building an SBOM from a source file.

terriko commented 1 year ago

On a related note, this recently filed bug is exactly the sort of thing I want to use metadata for:

metabiswadeep commented 1 year ago

@anthonyharrison Could you please list which 'live sources' I need to search to get the required metadata for this?

metabiswadeep commented 1 year ago

Moreover, in the NVD advanced search at https://nvd.nist.gov/vuln/search I only find CVE identifier, product, vendor, CVSS metrics, and published date range. So should the N:N search being discussed include only these parameters? And if so, where can I find metadata for things like CVE identifiers and CVSS metrics? I searched for them in SBOMs but could not find them.

anthonyharrison commented 1 year ago

@metabiswadeep Have a look at https://release-monitoring.org/ to help with the issues with package names. This should help with component synonyms.
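For example, Anitya (the software behind release-monitoring.org) exposes a JSON API; the sketch below is from memory of the v2 projects endpoint and should be verified against the current API docs:

```python
import requests

# Endpoint and response field names should be checked against the Anitya docs.
resp = requests.get(
    "https://release-monitoring.org/api/v2/projects/",
    params={"name": "beautifulsoup4"},
    timeout=30,
)
for project in resp.json().get("items", []):
    print(project.get("name"), project.get("homepage"))
```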

SBOMs may have PURL references (they won't have CVE or CVSS data). The PURL spec indicates the namespace to search for the product, e.g. pkg:pypi/cve-bin-tool@3.2 indicates that the component can be found on PyPI.
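For example, the packageurl-python library can decompose a PURL into its parts (a quick sketch):

```python
from packageurl import PackageURL

purl = PackageURL.from_string("pkg:pypi/cve-bin-tool@3.2")
print(purl.type, purl.name, purl.version)  # pypi cve-bin-tool 3.2
```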

The CVE database is being updated to a new JSON schema that will include additional information shortly (although the data hasn't fully transitioned yet).

metabiswadeep commented 1 year ago

@anthonyharrison So should I not fetch any live sources to extract data from, and instead just add the data manually, like checkers, into separate files?

metabiswadeep commented 1 year ago

> SBOMs may have PURL references (they won't have CVE or CVSS data). The PURL spec indicates the namespace to search for the product, e.g. pkg:pypi/cve-bin-tool@3.2 indicates that the component can be found on PyPI.

So how do I add additional metadata to improve NVD search results (currently cve-bin-tool only uses vendor, product, and version) if I don't have metadata corresponding to the parameters of the NVD API?

metabiswadeep commented 1 year ago

@anthonyharrison Also, where can I get a list of all the potential parameters (like vendor, product, and version) that can be used when conducting NVD API lookups?

terriko commented 1 year ago

@metabiswadeep The {vendor, product} pair used by NVD is part of the CPE ID. You can grab the whole list of them here, I think: https://nvd.nist.gov/products/cpe
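For a quick look at how vendor and product sit inside a CPE 2.3 string (naive split for illustration; real CPE parsing has to handle escaped colons):

```python
# Layout: cpe:2.3:part:vendor:product:version:update:edition:...
cpe = "cpe:2.3:a:haxx:curl:7.80.0:*:*:*:*:*:*:*"
_, _, part, vendor, product, version, *rest = cpe.split(":")
print(vendor, product, version)  # haxx curl 7.80.0
```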

Some notes:

terriko commented 9 months ago

Closing this (and all the other leftover GSoC ideas from previous years) in order to help folks focus on the new project idea descriptions.