Closed: terriko closed this issue 9 months ago.
Some examples of existing metadata:
If anyone finds more while doing research for this project, please mention them in comments below. I'd expect to have a similar link for all the language parsers we use as well as all the linux package formats we support.
Compiling some thoughts from other discussions...
This is an open question. Some options:
- Adding new fields such as SOURCE_URL, SPDX_LICENSE (into the same python file used for checkers).
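As a purely illustrative sketch of that option (the class and field names below are invented; VENDOR_PRODUCT loosely mirrors the kind of data checkers already carry, and SOURCE_URL / SPDX_LICENSE are the proposed additions):

```python
# Hypothetical sketch only, not the real checker API: extra metadata constants
# living in the same python file as an existing checker.


class LibFooChecker:
    # existing checker-style data: vendor/product pairs used for CPE lookups
    VENDOR_PRODUCT = [("foo_project", "libfoo")]

    # proposed metadata additions mentioned above
    SOURCE_URL = "https://example.com/libfoo"  # upstream source location
    SPDX_LICENSE = "MIT"  # SPDX license identifier
```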
A GSoC proposal could suggest any of these (or even something I haven't listed), but you will need to think about your choice and what the data structure would look like, and put that into your proposal (it's one of the ways we'll use to gauge whether you understand the problem well enough to break it down and solve it).
The short answer is that most of the metadata sources we're considering are inconsistent -- licenses don't always match, source URLs will be different, etc. So we're expecting to need to store some of this data ourselves to provide consistent output. You might think of it sort of like data triage information, similar to the way we might add vulnerability triage information before producing a final report.
Some of this metadata may also be used like signatures to identify or de-duplicate components (imagine, for example, combining information in SBOMs made with different tools).
Ah, now that's a big open question. (In fact, it's the big open question for most Big Data projects.) But it may also be the wrong question, since we may need to keep even lower-quality sources of info in order to help us de-duplicate and identify components.
I expect we'll find that some data sources are very useful and complete, and some are incomplete but useful when they're there, and some may just not be useful enough to include. It's also going to depend on what data output is most useful to users, and also what people have as input. Figuring that out will likely happen during GSoC as you see the data and try to figure out how best to use it.
We're expecting some amount of exploration of the data and making comparison/mapping tables and whatnot to inform the code, rather than just mapping them all into a data structure. So expect to keep a lot of different data and have a structure with something like 5 different "source_url"-style fields that we might match on in future if we see any one of them in, say, an SPDX input. But we might decide to use only one of those as our preferred source_url in SPDX output, even though we know and recognize them all as being the same product.
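To make the "several source_url-style fields" idea concrete, here's a rough sketch under invented field names (not an existing cve-bin-tool structure): all the variants are kept for matching against future inputs, while one is chosen as the preferred value for output.

```python
# Illustrative sketch of a metadata record with multiple "source_url"-style
# fields; all are kept for matching, one is chosen for output. Field names
# are invented for illustration, not an existing cve-bin-tool structure.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductMetadata:
    vendor: str
    product: str
    # candidate URLs collected from different metadata sources
    pypi_home_page: Optional[str] = None
    debian_homepage: Optional[str] = None
    fedora_url: Optional[str] = None
    spdx_download_location: Optional[str] = None
    release_monitoring_url: Optional[str] = None

    def preferred_source_url(self) -> Optional[str]:
        """Pick one URL for output, even though all variants are recognized."""
        for url in (
            self.spdx_download_location,
            self.pypi_home_page,
            self.debian_homepage,
            self.fedora_url,
            self.release_monitoring_url,
        ):
            if url:
                return url
        return None
```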
More notes from private queries. PLEASE ASK QUESTIONS HERE IF YOU CAN; it makes my life so much easier if I don't have to keep paraphrasing.
Another source of metadata: https://github.com/intel/cve-bin-tool/issues/2819 mentions a site that might be useful for getting names of packages across different linux distros. There probably will be more sources in the wide world we want to look at. Starting with the ones you know and leaving the code so that more can be added later is sufficient for GSoC.
What to do with the data once you have it: figure out what's appropriate to store (or cache & verify/update), and figure out how to use it. There are two big use cases I'd expect a completed GSoC project to support. The first is improving CVE lookups for components found by the language parsers: if we know a detected package might be filed in NVD under {foobarbaz, foobarbaz} or {mikethedev, foobarbaz}, we can search both pairs. (We do this already with the binary checkers, but we don't have any way to do the same with the language parsers because we don't have a lookup table of data about language packages the way we do with binary packages.) There's also some SBOM/scan management tooling that I think we might want to do:
For questions about "live" sources, you probably want to ask @anthonyharrison, because he's got a few that he thinks would be useful. There are a bunch of data sources where we don't necessarily have the licenses/rights to make copies, but they let us look things up via an API or website or whatever. (Note that if a data source doesn't have a license, we should not assume we're allowed to cache it. NVD's data is licensed for public use, but not every other source is the same.)
Re License detection for SBOM inclusion. For source files this is relatively straightforward but extraction of license information from a binary would be an interesting challenge. I was wondering if we could create a new checker looking for SPDX-License-Identifier in a binary. Thoughts @terriko ?
@anthonyharrison I'm not sure how often that would even show up since the SPDX license is often a comment at the top of file rather than a string, but we won't know until we check. Maybe a viable stretch goal in this GSoC project, or we could move it to a separate issue that someone could investigate for fun outside of the scope of this project.
@terriko I agree but we also scan source files so we might be able to find something - maybe in metadata? Agree it is worthy of an investigation - I already do something when building an SBOM from a source file.
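If someone does want to experiment with this, a first pass doesn't need a real checker at all; a minimal sketch is just scanning a file's raw bytes for the tag (whether the tag survives compilation often enough to be useful is exactly the open question above):

```python
# Quick experiment sketch: look for SPDX-License-Identifier tags in a file's
# raw bytes. Not a cve-bin-tool checker, just a way to see what turns up.
import re

SPDX_TAG = re.compile(rb"SPDX-License-Identifier:\s*([A-Za-z0-9.+-]+)")


def find_spdx_identifiers(path):
    """Return any SPDX license identifiers found in the raw bytes of a file."""
    with open(path, "rb") as f:
        data = f.read()
    return [m.group(1).decode("ascii", "replace") for m in SPDX_TAG.finditer(data)]
```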
On a related note, this recently filed bug is exactly the sort of thing I want to use metadata for:
@anthonyharrison Could you please list which 'live sources' I need to search to get the required metadata for this?
Moreover, in the NVD advanced search at https://nvd.nist.gov/vuln/search I only find CVE identifier, product, vendor, CVSS metrics, and published date range. So should the N:N search being described include only these parameters? And if so, where can I find metadata for things like CVE identifiers and CVSS metrics? I searched for them in SBOMs but could not find them.
@metabiswadeep Have a look at https://release-monitoring.org/ to help with the issues with package names. This should help with component synonyms.
SBOMs may have PURL references (they won't have CVE or CVSS data). The PURL spec will indicate the namespace to search for the product e.g. PURL:pypi/cve-bin-tool@3.2 will indicate that the component can be found on PyPI.
The CVE database is being updated with a new JSON schema that will include additional information shortly (although the data hasn't fully transitioned yet).
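One note on the PURL example above: the spec's scheme is pkg: (e.g. pkg:pypi/cve-bin-tool@3.2), and the packageurl-python library parses these properly. The toy sketch below just shows the idea of pulling the ecosystem out of a simple PURL so we know where to look a component up; it ignores namespaces, qualifiers and subpaths.

```python
# Toy sketch: extract (ecosystem, name, version) from a simple PURL. Real code
# should use the packageurl-python library and handle the full spec.
def purl_ecosystem_and_name(purl):
    body = purl.split(":", 1)[1]            # drop the "pkg" scheme
    path, _, version = body.partition("@")  # split off the version, if any
    parts = path.split("/")
    return parts[0], parts[-1], version or None


# purl_ecosystem_and_name("pkg:pypi/cve-bin-tool@3.2") -> ("pypi", "cve-bin-tool", "3.2")
```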
@anthonyharrison So do I not fetch data from any live sources, and instead just add the data manually, like checkers, into separate files?
> SBOMs may have PURL references (they won't have CVE or CVSS data). The PURL spec will indicate the namespace to search for the product e.g. PURL:pypi/cve-bin-tool@3.2 will indicate that the component can be found on PyPI.
So how do I add additional metadata for improving NVD search results (now cve-bin-tool only uses vendor, product and version) if I do not have the appropriate metadata corresponding to the parameters of the NVD API?
@anthonyharrison Also where can I get a list of all potential parameters (like vendor, product and version) which can be used while conducting NVD API lookups?
@metabiswadeep the {vendor, product} pair used by NVD is part of the CPE id. You can grab the whole list of them here, I think: https://nvd.nist.gov/products/cpe
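Putting those two answers together (synonym pairs plus the NVD parameters), here's a hedged sketch of what a lookup could look like. The alias table reuses the made-up foobarbaz example pairs from earlier in the thread, and the endpoint/parameter names are from the NVD 2.0 REST API as I understand it, so verify them against the current NVD documentation.

```python
# Hedged sketch: search NVD for CVEs under every {vendor, product} pair a
# detected package might be filed as. Alias data and API parameter names here
# are illustrative assumptions, not existing cve-bin-tool code.
import requests

NVD_CVE_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"

# e.g. a package detected as "foobarbaz" might be filed under either pair
VENDOR_PRODUCT_ALIASES = {
    "foobarbaz": [("foobarbaz", "foobarbaz"), ("mikethedev", "foobarbaz")],
}


def cve_ids_for_package(package_name):
    """Query NVD once per candidate {vendor, product} pair for a package."""
    pairs = VENDOR_PRODUCT_ALIASES.get(package_name, [(package_name, package_name)])
    found = set()
    for vendor, product in pairs:
        match = f"cpe:2.3:a:{vendor}:{product}"
        resp = requests.get(
            NVD_CVE_API, params={"virtualMatchString": match}, timeout=30
        )
        resp.raise_for_status()
        for item in resp.json().get("vulnerabilities", []):
            found.add(item["cve"]["id"])
    return sorted(found)
```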
Some notes:
Closing this (and all the other leftover gsoc ideas from previous years) in order to help folk focus on the new project idea descriptions.
cve-bin-tool: Improved product representation & meta-info about products.
Project description
We have cases where a product name is ambiguous or doesn't map cleanly to a single vendor/product entry (e.g. json-parser), or cases where a product has changed names/CPE designations and needs more than one (looking at our other checkers, this happens fairly frequently in the linux package data), and we may also want to allow users to add data to improve scans. An example of where this gets messy:
The text-parsing library Beautiful Soup is available in pip as beautifulsoup4, on Debian-based systems as python3-bs4, and on Fedora-based systems as python-beautifulsoup4. If we were detecting this package in all those formats, it would be useful to have metadata that told us they're all the same product.
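A sketch of what that metadata record could look like for this example (the canonical identity below is a placeholder, not checked against real CPE data):

```python
# Illustrative synonym record: the same component under its ecosystem-specific
# names, resolving to one canonical identity (placeholder, not real CPE data).
BEAUTIFULSOUP_NAMES = {
    "pypi": "beautifulsoup4",
    "debian": "python3-bs4",
    "fedora": "python-beautifulsoup4",
}


def is_beautifulsoup(ecosystem, package_name):
    """True if this ecosystem-specific name refers to the same component."""
    return BEAUTIFULSOUP_NAMES.get(ecosystem) == package_name
```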
Skills
Difficulty level
Project Length
GSoC Participants Only
This issue is a potential project idea for GSoC 2023, and is reserved for completion by a selected GSoC contributor. Please do not work on it outside of that program. If you'd like to apply to do it through GSoC, please start by reading #2230 .