Closed: terriko closed this issue 6 months ago.
I'll also note that we talked a bunch about this issue with respect to the gsoc project we'd hoped to have on metadata:
This would be a lesser form of what we envisioned there, just laying the groundwork to allow for adding manual metadata rather than trying to make use of other data sources.
(If GSoC happens again next year that project idea may be offered again; we didn't have enough mentors available to take on someone to do it this year.)
Per discussion in #3193 and in today's team meeting:
PURL apparently has the ability to let us indicate that something came from python to help us de-dupe. @anthonyharrison is looking into parsing PURL data out of SBOMs, but that won't directly solve this problem because these scan results are from our language parsers and not from an SBOM scan.
But if we're going to support PURL well, it seems like we're going to want a de-dupe table that would look something like this:

- `purl` (which includes the language parser name and the product name)
- `{vendor, product}` mappings

And that looks pretty similar to what we had above, only we had `{product, language}` columns. I think maybe we could convert those to PURL internally right now?
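For illustration, converting an existing `{product, language}` pair into a version-less purl string might look like the sketch below. The language-to-purl-type mapping is my assumption, following the known type names in purl-spec (`pypi`, `cargo`, `npm`); the function name is hypothetical.

```python
# Hypothetical sketch: convert our {product, language} pairs into purl strings.
# The language -> purl-type names below follow purl-spec's known types list;
# anything not in the map falls back to the language name itself.
LANGUAGE_TO_PURL_TYPE = {
    "python": "pypi",
    "rust": "cargo",
    "javascript": "npm",
}


def product_to_purl(product: str, language: str) -> str:
    """Build a version-less purl like 'pkg:pypi/docutils'."""
    purl_type = LANGUAGE_TO_PURL_TYPE.get(language, language)
    return f"pkg:{purl_type}/{product}"


print(product_to_purl("docutils", "python"))  # pkg:pypi/docutils
```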
So I think if we make a table that could handle both (I'm going to call it `explicit_product_mapping` for now), it would look something like this...
```sql
CREATE TABLE IF NOT EXISTS explicit_product_mapping (
    purl TEXT,
    valid_cpe_list TEXT,
    invalid_cpe_list TEXT,
    PRIMARY KEY (purl)
)
```
Note that I made the cpe mappings into lists, because that allows us to use PURL as an index for easier lookups. I could likely be convinced that there's something better we could be doing.
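As a quick sketch of how that table might behave, here's the schema exercised with Python's `sqlite3`, storing the CPE lists as JSON-encoded TEXT so `purl` can stay the primary key. The JSON encoding is one assumption among several possible; only the table and column names come from the proposal above.

```python
import json
import sqlite3

# Sketch only: exercise the proposed explicit_product_mapping schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS explicit_product_mapping (
        purl TEXT,
        valid_cpe_list TEXT,
        invalid_cpe_list TEXT,
        PRIMARY KEY (purl)
    )"""
)

# Store the CPE lists as JSON text (an assumption; any encoding would do).
conn.execute(
    "INSERT INTO explicit_product_mapping VALUES (?, ?, ?)",
    (
        "pkg:pypi/docutils",
        json.dumps([]),
        json.dumps(["cpe:2.3:a:nim-lang:docutils"]),
    ),
)

# Look up by purl, which the primary key indexes for us.
row = conn.execute(
    "SELECT invalid_cpe_list FROM explicit_product_mapping WHERE purl = ?",
    ("pkg:pypi/docutils",),
).fetchone()
print(json.loads(row[0]))  # ['cpe:2.3:a:nim-lang:docutils']
```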
Looking at https://github.com/package-url/purl-spec right now, our python de-dupe for the existing open bugs would look like...

- purl: `pkg:pypi/docutils` (not sure if that should be `pypi` or `python`, but we'd look it up)
- valid_cpe_list: `{}`
- invalid_cpe_list: `{"cpe:2.3:a:nim-lang:docutils"}`
Note that I'm truncating all the version data for now because I don't feel like typing out extra stuff. We might want to use `*` instead of truncating in practice. (Adding `*` might make it easier to parse and make the things we use more correct and extensible.)
So when we then go scan something named docutils from a requirements.txt file, it will drop all the nim-lang
results. For now, we probably want to err on the side of matching, so if someone adds a new docutils we should display it until it's added to the de-dupe list. That might not be the case if we actually do know what the valid CPE should be; we can tweak that if we start getting some of those in future.
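The "err on the side of matching" behaviour above could sketch out like this: results whose CPE is on the invalid list for a purl get dropped, while anything unknown (including a brand-new docutils) is kept and shown to the user. The function name and in-memory table shape here are hypothetical, not the real cve-bin-tool API.

```python
# Hypothetical sketch of the de-dupe filter described above.
def filter_results(purl, results, dedupe_table):
    """Drop known-invalid CPEs for this purl; keep everything unknown."""
    entry = dedupe_table.get(purl)
    if entry is None:
        # purl not in the de-dupe table at all: err on the side of matching.
        return results
    invalid = set(entry.get("invalid_cpe_list", []))
    return [cpe for cpe in results if cpe not in invalid]


table = {
    "pkg:pypi/docutils": {
        "valid_cpe_list": [],
        "invalid_cpe_list": ["cpe:2.3:a:nim-lang:docutils"],
    }
}
results = ["cpe:2.3:a:nim-lang:docutils", "cpe:2.3:a:newvendor:docutils"]
print(filter_results("pkg:pypi/docutils", results, table))
# ['cpe:2.3:a:newvendor:docutils']  (the new vendor is shown until triaged)
```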
Then the other side of this is how we want to store, update, and load the data into the database where we'll actually use it.
Adding some notes on my current architecture:
- `purl2cpe` won't solve our false positive problem completely, because the problem often occurs with things that just don't have a purl2cpe mapping (and we'd likely fall back on our current heuristic then). So we need a purl2NotCPE mapping, and we should probably maintain it ourselves here.
- `purl2cpe` uses yml files in their source tree to solve the "humans need to be able to do pull requests against this" problem, and having played around with some json schemas, I think I agree that .yml is a little more human-friendly.

So my current thinking of how I'm going to break this up:
Number 2 is still kind of a doozy, so I might need to break it down some more.
I think at this point this has been replaced by the planned work in #3771 so I'll close this as a duplicate.
I've now hit two cases where `find_vendor` is finding a product with the same name but different version numbers.
I think it's time for us to build in some de-duplication in cases like this where we're clearly generating false positives for folk.
Since these are currently all coming from the language parsers, I think the logical place to start is in extending the language parsers' `find_version()` function, found here: https://github.com/intel/cve-bin-tool/blob/main/cve_bin_tool/parsers/__init__.py
Right now, it uses cvedb's `get_vendor_product_pairs` to search for a product name and return all matches. That works pretty well in a lot of cases, and people can always mark the ones that aren't correct as false positives using triage. But that's a pain, and in both these cases we know that we're not finding the right thing, because we know we're looking for a python package. So it would be really nice if we could have `find_version()` say "look, here's a list of known duplicate product names, let's discard them before the user even sees them."

I'm imagining a file per language, so you'd have a set of files like `python-dedupe.json` and `rust-dedupe.json`, each with different entries. (I don't love those filenames, but something that included the language and was in a human-editable format would be good. JSON is probably the best balance of human-editable and machine-readable for our user base.) An entry is going to need the following data:
- a NOT list: matches that should be discarded when `get_vendor_product_pairs` finds them, as in the issues linked at the top of this post
- an ARE list: the known-correct `{vendor, product}` pairs associated with them

Presumably we'd have some entries with only NOT lists and some would have only ARE lists, so you wouldn't require both.
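A per-language file could then look something like the sketch below. This is only an illustration: the `"not"`/`"are"` field names and the product-keyed layout are my assumptions, not a settled format.

```python
import json

# Hypothetical shape for a python-dedupe.json entry: a "not" list of
# {vendor, product} pairs to discard and an "are" list of known-good ones.
# Either list may be omitted for a given product.
python_dedupe = json.loads(
    """
    {
        "docutils": {
            "not": [{"vendor": "nim-lang", "product": "docutils"}],
            "are": []
        }
    }
    """
)

entry = python_dedupe["docutils"]
# Build a set of (vendor, product) pairs that should be dropped.
bad = {(p["vendor"], p["product"]) for p in entry.get("not", [])}
print(("nim-lang", "docutils") in bad)  # True
```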
You'd load this structure into somewhere (right into the db for easy lookup? I don't think we want to load/parse on every `find_version()` call) and use it to streamline what `find_version()` then returns.

Thoughts? Better ideas? I'm going to tag @XDRAGON2002 specifically since he laid the groundwork for our current parser API, but everyone's thoughts are welcome.