Spike around mapping CVEs to individual packages of an ecosystem

krishnapaparaju commented 6 years ago

Quite a bit of energy has been spent evaluating different sources for CVEs. In each case the OSIO analytics platform was blocked either by the license terms of the data source or by inaccurate mapping of CVEs in NVD to individual packages of an ecosystem.

Discussions with the product security team confirmed that there is no 'global database for CVEs covering all the ecosystems' that is accurate enough to flag a package in the OSIO IDE.

With all the knowledge accumulated so far, this spike would manually / semi-automatically map CVEs to individual packages of an ecosystem. The end result of this spike would be to understand (a) the practicality of mapping CVEs for major ecosystems (Java, .NET, etc.) and (b) the effort involved.

@GeorgeActon @msrb

pkajaba commented 6 years ago

I had an idea about asking Fedora/RHEL package maintainers for help on this one. These people usually know about CVEs and can surely match a CVE to a component.

@humaton told me that package maintainers are at the end of the CVE announcement chain and that there are people who might know better. He could elaborate more on this.

humaton commented 6 years ago

Yes, people in the SRT dept should have the information much sooner; they are the people who create the CVE bugzillas for package maintainers.

jpopelka commented 6 years ago

First I've been trying to make sure there's no better option than some (semi-)manual mapping.

There are several publicly available databases which use ecosystem(platforms):package-name (not CPE):

Snyk's vulndb - we were investigating this previously, but abandoned it because they had stopped updating data
OSS Index - we currently use it for npm and nuget
Pyup's safety-db - python only, data seem to be incomplete
VictimsDB - mostly java-centric

The problem with these is that they don't cover all ecosystems. We also have to rely on the maintainers keeping the data up to date, which unfortunately is mostly not the case.

Another option is to use directly NVD or tools/services which use NVD as source of data:

OWASP Dependency Check - we currently use it for maven and python. Maven is mostly OK, python not so much.
CVE Details - nice web, but no API
cve-search - imports data from NVD into local MongoDB instance in order to perform local search. We were using this previously, but abandoned it due to CPE to ecosystem:package (EP) mapping issues.
CIRCL cve-search - a cve-search instance maintained by the Computer Incident Response Center Luxembourg. See more here. It has a nice API where one can search CVEs per vendor/product (i.e. CPE).

These are reliable sources because they get their data from NVD. The problem, however, is that they identify packages by CPE (Common Platform Enumeration), and a CPE can't be directly mapped to the ecosystem:package scheme we use.

Examples:

Ecosystem(Platform):Package	CPE (vendor:product)
maven:org.asynchttpclient:async-http-client	async-http-client_project:async-http-client
maven:commons-fileupload:commons-fileupload	apache:commons_fileupload
maven:ch.qos.logback:logback-classic	logback:logback
maven:org.webjars.npm:moment	moment_project:moment
maven:org.opencms:org.opencms.workplace.administration	alkacon:opencms
pypi:mako	makotemplates:mako
pypi:google-appengine	google:app_engine_python_sdk
pypi:zope2	zope:zope
pypi:moin	moinmoin:moinmoin
pypi:requests	python-requests:requests && python:requests
pypi:salt	saltstack:salt
npm:geddy	geddyjs:geddy
npm:libyaml	pyyaml:libyaml
npm:dojo	dojotoolkit:dojo
nuget:System.Net.Http	microsoft:system.net.http
nuget:adplug	audacious_media_player_team:adplug

In an ideal world the (NVD assigned) CPE would be listed somewhere in the metadata of each pip/npm/maven/nuget/etc. package. Then one could easily and reliably search NVD and all the tools which use NVD data. There seem to have already been some initiatives to add CPEs to package metadata in Debian and Gentoo, but that's not very useful for us.

Closest to what we want to achieve seems to be OWASP Dependency Check, which uses data from NVD and does some "guess-work" to search the vendor/product indexed data when given an ecosystem/package. The question at the moment is whether we are able to do better "guess-work" or whether we want to somehow (semi-)manually create the EP to CPE mapping for each EP.
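
To make the "guess-work" concrete, here is a minimal sketch of one naive heuristic. It is not how OWASP Dependency Check works; it only assumes that the product part of a CPE often resembles some part of the package name. The function names are hypothetical, and the vendor:product pairs are copied from the examples table above.

```python
import re

# A few vendor:product pairs taken from the examples in this thread.
KNOWN_CPES = [
    "apache:commons_fileupload",
    "logback:logback",
    "moment_project:moment",
    "makotemplates:mako",
    "google:app_engine_python_sdk",
]


def normalize(name):
    """Lowercase and collapse '-', '_' and '.' so spelling variants compare equal."""
    return re.sub(r"[-_.]+", "_", name.lower())


def candidate_names(ep):
    """Return normalized name candidates for an 'ecosystem:package' string."""
    _ecosystem, _, package = ep.partition(":")
    candidates = set()
    # Maven coordinates are 'group:artifact'; try each part and its last dotted segment.
    for part in package.split(":"):
        candidates.add(normalize(part))
        candidates.add(normalize(part.rsplit(".", 1)[-1]))
    return candidates


def guess_cpes(ep, known_cpes=KNOWN_CPES):
    """Return known vendor:product pairs whose product equals a candidate name."""
    names = candidate_names(ep)
    return [cpe for cpe in known_cpes if normalize(cpe.split(":", 1)[1]) in names]


print(guess_cpes("maven:commons-fileupload:commons-fileupload"))
print(guess_cpes("maven:ch.qos.logback:logback-classic"))
print(guess_cpes("pypi:google-appengine"))  # no match: the product is app_engine_python_sdk
```

The last call returns an empty list, which is exactly the kind of miss that pure name matching cannot fix and where a curated mapping would be needed.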

CCing @pombredanne who initiated https://github.com/nexB/vulnerablecode project and might have better overview.

pombredanne commented 6 years ago

@jpopelka pong! You have built a fairly comprehensive list there. I could add a few ones for Debian, Ubuntu, Rubygems and FWIW I met with the cve-search folks a few times and we have plans to collaborate in this domain.

This ticket nails it: reliably mapping a Package (e.g. what you call an EP) to a CVE/vulnerability is NOT a solved problem and is the essence of any package vulnerability alerting system. There is unfortunately no decent or comprehensive open data source for this.

To build reliable mappings I have been thinking of a few approaches:

infer from existing relationships: given a package data source with explicit mapping to a vulnerability (e.g. RedHat, Ubuntu, Debian and a few more), use this to get a corresponding CPE if it exists, and eventually generalize this to related packages (such as upstream or downstream)
parse CVEs/bulletins/etc: parse and match CVE texts and collect CPEs to align them with packages (using multiple heuristics for matching more or less reliably)
parse changelogs and bug/issues trackers to identify CVEs and "security-related" references and use these to align to CPEs (or directly to vulnerabilities)
using any of the above, and by clustering related packages together, eventually infer more CPEs or CVE mappings
take these inferred mappings and store explicit mappings in a DB
finally, have a community curation/review web system such that these mappings can be reviewed and confirmed as correct
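
As a rough illustration of the changelog/bulletin parsing step above, the sketch below pulls CVE identifiers out of free text with a regular expression. The function name is hypothetical and the changelog snippet is made up; only the identifier format (CVE-year-number) is taken as given.

```python
import re

# CVE identifiers look like 'CVE-<4-digit year>-<4 or more digits>'.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cve_ids(text):
    """Return distinct CVE ids found in text, upper-cased, in order of first appearance."""
    seen = []
    for match in CVE_RE.finditer(text):
        cve_id = match.group(0).upper()
        if cve_id not in seen:
            seen.append(cve_id)
    return seen


# Made-up changelog text, for illustration only.
changelog = """
1.2.1: security fix (CVE-2016-7103)
1.2.0: new features; see also cve-2016-7103 for the earlier report
"""
print(extract_cve_ids(changelog))
```

This only finds explicit identifiers. Spotting "security-related" entries that never received a CVE would need additional heuristics and review.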

Building this mapping is IMHO something that would have immense value for openshift, vulnerablecode, any CIRT and security team, the FLOSS community at large and anyone using a FLOSS package... And it is eventually the primary goal of the tiny, fledgling, not 1/100th baked-yet vulnerablecode project.

This is also an immense effort where collaboration would make a lot of sense.

(note: I am talking vulnerabilities at large here and not only CVEs as a vulnerability may not always exist in the NVD especially when inferred from changelogs and bug trackers)

(note: there is an opportunity to define a "universal" package identifier beyond CPEs that could be used reliably across tools, similar to what you use for maven:org.webjars.npm:moment or what the new grafeas project lists as "Resource URLs")
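
A minimal sketch of what such an identifier could look like in code, assuming nothing more than the ecosystem:package convention used in this thread. The `PackageId` type and `parse_package_id` function are hypothetical names, not an existing format or library.

```python
from typing import NamedTuple


class PackageId(NamedTuple):
    """An 'ecosystem:name' identifier such as maven:org.webjars.npm:moment."""

    ecosystem: str
    name: str

    def __str__(self):
        return f"{self.ecosystem}:{self.name}"


def parse_package_id(text):
    """Split on the first colon only; the name may itself contain colons (maven)."""
    ecosystem, sep, name = text.partition(":")
    if not sep or not ecosystem or not name:
        raise ValueError(f"not an ecosystem:package identifier: {text!r}")
    # Assumption: ecosystem names are case-insensitive, package names are kept verbatim.
    return PackageId(ecosystem.lower(), name)


pid = parse_package_id("maven:org.webjars.npm:moment")
print(pid.ecosystem, pid.name)
```

Keeping the ecosystem as an explicit field is what distinguishes this from a CPE: the same name in npm and in maven stays two different packages.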

CCing @adulau from CIRCL and cve-search and @PidgeyL from cve-search

jpopelka commented 6 years ago

I guess Github folks are tackling similar questions when implementing their Security alerts

pombredanne commented 6 years ago

@jpopelka you wrote:

I guess Github folks are tackling similar questions when implementing their Security alerts

So we should find out who handles this at GH so we can work things out together?

andrew commented 6 years ago

@pombredanne I believe @benjam has a contact at GH that we're talking to about this, and about linking with Libraries.io data

jpopelka commented 6 years ago

I've been looking at how to more specifically obtain some EP (ecosystem/package-manager:package) to CPE mappings.

A) downstreams

a) Debian has a list of CPEs along with a deb package name. In the following example I'm getting CPEs for python packages that Debian ships:

$ svn checkout svn://svn.debian.org/svn/secure-testing
$ grep python secure-testing/data/CPE/list

python-bottle;cpe:/a:bottlepy:bottle
python-cherrypy;cpe:/a:cherrypy:cherrypy
python-cjson;cpe:/a:dan_pascu:python-cjson
...

For python it's easy, because they have 'python' prefix or suffix. The same applies to java whose packages have '-java' suffix in Debian. But there are only 40 python CPEs and 20 CPEs for java, which is hardly worth mentioning.
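
A small sketch of how the list could be turned into a lookup table, assuming every line has the `package;cpe:/a:vendor:product` shape shown in the grep output above (the rest of the file was not checked). The function names are hypothetical.

```python
def parse_cpe_list(lines):
    """Map Debian package name -> (vendor, product) from 'package;cpe:/a:vendor:product' lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or ";" not in line:
            continue  # skip blanks and anything not in the expected shape
        package, cpe = line.split(";", 1)
        parts = cpe.split(":")
        # Expect at least ['cpe', '/a', vendor, product]; ignore anything else.
        if len(parts) < 4 or parts[0] != "cpe":
            continue
        mapping[package] = (parts[2], parts[3])
    return mapping


def python_packages(mapping):
    """Keep entries whose Debian name has a 'python' prefix or suffix."""
    return {
        pkg: cpe
        for pkg, cpe in mapping.items()
        if pkg.startswith("python") or pkg.endswith("python")
    }


sample = [
    "python-bottle;cpe:/a:bottlepy:bottle",
    "python-cherrypy;cpe:/a:cherrypy:cherrypy",
    "python-cjson;cpe:/a:dan_pascu:python-cjson",
]
print(python_packages(parse_cpe_list(sample)))
```

The Debian package name still has to be translated to a pypi name (for example by stripping the `python-` prefix), which is another guess that would need checking.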

b) Gentoo keeps package metadata in metadata.xml files in https://github.com/gentoo/gentoo . Some of them contain a CPE, but I haven't found any python/java/etc. packages with a CPE.

c) RedHat errata tool - CPEs in errata are used to identify products (like rhel-7) rather than individual components, so it's unlikely to be a useful CPE source. example (probably not accessible outside RedHat VPN)

d) RedHat (or other distros) bugzilla - we can search for CVEs in, for example, python- or nodejs- packages and then get the corresponding CPEs from NVD:
https://bugzilla.redhat.com/buglist.cgi?quicksearch=vulnerability+nodejs
https://bugzilla.redhat.com/buglist.cgi?quicksearch=vulnerability+python
But that also gives only a few CVEs and doesn't work for java (leaving aside nuget), whose packages can't be easily identified from the name.

B) existing vulnerability databases, which identify package-managers

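from collections import defaultdict

def build_cpe_mappings(dbs, get_cpes):
    """dbs: iterable of {(package_manager, package): [CVE ids]} dicts (assumed shape);
    get_cpes: callable returning the CPEs for a CVE id, e.g. backed by NVD or cve-search."""
    cpe_mappings = defaultdict(set)
    for db in dbs:  # e.g. pyupio-safety-db, victimsdb, ossindex, snyk-vulndb
        for (package_manager, package), cves in db.items():
            for cve in cves:
                for cpe in get_cpes(cve):
                    # keep only CPEs that mention the package name (case-insensitive)
                    if package.lower() in cpe.lower():
                        cpe_mappings[f"{package_manager}:{package}"].add(cpe)
    return cpe_mappings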

Are there any legal issues ? I'd say that as long as our cpe_mappings is open source there shouldn't be any.

C) References in CVEs

See this example. It contains a CPE, and several references point to a github repository. If we get an EP (ecosystem/package-manager:package) from the repo, then we'll have the EP to CPE mapping. There are a few possibilities for discovering which package manager the repository is upstream for:

language details in github repository
readme often contains instructions how to install the package, for example [npm/pip] install <package>
fabric8-analytics tracks upstream url to each analyzed EP component so we can check if there's any match.

This probably works more often for python/nodejs and less often for java/nuget, whose sources are not so often on github. Other than github, the references point to various mailing lists and bug/issue trackers. In these cases we can't say what to look for or how to decide whether the component is shipped by any package manager.
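
The readme idea can be sketched as follows: scan the text for `npm install <package>` or `pip install <package>` and turn each hit into an EP candidate. This is a naive illustration with hypothetical names; it assumes the command and the package name are on one line and treats pip's `-r`/`-e` options as "not a package name".

```python
import re

# '<tool> install [flags] <package>' on a single line.
INSTALL_RE = re.compile(
    r"\b(?P<tool>npm|pip)[ \t]+install[ \t]+"
    r"(?P<flags>(?:-[\w-]+[ \t]+)*)"
    r"(?P<package>[\w@][\w./@-]*)"
)
ECOSYSTEM = {"npm": "npm", "pip": "pypi"}
# pip options whose argument is a file or path, not a package name.
VALUE_FLAGS = {"-r", "--requirement", "-e", "--editable"}


def guess_eps_from_readme(readme_text):
    """Return 'ecosystem:package' candidates for install commands found in the text."""
    found = []
    for match in INSTALL_RE.finditer(readme_text):
        if VALUE_FLAGS.intersection(match.group("flags").split()):
            continue
        # Drop a trailing full stop picked up from prose such as 'pip install mako.'
        package = match.group("package").rstrip(".")
        ep = f"{ECOSYSTEM[match.group('tool')]}:{package}"
        if package and ep not in found:
            found.append(ep)
    return found


readme = """Install with:

    pip install -U mako.

or `npm install --save moment`
pip install -r requirements.txt
"""
print(guess_eps_from_readme(readme))
```

A readme may of course mention install commands for dependencies rather than for the project itself, so the result is only a candidate to be confirmed against the registry.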

D) Other options ?

CPE Dictionary - is only a list of CPEs without any context. Also, I have no idea why I can't find some of the existing CPEs there; for example, a search for logback doesn't find anything.

At the moment it's hard to tell how complete our mapping could be with this approach, what could possibly be automated and how much effort it'd take.

pombredanne commented 6 years ago

@jpopelka you wrote

At the moment it's hard to tell how complete our mapping could be with this approach, what could possibly be automated and how much effort it'd take.

Same conclusion here: there is not a single way and we eventually need all these ways with some crowdsourced/community curation to get something decent.

Other data sources to consider

processing and parsing package changelogs: they may have CVE references
Oval when available
RH rpm 2 cve: https://www.redhat.com/security/data/metrics/rpm-to-cve.xml which is a clear package->cve mapping
refmaps https://cve.mitre.org/data/refs/refmap/allrefmaps.zip and some CVE referenced there have package details such as this one: https://access.redhat.com/errata/RHSA-2017:0161 following this path
debian
and many more

For instance with refmaps: CVE (https://nvd.nist.gov/vuln/detail/CVE-2016-7103)
-> CPE from CVE -> errata for CVE through refmap -> get and parse errata to get actual packages
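
The chain above is essentially a join over three lookups. A minimal sketch, assuming the three tables have already been extracted from NVD, the refmaps and the errata pages (that extraction is not shown). The function name is hypothetical; the CVE and erratum ids are the ones mentioned above, while the CPE and package name are invented placeholders.

```python
from collections import defaultdict


def join_cpes_to_packages(cve_to_cpes, cve_to_errata, errata_to_packages):
    """Follow CVE -> CPE and CVE -> erratum -> packages, returning package -> set of CPEs."""
    package_to_cpes = defaultdict(set)
    for cve, cpes in cve_to_cpes.items():
        for erratum in cve_to_errata.get(cve, []):
            for package in errata_to_packages.get(erratum, []):
                package_to_cpes[package].update(cpes)
    return dict(package_to_cpes)


# Placeholder data: the CPE and the package name below are invented.
cve_to_cpes = {"CVE-2016-7103": ["cpe:/a:example_vendor:example_product"]}
cve_to_errata = {"CVE-2016-7103": ["RHSA-2017:0161"]}
errata_to_packages = {"RHSA-2017:0161": ["example-package"]}

print(join_cpes_to_packages(cve_to_cpes, cve_to_errata, errata_to_packages))
```

Since one erratum can cover several CVEs and several packages, the join yields candidate pairs rather than confirmed mappings, which is where the community curation step comes in.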

So there is no one solution: only a big semi structured and messy graph of stuff... and a big problem we all have.... ready to be solved!

pombredanne commented 6 years ago

@jpopelka you wrote:

Are there any legal issues ? I'd say that as long as our cpe_mappings is open source there shouldn't be any.

Well, there are legal issues, as each source may have specific license terms, and sometimes software license terms applied to data are hard to handle. See https://github.com/victims/victims-cve-db/issues/25 for instance. Some, like the pyup data, prohibit commercial usage, whatever this means.

So nothing is a straightforward slam dunk (sigh)

pombredanne commented 6 years ago

Another issue is if the data are current or not: see for instance https://github.com/snyk/vulnerabilitydb/issues/16#issuecomment-338619059 which makes the snyk data source mostly harmless.

Though they have an RSS feed https://snyk.io/vuln/feed.xml and IMHO a peculiar interpretation of the AGPL license which is a poor license for data anyway:

Snyk's Vulnerability DB RSS feed. This DB (feed and repository) is licensed under the AGPL-v3 license, which often allows use internally, but prohibits embedding the DB in another product or service, unless that product and provided service are open source and under the AGPL-v3 license. For a different license to Snyk's vulnerability DB, please contact us at contact@snyk.io

jpopelka commented 6 years ago

I agree with your licensing point of view. A few notes on the other ideas:

processing and parsing package changelogs: they may have CVE references

Great idea, but I'm afraid the vulnerabilities are in many cases fixed before they get a CVE identifier.

Oval when available

I'm still failing to understand what data we could possibly get from it. From what I checked, I haven't seen anything more than what's in NVD.

refmaps and downstream (RH) erratas

Only a tiny fraction of existing components are shipped downstream as packages in linux distros (RH, Debian, etc.) so I expect only samples here.

Also, could you roughly explain to me if/how the existing cve-search/VIA4CVE fits into the picture we currently have ? Thank you.

pombredanne commented 6 years ago

@jpopelka

Re: parsing changelogs, I am not making this up and this is not my idea... this can be for CVEs AND other security issue indicators. And I wish things were fixed before a CVE is published, but that is often not the case. This is likely in part one approach of pyup: https://github.com/pyupio/changelogs , and I know this is one approach of other non-FLOSS tools. The same applies to parsing/making sense of CVE body texts when there is no CPE reference.

Re: Oval and errata and other structured sources: the interesting bit of data that is NOT in CVEs is the actual package names (and therefore direct references to CVEs and/or CPEs), e.g. similar to the packages tab on https://access.redhat.com/errata/RHSA-2017:0484 .... Even if partial and "sample-grade", IMHO every little bit of data can help to make this mapping more automated and to unwind the ball of twine one thread at a time.

Re: https://github.com/cve-search/via4cve this is a codebase with a lot of domain knowledge for aggregating vulnerability data. It does not provide the mapping you and I care for, but it already has a nice set of data sources, with python code that could then be mined.

goern commented 6 years ago

a question on another level... will OSIO provide a GA service so that other projects can query the database, eg ask OSIO for all CVE related to package XYZ?

pombredanne commented 6 years ago

@goern wrote

a question on another level... will OSIO provide a GA service so that other projects can query the database, eg ask OSIO for all CVE related to package XYZ?

I cannot say whether there is such a plan for OSIO, but on my side this is the exact goal of https://github.com/nexB/vulnerablecode as a fledgling project ;) Having this in OSIO would be awesome!

openshiftio / openshift.io

Spike around mapping CVEs to individual packages of an ecosystem #1052