intel / cve-bin-tool

The CVE Binary Tool helps you determine if your system includes known vulnerabilities. You can scan binaries for over 200 common, vulnerable components (openssl, libpng, libxml2, expat and others), or if you know the components used, you can get a list of known vulnerabilities associated with an SBOM or a list of components and versions.
https://cve-bin-tool.readthedocs.io/en/latest/
GNU General Public License v3.0

Add XML Schema validation to places we use XML #1507

Closed (terriko closed this 2 years ago)

terriko commented 2 years ago

@anthonyharrison and I had a quick discussion about doing schema validation of XML before it's loaded. We've got a few places where we load XML that should have known schemas: the new java package parser added in the PR above and the SBOM parser.

It looks like xmlschema will work for what we need.

Since the schemas we're talking about shouldn't change very frequently (unlike the NVD data), we probably want to cache them and may want to consider putting them directly into the cve-bin-tool package so they're immediately available to users.

Note that XML schema validation is commonly used as a defense-in-depth measure for safe XML parsing. The defusedxml parser we're using already fails on malformed data and has built-in protection against a number of known XML attacks. Adding schema validation would still add another layer of protection against malformed data and improve error messages for the user.

anthonyharrison commented 2 years ago

@terriko I have been looking at this now and I have uncovered a few challenges!

SPDX doesn't currently have a schema for XML documents! I have raised an issue (https://github.com/spdx/spdx-spec/issues/615), but obviously it will take time for an 'official' schema to be issued. I have looked at generating one to be used in the interim (and will propose it as the initial schema).

The CycloneDX schema references another schema. I am still trying to work out how to get this to work locally (I am trying to avoid modifying the 'official' schema) so that the validation can work in offline mode. The validation works when connected to the internet because it can access the referenced schema through the referenced URL.

The SWID schema contains entities, which the XML validator rejects because it protects against entity-based attacks: https://xmlschema.readthedocs.io/en/latest/features.html#xml-entity-based-attacks-protection

It looks like this might not be a quick fix, but it might be worth getting the initial framework set up and maybe only offering validation when operating in 'connected' mode initially for CycloneDX and SPDX (with my initial schema). Validation of SWID documents might have to be left for another day (but to be honest I haven't come across many SWID documents, so this might not be too much of an inconvenience).
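To see why a schema such as CycloneDX's only validates when connected, it can help to list which of its references point at the network. The following is a rough sketch using only the standard library; the `remote_schema_references` helper and the sample schema with its example.com URL are hypothetical, and it assumes the schema file is one we ship ourselves (for untrusted input a hardened parser such as defusedxml would be the safer choice).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

XS = "{http://www.w3.org/2001/XMLSchema}"


def remote_schema_references(xsd_text):
    """List schemaLocation values on xs:import/xs:include/xs:redefine that are http(s) URLs."""
    root = ET.fromstring(xsd_text)
    remote = []
    for tag in ("import", "include", "redefine"):
        for elem in root.iter(XS + tag):
            location = elem.get("schemaLocation")
            if location and urlparse(location).scheme in ("http", "https"):
                remote.append(location)
    return remote


# Made-up schema: one remote import and one local include.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:import namespace="urn:example:other"
             schemaLocation="https://example.com/schemas/other.xsd"/>
  <xs:include schemaLocation="local-types.xsd"/>
</xs:schema>"""

print(remote_schema_references(SAMPLE_XSD))
```

Any location this reports would need a local copy alongside the main schema before validation can run offline.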

anthonyharrison commented 2 years ago

UPDATE

I have now created an SPDX schema and shared it with the SPDX community.

I have created a local copy of the CycloneDX schema to include the SPDX schema file which was being referenced. Will add a readme file to explain what I did so it can be replicated for future updates of the schema.

I have created a SWID schema to try and get around the entity problem. It works with the test file but needs more examples to make sure it works in all cases.

Validating POM files seems to be particularly problematic, with the majority failing validation in the tests I have been doing. This was because the XSD file that I downloaded from Apache was malformed (lots of unclosed tags in the descriptions). I manually changed the schema so that it now works.

I have added a --disable-validation-check flag to override the schema check. This caters for the case where an SBOM uses a different version of the schema (e.g. version 1.2) from the one the file is being validated against (e.g. version 1.3), or where you simply want to stop the scan being terminated due to a validation failure. A future enhancement might be to support multiple versions of a schema and see if the file validates against any version.
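As a rough illustration of the flag and of the possible multi-version enhancement, here is a sketch using argparse. Only the `--disable-validation-check` name comes from the comment above; the parser, the helper functions, and the toy per-version validators are assumptions and do not reflect how cve-bin-tool actually wires this up.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing only the flag discussed above."""
    parser = argparse.ArgumentParser(prog="scan-sketch")
    parser.add_argument(
        "--disable-validation-check",
        action="store_true",
        help="skip XML schema validation of input files",
    )
    return parser


def first_matching_version(document, validators):
    """Return the first version label whose validator accepts the document, else None.

    `validators` maps a version label to a callable returning True when valid.
    """
    for version, is_valid in validators.items():
        if is_valid(document):
            return version
    return None


def should_load(document, validators, disable_validation_check):
    """Load when validation is disabled or when any known schema version matches."""
    if disable_validation_check:
        return True
    return first_matching_version(document, validators) is not None


# Toy validators standing in for real per-version schema checks.
toy_validators = {
    "1.3": lambda doc: 'version="1.3"' in doc,
    "1.2": lambda doc: 'version="1.2"' in doc,
}

args = build_parser().parse_args(["--disable-validation-check"])
document = '<bom version="1.1"/>'
print(should_load(document, toy_validators, args.disable_validation_check))
```

In a real implementation each callable could wrap a schema object for one version, so a version 1.2 file would still be accepted when 1.3 is the default.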

The schema validation now works in offline mode.

Hope to push all of the files shortly once the quality checks have been completed.

terriko commented 2 years ago

This is great, thank you!