kibook / s1kd-tools

A set of small, free and open source software tools for manipulating S1000D data.
https://khzae.net/1/s1000d/s1kd-tools
GNU General Public License v3.0

s1kd-validate - Not working with network entities? #4

Closed by MihailCosmin 4 years ago

MihailCosmin commented 4 years ago

Hi,

I am trying to validate some DMs against their schema, and I am getting the error below:

s1kd-validate DMC-***.XML
s1kd-validate: ERROR: Attempt to load network entity http://www.s1000d.org/S1000D_4-2/xml_schema_flat/descript.xsd
s1kd-validate: ERROR: Failed to locate the main schema resource at 'http://www.s1000d.org/S1000D_4-2/xml_schema_flat/descript.xsd'.

Does the schema validator only work with local schemas?

kibook commented 4 years ago

@MihailCosmin

All of the s1kd-tools will refuse to make network connections by default, as a precaution against XML vulnerabilities like external entity attacks.

If you trust the network resources, you can enable network connections with the --net option to any of the tools:

$ s1kd-validate --net DMC-***.XML

You might want to consider using local copies of the schemas, though. The s1kd-validate tool temporarily keeps each schema it reads over the network in memory, so if you validate 100 descript DMs in one go, you won't actually request http://www.s1000d.org/S1000D_4-2/xml_schema_flat/descript.xsd 100 times from the S1000D people's server. Even so, reading a static schema over the internet adds a lot of unnecessary overhead.

It's even worse with the S1000D 5.0 schemas, since they reference the standard W3C XML schema, and requests to http://www.w3.org/2001/xml.xsd are intentionally delayed by several seconds, specifically to discourage people from using the schema over the network. See https://www.w3.org/Help/Webmaster#slowdtd.

There's no need to change the schema URL in the DM in order to use a local copy. You can use either an XML catalog or the -d/--schemas option of s1kd-validate.

An XML catalog looks something like this:

<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteURI
    uriStartString="http://www.s1000d.org"
    rewritePrefix="/usr/share/xml/s1000d/schemas"/>
</catalog>

That will redirect all requests for a URI starting with "http://www.s1000d.org" to the local directory "/usr/share/xml/s1000d/schemas". The directory structure should reflect the structure of the URL, so the full local path to the descript schema might be: /usr/share/xml/s1000d/schemas/S1000D_4-2/xml_schema_flat/descript.xsd
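
To make the mapping concrete, here is a minimal shell sketch of the prefix substitution that the rewriteURI entry describes. It only mimics the rule with plain string handling and does not call libxml2. The variable names are made up for illustration, and the values are the ones from the catalog above.

```shell
#!/bin/sh
# Illustrative only: mimic the rewriteURI rule with plain string substitution.
# The variable names are hypothetical; the values come from the catalog above.
uri_start='http://www.s1000d.org'
rewrite_prefix='/usr/share/xml/s1000d/schemas'
schema_url='http://www.s1000d.org/S1000D_4-2/xml_schema_flat/descript.xsd'

case "$schema_url" in
    "$uri_start"*)
        # Drop the matched start string and prepend the local prefix.
        local_path="${rewrite_prefix}${schema_url#"$uri_start"}"
        ;;
    *)
        # No match: this catalog entry leaves the URL untouched.
        local_path="$schema_url"
        ;;
esac

echo "$local_path"
```

Everything after the matched start string is kept as-is, which is why the local directory has to mirror the path part of the URL.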

libxml2, the underlying XML parser used by the s1kd-tools, will read the XML catalog from some standard locations, like /etc/xml/catalog: http://xmlsoft.org/catalog.html#Simple

You can explicitly specify a catalog (or catalogs) to use with the XML_CATALOG_FILES environment variable:

$ XML_CATALOG_FILES=mycatalog.xml s1kd-validate DMC-***.XML
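
If you validate often, it can be more convenient to write the catalog once and export the variable for the whole shell session. The following is a sketch under a few assumptions: the file name mycatalog.xml and the schema directory are just the example values used above, and the catalog path should not contain spaces, because XML_CATALOG_FILES is read as a space-separated list.

```shell
#!/bin/sh
# Sketch: write the example catalog to a file and point libxml2 at it.
# "mycatalog.xml" and the schema directory are example names, not defaults.
catalog_file="$PWD/mycatalog.xml"

cat > "$catalog_file" <<'EOF'
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteURI
    uriStartString="http://www.s1000d.org"
    rewritePrefix="/usr/share/xml/s1000d/schemas"/>
</catalog>
EOF

# Export once so every later s1kd-* call in this shell session sees it.
# Several catalogs can be listed, separated by spaces.
XML_CATALOG_FILES="$catalog_file"
export XML_CATALOG_FILES

# Then validate as usual:
# s1kd-validate DMC-***.XML
```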

s1kd-validate also has its own alternative to XML catalogs, which is the -d/--schemas option. This option takes the path to a directory containing the local copies of the schemas. If you only need one issue of S1000D, you can just place the .xsd files in a single directory:

$ tree schemas
schemas/
|_ descript.xsd
|_ proced.xsd
...

$ s1kd-validate -d schemas DMC-***.XML

If you use multiple issues of S1000D, you just need to structure the schemas directory similarly to the XML catalog example above:

$ tree schemas
schemas/
|_ S1000D_4-1/
  |_ xml_schema_flat/
    |_ descript.xsd
    |_ proced.xsd
    ...
|_ S1000D_4-2/
  |_ xml_schema_flat/
    |_ descript.xsd
    |_ proced.xsd
    ...

$ s1kd-validate -d schemas DMC-***.XML
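
Before validating a large set of DMs, it can help to check that the local copy each DM needs is actually present. The sketch below is not part of the s1kd-tools and the function names are hypothetical. It assumes the DM declares its schema with xsi:noNamespaceSchemaLocation on a single line, and that the schemas directory mirrors the URL path after http://www.s1000d.org/ as in the multi-issue layout above.

```shell
#!/bin/sh
# Sketch of a pre-flight check; the function names are hypothetical and are
# not part of the s1kd-tools.

# Print the schema URL a data module declares (first match only).
# Assumes the attribute and its value sit on one line.
schema_url_of() {
    sed -n 's/.*xsi:noNamespaceSchemaLocation="\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Print where that schema is expected in a multi-issue schemas directory,
# assuming the layout mirrors the URL path after http://www.s1000d.org/.
local_schema_path() {
    printf '%s/%s\n' "$1" "${2#http://www.s1000d.org/}"
}

# Report whether the local copy for one data module is present.
# $1 = schemas directory, $2 = data module file
check_dm() {
    url=$(schema_url_of "$2")
    if [ -z "$url" ]; then
        echo "no schema declared: $2"
        return 1
    fi
    path=$(local_schema_path "$1" "$url")
    if [ -f "$path" ]; then
        echo "found: $path"
    else
        echo "missing: $path"
        return 1
    fi
}
```

After sourcing these functions, calling check_dm with the schemas directory and a DM file name prints either a "found" or a "missing" line and returns non-zero when the local copy is absent.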

MihailCosmin commented 4 years ago

Yes, that did it. I need to read the s1kd-tools documentation more :)

And yes, you are probably right: validating against local schemas should always be the first option.