Investigate additional Schematron Rules for GeekoDoc

tomschr commented 7 years ago

In openSUSE/suse-doc-style-checker#117, I raised the question of whether a Schematron schema could be useful for SDSC. The same question can be asked for GeekoDoc as well.

A Schematron schema can be used in two ways:

Embedded Schematron rules are embedded inside the RNG schema.
Separate Schematron rules are collected outside in a different file (extension .sch). They are independent of the existing GeekoDoc RNG.

The validation procedure would be different:

Validation with embedded Schematron rules: Schematron validation would be an integral part of the process. In other words, the rule-based validation would always run after the structural validation. The two cannot be separated.
Validation with a separate Schematron schema: Validation would be step-wise. The first step would always be the structural validation with RNG. If wanted (or needed), additional validation can then be performed with Schematron. The two validation processes can be run separately.
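The step-wise variant could look roughly like the following sketch. It assumes lxml is available; the tiny RNG and Schematron schemas are hypothetical stand-ins for GeekoDoc and for a future geekodoc .sch file, and the `validate` helper is an illustrative name, not an existing API.

```python
from lxml import etree, isoschematron

# Hypothetical, heavily simplified stand-in for the GeekoDoc RNG schema.
RNG = etree.fromstring(b"""
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="procedure">
      <oneOrMore>
        <element name="step"><text/></element>
      </oneOrMore>
    </element>
  </start>
</grammar>""")

# Hypothetical separate Schematron schema (what a .sch file might contain).
SCH = etree.fromstring(b"""
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="procedure">
      <assert test="count(step) &gt; 1">A procedure needs more than one step.</assert>
    </rule>
  </pattern>
</schema>""")


def validate(doc, with_rules=True):
    """Return (structure_ok, rules_ok); rules_ok is None if step 2 is skipped."""
    # Step 1: structural validation with RELAX NG, always performed.
    structure_ok = etree.RelaxNG(RNG).validate(doc)
    if not structure_ok or not with_rules:
        return structure_ok, None
    # Step 2: optional rule-based validation with Schematron.
    return structure_ok, isoschematron.Schematron(SCH).validate(doc)


doc = etree.fromstring(b"<procedure><step>Only one step</step></procedure>")
print(validate(doc, with_rules=False))
print(validate(doc))
```

The document is structurally valid, so skipping the second step lets it pass; running the second step reports the single-step procedure.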

Rick Jelliffe, the inventor of Schematron, describes the language as "a feather duster to reach the parts other schema languages cannot reach". ;-)

Benefits

Additional checks which cannot be expressed by RNG.
Relationship conditions don't need to be checked in SDSC.
Structural quality checks (Are there any lonely sections? Procedures with a single step?)
Conformance checks (IDs should adhere to a certain pattern?)
Schematron validation step can be optional or imperative depending on our definition of validation.
Additional validation step can be included into DAPS gradually.

Schematron Versions

Currently, there are two versions of Schematron:

ISO Schematron (published May 2006): the de facto standard of Schematron. It uses the new namespace http://purl.oclc.org/dsdl/schematron.
Schematron 1.5 (published 2001): the older version, whose reference implementation is pure XSLT. The namespace is http://www.ascc.net/xml/schematron.
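Since the two versions differ in their namespace, a schema file can be classified by looking at its root element. The following is a minimal sketch using only the standard library; the function name `schematron_flavor` is made up for illustration, and the Schematron 1.5 URI in the mapping is the namespace URI as it appears in Schematron 1.5 schemas.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the root element -> Schematron version.
FLAVORS = {
    "http://purl.oclc.org/dsdl/schematron": "ISO Schematron",
    "http://www.ascc.net/xml/schematron": "Schematron 1.5",
}


def schematron_flavor(schema_xml):
    """Guess the Schematron version from the namespace of the root element."""
    root = ET.fromstring(schema_xml)
    # ElementTree spells qualified names as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace = root.tag[1:].split("}", 1)[0]
    else:
        namespace = ""
    return FLAVORS.get(namespace, "unknown")


print(schematron_flavor('<schema xmlns="http://purl.oclc.org/dsdl/schematron"/>'))
```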

Tools

Schematron validation is supported by:

xmllint with the --schematron option.
The Python library lxml, see http://lxml.de/validation.html#id2
Jing supports Schematron 1.5. The implementation is partly XSLT and partly Java.

Personal

From my perspective, I prefer the separate Schematron schema (assuming all is possible, feasible, or useful). It seems, this doesn't introduce too many changes and gives greater flexibility.

I see it more as a "conformance and consistency" check rather than a hard validation. Of course, the rules shouldn't bother our writers too much.

Maybe we should also (re?)think about our definition of "validity/validation".

--

Update: List of Checks

Hard Rules

[ ] Import/check against the rules from docbook.sch (upstream DocBook).
[x] Check for spaces in xml:id
[x] Check for more than 1 step inside a procedure.
[x] Check for more than 1 member inside a simplelist.

Soft Rules

[ ] Check for more than 1 listitem inside orderedlist or itemizedlist.
[ ] Check for more than 1 varlistentry inside variablelist.
[ ] Check if you have more than 10 steps inside a procedure.
[ ] Check for a title inside admonition elements (note, tip, warning).
[ ] Check for specific rules following xml:id attributes.
[ ] Check for lonely sections(?)

@sknorr I've separated the discussion in SDSC from the GeekoDoc aspect. Feel free to comment. :)

ghost commented 7 years ago

I guess adding this to GeekoDoc might be the better idea for the time being...

For an idea of what we could do with Schematron directly in GeekoDoc, see: https://github.com/openSUSE/suse-xsl/issues/222 . There is quite a number of cases associated with table markup and you generally notice those issues currently when going the step from FO->PDF because FOP balks.

This is also not really style checker territory because it really leads to hard errors that are not caught by current validation methods. Then again, if we have more such cases, we could move some checks from the style checker to GeekoDoc.

tomschr commented 7 years ago

DocBook >= 5.0 also ships some (ISO) Schematron files, see /usr/share/xml/docbook/schema/sch/5.1/docbook.sch. For example, it checks whether a footnote contains another footnote as a child.

However, it seems, oXygen is not that happy with the schema. It shows this error message:

cvc-complex-type.3.2.2: Attribute 'name' is not allowed to appear in element 's:pattern'.

This is the respective line:

<s:pattern name="Glossary 'firstterm' type constraint">

which should be corrected like this:

<s:pattern>
    <s:title>Glossary 'firstterm' type constraint</s:title>

ghost commented 7 years ago

The tools side of Schematron seems to be interesting ...

jing supports Schematron 1.5 (with some limitations, according to toms); toms says he does not really want to use the older version of the standard that is supported there
libxml (i.e. xmllint) has (some) Schematron 1.5 support [which is not mentioned in the man page]
lxml has ISO Schematron support (written in Python, needs a small wrapper, there is active development, provides Schematron->XSLT conversion based on reference implementation but no native Schematron implementation)
ph-schematron supports ISO Schematron but would be a new tool (written in Java, seems like there is active development, provides Schematron->XSLT conversion or native Schematron implementation) -- seems like our best shot
Probatron supports ?? (basically dead, but there are lots of forked projects on GitHub)

Websites related to Schematron are also interesting: they seem to either show lots of 404 errors (schematron.com has a working front page but all sub pages 404), lead to ad farms (Rick Jelliffe's home page with the reference implementation, Probatron), or advertise proprietary software (Oxygen, XML Buddy, Topologi).

I am starting to think that investing in Schematron at this point might not be such a good idea.

[edit 1, sknorr: libxml does have Schematron 1.5 support but it is not mentioned in the man page.] [edit 2, sknorr: lxml has ISO Schematron support which I overlooked initially.]

tomschr commented 7 years ago

libxml (i.e. xmllint, xsltproc & lxml) do not support Schematron

Actually, this is not quite true. There is the option --schematron. However, as far as I can see, you can only use Schematron 1.5 with that. So in a way, you can say libxml "supports" Schematron---although I wouldn't say nicely.

I wouldn't consider this a valid alternative...

tomschr commented 7 years ago

I think the best approach would be to write a wrapper in Python using lxml library. This library supports ISO Schematron.

A quick test reveals some nice features:

from lxml import isoschematron
from lxml import etree

# Load the Schematron schema and build a validator from it:
sch_doc = etree.parse("geekodoc5.sch")
schematron = isoschematron.Schematron(sch_doc)

# Parse our DocBook5 source:
doc = etree.parse("foo.xml")
schematron.validate(doc)
# => False

print(schematron.error_log)
# => Prints an extensive error log (XML) which can be parsed

I think this can easily be turned into a small Python "Schematron validation script". ;-)
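As a rough idea of what the core of such a script might do, the sketch below turns the SVRL report that lxml can store into readable (location, message) pairs. The helper name `failed_asserts` and the in-memory schema and document are assumptions for illustration; a real script would call etree.parse() on the schema file and the DocBook source instead.

```python
from lxml import etree, isoschematron

SVRL = {"svrl": "http://purl.oclc.org/dsdl/svrl"}


def failed_asserts(schema, doc):
    """Validate doc and return (location, message) pairs for failed asserts."""
    schematron = isoschematron.Schematron(schema, store_report=True)
    schematron.validate(doc)
    report = schematron.validation_report  # SVRL result document
    results = []
    for node in report.xpath("//svrl:failed-assert", namespaces=SVRL):
        text = node.findtext("svrl:text", default="", namespaces=SVRL)
        # Collapse whitespace so each message fits on one line.
        results.append((node.get("location"), " ".join(text.split())))
    return results


# Hypothetical in-memory stand-ins for the schema file and the source file.
schema = etree.fromstring(b"""
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="procedure">
      <assert test="count(step) &gt; 1">A procedure needs more than one step.</assert>
    </rule>
  </pattern>
</schema>""")
doc = etree.fromstring(b"<procedure><step>Only one step</step></procedure>")

for location, message in failed_asserts(schema, doc):
    print(f"{location}: {message}")
```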

tomschr commented 7 years ago

[...] I am starting to think that investing in Schematron at this point might not be such a good idea.

Yes, I can understand that you get this impression. I've recently discovered this 404 page as well. Not sure why this isn't available anymore. Nevertheless, I don't think it is that bad. As I've shown in my earlier post, it can be used in lxml, with some minimal scripting efforts.

All in all, I don't think we should abandon Schematron at this stage. Of course, if lxml reveals technical problems, we will need to think again.

tomschr commented 7 years ago

Apart from my last comment, we should add specific rules depending on GeekoDoc and our styleguide.

Definitions

I suggest distinguishing between "hard" and "soft" rules:

Hard rules are "must have" rules; if the result is false, these rules issue an error and abort the validation.
Soft rules are recommendations. They issue informative messages, but neither break nor abort the validation.
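One possible way to model this distinction, sketched here under the assumption that lxml's default behavior is used (only failed assert elements count as errors, while report elements are merely recorded): hard rules become assert, soft rules become report. The schema, the `check` helper, and the `procedure` helper are illustrative only, and this sketch does not abort on the first hard error.

```python
from lxml import etree, isoschematron

SVRL = {"svrl": "http://purl.oclc.org/dsdl/svrl"}

SCHEMA = etree.fromstring(b"""
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <ns prefix="d" uri="http://docbook.org/ns/docbook"/>
  <pattern>
    <rule context="d:procedure">
      <!-- hard rule: a failed assert makes validate() return False -->
      <assert test="count(d:step) &gt; 1">A procedure needs more than one step.</assert>
      <!-- soft rule: a report is recorded but does not fail validation -->
      <report test="count(d:step) &gt; 10">This procedure has more than 10 steps.</report>
    </rule>
  </pattern>
</schema>""")


def check(doc):
    """Return (hard_rules_ok, list_of_soft_rule_messages)."""
    schematron = isoschematron.Schematron(SCHEMA, store_report=True)
    ok = schematron.validate(doc)
    hints = schematron.validation_report.xpath(
        "//svrl:successful-report/svrl:text", namespaces=SVRL)
    return ok, [" ".join("".join(h.itertext()).split()) for h in hints]


def procedure(steps):
    """Build a DocBook procedure with the given number of empty steps."""
    body = "<step/>" * steps
    return etree.fromstring(
        f'<procedure xmlns="http://docbook.org/ns/docbook">{body}</procedure>')


print(check(procedure(1)))
print(check(procedure(11)))
```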

Hard Rules

Import/check against the rules from docbook.sch (upstream DocBook).
Check for more than 1 step inside a procedure.
Check for more than 1 listitem inside orderedlist or itemizedlist.
Check for more than 1 varlistentry inside variablelist.
Check for more than 1 member inside a simplelist.

Soft Rules

Check if you have more than 10 steps inside a procedure.
Check for a title inside admonition elements (note, tip, warning).
Check for specific rules following xml:id attributes.
Check for lonely sections(?)

I am probably missing other rules.

ghost commented 7 years ago

toms wrote...

Check for more than 1 listitem inside orderedlist or itemizedlist.

Check for more than 1 varlistentry inside variablelist.

Both of those rules are good ways to make our "documentation updates" sections fail validation... :/

tomschr commented 7 years ago

Both of those rules are good ways to make our "documentation updates" sections fail validation... :/

Ahh, right! Ok, we could move these from hard to soft rules. I am just trying to collect some examples...

ghost commented 7 years ago

As I said somewhere above: within tables, counting the actual columns vs. the columns set up via colspec would be great. And there are more issues concerning tables that should make validation fail but don't, such as bad column name references.

We could also check for spaces in ID attributes, such as in e.g. xml:id=" foo.bar" which will also go through current validation unhindered but fail when building HTML or PDF.
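A rule for the ID check might look like the following sketch, again assuming lxml's ISO Schematron support. The schema and the helper names are hypothetical. The test element is built programmatically rather than parsed, so the sketch does not depend on how a particular parser treats unusual xml:id values.

```python
from lxml import etree, isoschematron

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SCHEMA = etree.fromstring(b"""
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="*[@xml:id]">
      <assert test="not(contains(@xml:id, ' '))">An xml:id must not contain spaces.</assert>
    </rule>
  </pattern>
</schema>""")


def section(id_value):
    """Build a DocBook section element carrying the given xml:id value."""
    element = etree.Element("{http://docbook.org/ns/docbook}section")
    element.set(XML_ID, id_value)
    return element


def ids_ok(doc):
    """Return True if no xml:id in doc contains a space."""
    return isoschematron.Schematron(SCHEMA).validate(doc)


print(ids_ok(section(" foo.bar")))
print(ids_ok(section("foo.bar")))
```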

These would also give us added value as opposed to reimplementing something that is already covered by SDSC.

tomschr commented 7 years ago

counting the column numbers of tables v/ within colspec would be great. And there are more issues concerning tables that should make validation fail but don't: such as the column name references etc.

Well, we could check if the value of @cols and the number of colspec elements are the same. That is easy. Also checking column name references shouldn't be too hard. I'll add that into our list.

However, tables can get complicated when cells that span several columns or rows are involved.
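The two simple table checks could be sketched as follows, assuming lxml. The first assert follows the suggestion literally (number of colspec elements equals @cols), which is stricter than the table model itself if colspec elements are allowed to be omitted. Spanning cells are not handled, and the schema and the `table` helper are hypothetical.

```python
from lxml import etree, isoschematron

SCHEMA = etree.fromstring(b"""
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <ns prefix="d" uri="http://docbook.org/ns/docbook"/>
  <pattern>
    <rule context="d:tgroup">
      <assert test="count(d:colspec) = @cols">The number of colspec elements must match @cols.</assert>
    </rule>
    <rule context="d:entry[@colname]">
      <assert test="@colname = ancestor::d:tgroup[1]/d:colspec/@colname">An entry refers to an undeclared column name.</assert>
    </rule>
  </pattern>
</schema>""")


def table(cols, ref):
    """Build a table with two colspec elements, @cols=cols, and one entry using colname=ref."""
    return etree.fromstring(f"""
<table xmlns="http://docbook.org/ns/docbook">
  <tgroup cols="{cols}">
    <colspec colname="c1"/>
    <colspec colname="c2"/>
    <tbody>
      <row><entry colname="{ref}">a</entry><entry>b</entry></row>
    </tbody>
  </tgroup>
</table>""")


def table_ok(doc):
    """Return True if both table rules hold for doc."""
    return isoschematron.Schematron(SCHEMA).validate(doc)


print(table_ok(table(2, "c1")))
print(table_ok(table(3, "c1")))
print(table_ok(table(2, "c9")))
```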

We could also check for spaces in ID attributes

Great idea!

These would also give us added value as opposed to reimplementing something that is already covered by SDSC.

But don't we want to move these parts into the Schematron schema?

tomschr commented 7 years ago

Moved the list of checks into original description.

tomschr commented 7 years ago

Following up on https://github.com/openSUSE/geekodoc/issues/6#issuecomment-263288127, I've tried to create a script which can validate against our (yet to be defined) Schematron schema. In the long run, the script can be integrated into daps (if not, it was a good exercise :grinning: ).

@sknorr: For a first draft, see https://github.com/openSUSE/schvalidator
