openSUSE / geekodoc

RELAX NG Schema for SUSE Documentation
https://opensuse.github.io/geekodoc
GNU General Public License v3.0

Investigate additional Schematron Rules for GeekoDoc #6

Open tomschr opened 7 years ago

tomschr commented 7 years ago

In openSUSE/suse-doc-style-checker#117, I raised the question of whether a Schematron schema could be useful for SDSC. The same question can be asked for GeekoDoc as well.

A Schematron schema can be used in two ways:

The validation procedure would be different:

Rick Jelliffe, the inventor of Schematron, describes the language as "a feather duster to reach the parts other schema languages cannot reach". ;-)

Benefits

Schematron Versions

Currently, there are two versions of Schematron:

Tools

Schematron validation is supported by:

See also

Personal

From my perspective, I prefer the separate Schematron schema (assuming all of this is possible, feasible, or useful). It does not seem to introduce too many changes, and it gives us greater flexibility.

I see it more as a "conformance and consistency" check rather than a hard validation. Of course, the rules shouldn't bother our writers too much.

Maybe we should also (re?)think about our definition of "validity/validation".

--

Update: List of Checks

Hard Rules

Soft Rules


@sknorr I've separated the discussion in SDSC from the GeekoDoc aspect. Feel free to comment. :)

ghost commented 7 years ago

I guess adding this to GeekoDoc might be the better idea for the time being...

For an idea of what we could do with Schematron directly in GeekoDoc, see: https://github.com/openSUSE/suse-xsl/issues/222 . There are quite a number of cases associated with table markup, and currently you generally only notice those issues at the step from FO to PDF, because FOP balks.

This is also not really style checker territory because it really leads to hard errors that are not caught by current validation methods. Then again, if we have more such cases, we could move some checks from the style checker to GeekoDoc.

tomschr commented 7 years ago

DocBook >= 5.0 also ships some (ISO) Schematron files, see /usr/share/xml/docbook/schema/sch/5.1/docbook.sch. For example, it checks whether a footnote contains another footnote child.

However, it seems oXygen is not that happy with the schema. It shows this error message:

cvc-complex-type.3.2.2: Attribute 'name' is not allowed to appear in element 's:pattern'.

This is the respective line:

<s:pattern name="Glossary 'firstterm' type constraint">

which should be corrected like this:

<s:pattern>
    <s:title>Glossary 'firstterm' type constraint</s:title>

ghost commented 7 years ago

The tools side of Schematron seems to be interesting ...

Websites related to Schematron are also interesting: they seem to either show lots of 404 errors (schematron.com has a working front page but all sub pages 404), lead to ad farms (Rick Jelliffe's home page with the reference implementation, Probatron) or advertise proprietary software (Oxygen, XML Buddy, Topologi).

I am starting to think that investing in Schematron at this point might not be such a good idea.

[edit 1, sknorr: libxml does have Schematron 1.5 support but it is not mentioned in the man page.] [edit 2, sknorr: lxml has ISO Schematron support which I overlooked initially.]

tomschr commented 7 years ago

libxml (i.e. xmllint, xsltproc & lxml) do not support Schematron

Actually, this is not quite true. There is the option --schematron. However, as far as I can see, you can only use Schematron 1.5 with that. So in a way, you can say libxml "supports" Schematron, although I wouldn't say nicely.

I wouldn't consider this a valid alternative...

tomschr commented 7 years ago

I think the best approach would be to write a wrapper in Python using the lxml library. This library supports ISO Schematron.

A quick test reveals some nice features:

from lxml import isoschematron
from lxml import etree

# Compile the Schematron schema into a validator:
sch_doc = etree.parse("geekodoc5.sch")
schematron = isoschematron.Schematron(sch_doc)

# Parse our DocBook5 source:
doc = etree.parse("foo.xml")
schematron.validate(doc)
# => False

print(schematron.error_log)
# => Prints an extensive error log (XML) which can be parsed

I think this can easily be turned into a small Python "Schematron validation script". ;-)
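
To make that a bit more concrete, here is a minimal sketch of what such a script could look like. The inline schema, the sample document, and the function name run_schematron are made up for illustration and are not actual GeekoDoc rules. It assumes lxml's isoschematron module and passes store_report=True so that the SVRL report can be queried for failed assertions:

```python
from lxml import etree, isoschematron

SVRL_NS = {"svrl": "http://purl.oclc.org/dsdl/svrl"}

# Hypothetical example rule (not part of GeekoDoc): forbid nested footnotes.
SCHEMA = b"""<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <ns prefix="d" uri="http://docbook.org/ns/docbook"/>
  <pattern>
    <title>No nested footnotes</title>
    <rule context="d:footnote">
      <assert test="not(.//d:footnote)">footnote must not contain another footnote</assert>
    </rule>
  </pattern>
</schema>"""

# Sample DocBook 5 snippet that violates the rule above.
DOC = b"""<article xmlns="http://docbook.org/ns/docbook">
  <para>Text<footnote><para>Outer<footnote><para>Inner</para></footnote></para></footnote></para>
</article>"""


def run_schematron(schema_root, doc_root):
    """Validate doc_root and return (is_valid, [(location, message), ...])."""
    schematron = isoschematron.Schematron(schema_root, store_report=True)
    is_valid = schematron.validate(doc_root)
    # The SVRL report lists every failed assertion with its XPath location.
    failed = schematron.validation_report.xpath(
        "//svrl:failed-assert", namespaces=SVRL_NS
    )
    messages = [
        (f.get("location"), " ".join("".join(f.itertext()).split()))
        for f in failed
    ]
    return is_valid, messages


if __name__ == "__main__":
    valid, problems = run_schematron(etree.fromstring(SCHEMA), etree.fromstring(DOC))
    print("valid:", valid)
    for location, message in problems:
        print(location, "->", message)
```

A real script would read the schema and the document from files (for example with etree.parse) and return a non-zero exit code when validation fails.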

tomschr commented 7 years ago

[...] I am starting to think that investing in Schematron at this point might not be such a good idea.

Yes, I can understand that you get this impression. I've recently discovered this 404 page as well. Not sure why this isn't available anymore. Nevertheless, I don't think it is that bad. As I've shown in my earlier post, it can be used in lxml with minimal scripting effort.

All in all, I don't think we should abandon Schematron at this stage. Of course, if lxml reveals some technical problems, we will need to think again.

tomschr commented 7 years ago

Apart from my last comment, we should add specific rules depending on GeekoDoc and our styleguide.

Definitions

I would suggest distinguishing between "hard" and "soft" rules:

Hard Rules

Soft Rules

I have probably missed other rules.

ghost commented 7 years ago

toms wrote...

  • Check for more than 1 listitem inside orderedlist or itemizedlist.
  • Check for more than 1 varlistentry inside variablelist.

Both of those rules are good ways to make our "documentation updates" sections fail validation... :/

tomschr commented 7 years ago

Both of those rules are good ways to make our "documentation updates" sections fail validation... :/

Ahh, right! Ok, we could move these from hard to soft rules. I am just trying to collect some examples...

ghost commented 7 years ago

As I said somewhere above: within tables, comparing the actual number of columns with the columns set up via colspec would be great. And there are more issues concerning tables that should make validation fail but don't, such as bad column name references.

We could also check for spaces in ID attributes, such as in e.g. xml:id=" foo.bar" which will also go through current validation unhindered but fail when building HTML or PDF.
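
As a rough illustration of that ID check, here is a plain-Python sketch that uses only the standard library. It is not a Schematron rule, and the function name ids_with_whitespace is hypothetical. It simply collects every xml:id value that contains whitespace:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the fixed XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def ids_with_whitespace(xml_text):
    """Return all xml:id values in xml_text that contain whitespace."""
    root = ET.fromstring(xml_text)
    bad = []
    for element in root.iter():
        value = element.get(XML_ID)
        if value is not None and any(ch.isspace() for ch in value):
            bad.append(value)
    return bad


if __name__ == "__main__":
    sample = (
        '<article xmlns="http://docbook.org/ns/docbook" xml:id="art.ok">'
        '<para xml:id=" foo.bar">Text</para>'
        "</article>"
    )
    print(ids_with_whitespace(sample))
```

In a Schematron schema, the same idea could presumably be written as a single XPath assertion on @xml:id.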

These would also give us added value as opposed to reimplementing something that is already covered by SDSC.

tomschr commented 7 years ago

counting the column numbers of tables v/ within colspec would be great. And there are more issues concerning tables that should make validation fail but don't: such as the column name references etc.

Well, we could check if the value of @cols and the number of colspec elements are the same. That is easy. Also checking column name references shouldn't be too hard. I'll add that into our list.

However, tables can get complicated when cell or row spans are involved.
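
A sketch of how the two simple table checks might look as ISO Schematron rules, run through lxml. The schema, the sample table, and the function name table_problems are assumptions for illustration only. Spans (namest/nameend, morerows) are deliberately ignored here:

```python
from lxml import etree, isoschematron

SVRL_NS = {"svrl": "http://purl.oclc.org/dsdl/svrl"}

# Hypothetical rules: @cols must match the colspec count, and
# entry/@colname must refer to a colspec declared in the same tgroup.
TABLE_RULES = b"""<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <ns prefix="d" uri="http://docbook.org/ns/docbook"/>
  <pattern>
    <title>Table column consistency</title>
    <rule context="d:tgroup[d:colspec]">
      <assert test="count(d:colspec) = @cols">cols does not match the number of colspec elements</assert>
    </rule>
    <rule context="d:entry[@colname]">
      <assert test="@colname = ancestor::d:tgroup[1]/d:colspec/@colname">colname refers to an undeclared column</assert>
    </rule>
  </pattern>
</schema>"""

# Sample table: cols says 3, but only two colspecs exist, and "c9" is undeclared.
BAD_TABLE = b"""<informaltable xmlns="http://docbook.org/ns/docbook">
  <tgroup cols="3">
    <colspec colname="c1"/>
    <colspec colname="c2"/>
    <tbody>
      <row><entry colname="c1">a</entry><entry colname="c9">b</entry></row>
    </tbody>
  </tgroup>
</informaltable>"""


def table_problems(doc_bytes):
    """Return the messages of all failed table assertions."""
    schematron = isoschematron.Schematron(
        etree.fromstring(TABLE_RULES), store_report=True
    )
    schematron.validate(etree.fromstring(doc_bytes))
    failed = schematron.validation_report.xpath(
        "//svrl:failed-assert", namespaces=SVRL_NS
    )
    return [" ".join("".join(f.itertext()).split()) for f in failed]


if __name__ == "__main__":
    for message in table_problems(BAD_TABLE):
        print(message)
```

Both assertions rely on XPath 1.0 comparison semantics: comparing a node set with "=" is true as soon as any node in the set matches.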

We could also check for spaces in ID attributes

Great idea!

These would also give us added value as opposed to reimplementing something that is already covered by SDSC.

But don't we want to move these parts into the Schematron schema?

tomschr commented 7 years ago

Moved the list of checks into original description.

tomschr commented 7 years ago

Following up on https://github.com/openSUSE/geekodoc/issues/6#issuecomment-263288127, I've tried to create a script which can validate our (yet to be defined) Schematron schema. In the long run, the script can be integrated into daps (if not, it was a good exercise :grinning: ).

@sknorr: For a first draft, see https://github.com/openSUSE/schvalidator