openSUSE / suse-doc-style-checker

Style Checker for SUSE Documentation

Investigate whether Schematron is useful for SDSC #117

Closed tomschr closed 7 years ago

tomschr commented 7 years ago

During a question and answer session around KIWI, I was able to take a closer look at Schematron. :bulb:

This issue is mainly for the record so we don't forget it. It also has a high "geekiness factor", so be warned! :smile:

Questions

  1. Should we use Schematron in our checks (in the future)?
  2. Should we replace our XSLT checks with Schematron rules?
  3. Is it possible? Feasible?
  4. Sidenote: should we use Schematron rules for GeekoDoc?

Background and Terminology

  • Rule-based schema languages (Schematron) work by »making assertions about the presence or absence of patterns in XML trees« (Wikipedia). They are mostly used to verify data interdependencies (co-constraints), check data cardinality, and perform algorithmic checks (definitions taken from xfont.com):
      • A co-constraint is a dependency between data within an XML document or across XML documents.
      • Cardinality refers to the presence or absence of data.
      • An algorithmic check determines data validity by performing an algorithm on the data.

    Basically, the above items are all things that grammar-based schema languages can't do.

  • Grammar-based schema languages (DTD, XSD, RNG) specify the structure and contents of elements and attributes in an XML instance document. Their focus is more on parent-child relationships.

At the moment, no schema language combines both types.
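To make the three kinds of rule-based checks above a bit more concrete, here is a minimal sketch in plain Python (standard library only). It is not Schematron and not how SDSC works; the sample document, the function name `check_document`, and the three rules (xref targets, step count, title length) are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the content and the rules below are invented.
SAMPLE = """\
<article>
  <title>Installing and configuring the style checker on every supported platform</title>
  <figure id="fig.overview"><title>Overview</title></figure>
  <para>See <xref linkend="fig.overview"/> and <xref linkend="fig.missing"/>.</para>
  <procedure>
    <step><para>Run the installer.</para></step>
  </procedure>
</article>
"""


def check_document(root, max_title_words=8):
    """Return a list of problem messages for the given element tree root."""
    problems = []

    # Co-constraint: every xref/@linkend depends on some @id elsewhere in the document.
    known_ids = {el.get("id") for el in root.iter() if el.get("id")}
    for xref in root.iter("xref"):
        target = xref.get("linkend")
        if target not in known_ids:
            problems.append("co-constraint: xref points to unknown id %r" % target)

    # Cardinality: a procedure should contain at least two steps.
    for proc in root.iter("procedure"):
        steps = len(proc.findall("step"))
        if steps < 2:
            problems.append(
                "cardinality: procedure has %d step(s), expected at least 2" % steps
            )

    # Algorithmic check: compute the word count of each title and compare to a limit.
    for title in root.iter("title"):
        words = len((title.text or "").split())
        if words > max_title_words:
            problems.append(
                "algorithmic: title has %d words, limit is %d" % (words, max_title_words)
            )

    return problems


for problem in check_document(ET.fromstring(SAMPLE)):
    print(problem)
```

None of these three rules fits naturally into a grammar that only describes which elements may appear inside which parent, which is the gap rule-based languages aim at.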

Schematron Validation Workflow

The original Schematron implementation is based on XSLT. It validates a document in the following steps:

  1. Extract embedded Schematron rules from XML Schema or RELAX NG schema or read in a Schematron schema.
  2. Process inclusions.
  3. Process abstract patterns.
  4. Compile the Schematron schema to XSLT.
  5. Apply the compiled stylesheet from the last step to the XML document.

Any errors are represented in ISO SVRL (Schematron Validation Report Language).
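As a rough idea of what consuming such a report could look like, the sketch below reads `svrl:failed-assert` elements from an SVRL document with Python's standard library. The report is hand-written, not produced by a real validation run: the rule texts, roles, and locations are invented, and `read_failed_asserts` is a hypothetical helper. The `role` attribute is optional in SVRL, so falling back to "error" when it is missing is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

SVRL_NS = {"svrl": "http://purl.oclc.org/dsdl/svrl"}

# Hand-written sample report; tests, roles, and locations are invented.
SAMPLE_REPORT = """\
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:active-pattern/>
  <svrl:fired-rule context="procedure"/>
  <svrl:failed-assert test="count(step) &gt;= 2" role="warning"
                      location="/article[1]/procedure[1]">
    <svrl:text>A procedure should contain at least two steps.</svrl:text>
  </svrl:failed-assert>
  <svrl:fired-rule context="xref"/>
  <svrl:failed-assert test="@linkend = //@id" role="error"
                      location="/article[1]/para[1]/xref[2]">
    <svrl:text>xref points to an unknown id.</svrl:text>
  </svrl:failed-assert>
</svrl:schematron-output>
"""


def read_failed_asserts(svrl_text):
    """Return one dict per svrl:failed-assert found in the report."""
    root = ET.fromstring(svrl_text)
    findings = []
    for failed in root.findall("svrl:failed-assert", SVRL_NS):
        text_el = failed.find("svrl:text", SVRL_NS)
        message = (text_el.text or "").strip() if text_el is not None else ""
        findings.append({
            # Assumption: a failed assert without a role counts as an error.
            "role": failed.get("role", "error"),
            "location": failed.get("location"),
            "test": failed.get("test"),
            "message": message,
        })
    return findings


for finding in read_failed_asserts(SAMPLE_REPORT):
    print("%(role)s at %(location)s: %(message)s" % finding)
```

Note that the location is an XPath into the validated document, not a file name and line number. This sketch also ignores `svrl:successful-report` elements.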

Benefits

The first step, extracting embedded Schematron rules, is especially interesting: we could move some (basic) structural checks or other co-constraints to GeekoDoc.

This could simplify the style checker. In addition, when validating with GeekoDoc plus embedded Schematron rules, writers would get the results of these additional checks earlier.

Surely, not everything would be possible or feasible. The question is whether we want it. However, I consider that out of scope for this issue (we should open a separate GH issue for GeekoDoc).

Tools Support

DocBook 5 provides Schematron rules, either embedded in the schema or as a separate file. See /usr/share/xml/docbook/schema/sch/5.1/docbook.sch.


@sknorr I know this is a huge issue and probably hard to digest. It's one of those crazy ideas of mine. :grinning: This is a long-term goal, of course.

Let me know what you think. :-)

ghost commented 7 years ago

Using Schematron is of course a possibility. However,

  1. There is no acute issue that makes SDSC's XSLT parts inadequate. (At least none that Schematron would solve.)
  2. Even with Schematron, all the text-based checks are sophisticated enough that you'd need extension functions to realize them (I think). That presumes that extension functions are even an option in Schematron. If extension functionality is not possible with Schematron, using it would mean using three technologies for checks (XSLT+Python+Schematron).
  3. We would be stuck with outputting messages in relation to where things are within a bigfile. We could never even begin to translate bigfile locations to smallfile locations. (Afaics)

So, for the time being, I think this would be a very large change without a huge benefit, except that I get to work with Schematron.

Another thing that might be harder with Schematron (afaict): different levels of messages (info, warning, error). I am not totally married to that idea any more, though, because it means subjectively saying that some rules are more or less important than others, which is a hard problem.

In essence, I think there are better fish to try frying (NLP stuff, getting useful XML source file names and lines e.g.).

tomschr commented 7 years ago

(1) There is no acute issue that makes SDSC's XSLT parts inadequate. (At least none that Schematron would solve.)

This is true, of course. I thought of it more as a long-term goal, IF we find out that Schematron would be more appropriate or easier to maintain or...

(2) Even with Schematron, all the text-based checks are sophisticated enough that you'd need extension functions to realize them (I think).

I didn't tackle this one in my issue because my long text would have become even longer. ;-) Yes, I thought of that too. I assume extension functions can be used in a Schematron schema as well. After all, the schema is compiled to an XSLT stylesheet anyway.

(3) We would be stuck with outputting messages in relation to where things are within a bigfile. We could never even begin to translate bigfile locations to smallfile locations. (Afaics)

Not exactly sure what this has to do with Schematron. We face this problem anyway, regardless of whether we use Schematron or not.

... be a very large change without a huge benefit ...

I see it more as a refactoring effort which we do anyway. Surely, this is a design decision which we shouldn't take lightly.

In essence, I think there are better fish to try frying (NLP stuff, getting useful XML source file names and lines e.g.).

Maybe. ;-) I see this topic as interconnected with GeekoDoc. Therefore, I've opened a separate issue in the GeekoDoc repo (see above).

For the time being, we'd better close this.

ghost commented 7 years ago

Not exactly sure what this has to do with Schematron. We face this problem anyway, regardless of whether we use Schematron or not.

Within our current infrastructure, I see a way around this. With Schematron, given that the language has its own way of reporting errors, that would seem to become more problematic.