redhat-developer / vscode-xml

Editing XML in Visual Studio Code made easy
Eclipse Public License 2.0

Support Schematron validation #451

Open tomschr opened 3 years ago

tomschr commented 3 years ago

Situation

The README lists some validation with XSD and DTD. However, Schematron validation in combination with RELAX NG (RNG) is currently not supported.

Proposed solution

Schematron rules should be recognized and supported in other schema languages like RNG or XSD.

datho7561 commented 3 years ago

I am interested in implementing this. If we are not able to support XSD 1.1, this would be a viable way to support some form of assert for XML documents.

It might make sense to include it as an extension to vscode-xml instead of as a part of the base functionality. Not every user will need the Schematron support, and moving the Schematron parts to another extension will reduce the total download for users who don't need Schematron. However, I think it's worth seeing the size increase in the LemMinX JAR and binary first before coming to a decision on this.

angelozerr commented 3 years ago

@datho7561 IMHO I think we should start implementing RelaxNG support in https://github.com/redhat-developer/vscode-xml/issues/450 before Schematron. In other words, finish and merge the PR https://github.com/eclipse/lemminx/pull/841 which provides basic support for RelaxNG (just validation without error range).

This PR uses jing and it seems jing provides support for Schematron too.

svanteschubert commented 3 years ago

RelaxNG is just another grammar alongside DTD and W3C XML Schema. RelaxNG is the most powerful of the three major grammars (see http://pike.psu.edu/publications/toit05.pdf). All of these grammars are supported by Sun's Multi Schema Validator (MSV): https://xmlark.github.io/msv/docs/nativeAPI.html Therefore I would suggest considering MSV! I have taken over the Maven release rights from my former colleague Kohsuke Kawaguchi and still have to finish a Maven release (https://github.com/xmlark/msv/tree/build-refactoring), likely followed by a move of the repo to The Document Foundation (TDF), which hosts LibreOffice and the ODF Toolkit I maintain. I am the OASIS ODF TC co-chair, but wearing a different hat I started yesterday on an open-source syntax binding editor based on your extension for the NGI DAPSI project: https://dapsi.ngi.eu/hall-of-fame/idiss/ Perhaps we can find some synergies over the next months ;-)

angelozerr commented 3 years ago

Therefore I would suggest considering MSV!

Thanks so much @svanteschubert for your comment! We are open to providing RelaxNG support with any library. Jing seemed to be the best Java library for RelaxNG, but to be honest with you, I didn't know about MSV.

If we wish to consume MSV, the first question is the license. If MSV's license is not compatible with EPL 2.0, we cannot use it.

Perhaps we can find some synergies over the next months ;-)

It would be fantastic if we could work together to integrate MSV (if the license is compatible with EPL 2.0) into LemMinX. Do you think you could contribute to LemMinX for the RelaxNG support?

svanteschubert commented 3 years ago

@angelozerr The license is BSD. Oracle has abandoned the project, and I have officially and legally forked from the available sources of their last Maven release: https://search.maven.org/artifact/net.java.dev.msv/msv/2013.6.1/pom Technically, I was not aware of the Oracle branch, so I started to work on https://github.com/kohsuke/msv That is the reason the copyright header is being added/merged in later. The result will be equivalent in the end! I believe fixing the copyright headers is the final task before the Maven MSV release (and the repo shift to TDF, Red Hat or Eclipse) ;-)

You may see the MSV Maven Oracle history at https://search.maven.org/artifact/net.java.dev.msv/msv

Eclipse mentions in their FAQ that BSD-licensed code may be relicensed, so I believe we are safe: https://www.eclipse.org/legal/epl-2.0/faq.php#h.nzy2s8vsuxe2clarify

Regarding your attempt at shanghaiing me - I guess you have succeeded ;-) I only put my toe into the project this Monday but I would say tentatively yes, but it would help if I could team up later with one of you to get feedback on the principal design decisions beforehand and not during the pull request review ;-) I have just started, and summer vacation starts soon. I would do a new MSV release first to have a stable base, and post the design suggestions here once I have a better understanding of the status quo.

This might be the beginning of a long friendship :)

angelozerr commented 3 years ago

I only put my toe into the project this Monday but I would say tentatively yes, but it would help if I could team up later with one of you to get feedback on the principal design decisions beforehand and not during the pull request review ;-)

First we must check that MSV can be consumed in LemMinX (it should be possible, because BSD and ASF licenses should be compatible with EPL). After that I suggest you provide

For validation, I suggest that you look at my draft PR to support RelaxNG with Jing https://github.com/eclipse/lemminx/pull/841 and adapt it to consume MSV. The hard part of validation is that you should highlight the error as a range and not just at a given offset. Once validation is finished, you will need to write tests. See https://github.com/eclipse/lemminx/blob/master/org.eclipse.lemminx/src/test/java/org/eclipse/lemminx/extensions/contentmodel/XMLSchemaDiagnosticsTest.java

I suggest you read https://github.com/eclipse/lemminx/blob/master/docs/LemMinX-Extensions.md

If you have any other questions, don't hesitate to ask.

This might be the beginning of a long friendship :)

I hope :)

Oobiewan commented 1 year ago

Hi All, I am a bit confused about the scope of this issue, is the goal to support schematron for RELAX NG only? Does the extension support schematron for DTD already? I cannot find any mentions of schematron in the description of the extension. Thanks, Benedek

svanteschubert commented 1 year ago

You are right to be confused. IMHO we talked about Schema, not Schematron. :-) DTD, XSD, and RNG all have in common that they define a relation between parents and their children (attributes, child elements and content, and their datatypes). Schematron was invented because there are also constraints that span the complete document (let's call them business rules). For example, the net amounts of the invoice lines in an invoice XML file have to add up to the total invoice net amount. Schematron is therefore also used for the EU e-invoice validator: https://github.com/ConnectingEurope/eInvoicing-EN16931/blob/master/ubl/schematron/UBL/EN16931-UBL-model.sch
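To make the invoice example concrete, here is a minimal Python sketch of that kind of document-wide rule. It is not Schematron itself and not how the EU validator works; the element names (`invoice`, `line`, `netAmount`, `total`) are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical invoice layout; the element names are illustrative only
# and do not follow UBL or any other real invoice schema.
INVOICE = """\
<invoice>
  <line><netAmount>10.50</netAmount></line>
  <line><netAmount>4.50</netAmount></line>
  <total><netAmount>15.00</netAmount></total>
</invoice>"""


def check_net_total(xml_text):
    """Return a list of error messages; an empty list means the rule holds."""
    root = ET.fromstring(xml_text)
    # Sum the net amounts of all invoice lines.
    line_sum = sum(
        (Decimal(e.text) for e in root.findall("./line/netAmount")),
        Decimal("0"),
    )
    # Compare against the total declared elsewhere in the same document.
    declared = Decimal(root.findtext("./total/netAmount"))
    if line_sum != declared:
        return [f"Invoice lines add up to {line_sum}, but the total says {declared}"]
    return []


print(check_net_total(INVOICE))
```

A grammar such as DTD, XSD 1.0 or RNG can say that `netAmount` must be a decimal, but it cannot relate the values of sibling subtrees like this; that cross-document comparison is the part a Schematron assertion would express.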

The Schematron technology is closely linked to XSLT. The rule files are transformed into XSLT files that perform the validation. Philip Helger offers some Java-based software for Schematron that is also used in the EU validator.

Hope I could help,
Svante

datho7561 commented 1 year ago

I am a bit confused about the scope of this issue,

This issue is for tracking Schematron schema (*.sch) support in lemminx. I believe that you can reference a Schematron file using <?xml-model href="..."?> in any XML document, so that you can validate against a .sch schema alongside other schemas (XSD, DTD, and RNG once it's provided).
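As a rough illustration of the discovery step, the following Python sketch reads `xml-model` processing instructions from a document prolog and picks out the ones that point at Schematron. It is not LemMinX code; the file name `rules.sch` and the helper names are hypothetical, and the sketch assumes the schema is marked with a `schematypens` pseudo-attribute holding the ISO Schematron namespace.

```python
import re
from xml.dom import minidom

SCHEMATRON_NS = "http://purl.oclc.org/dsdl/schematron"

# Hypothetical document; "rules.sch" is a placeholder file name.
DOC = f"""\
<?xml-model href="rules.sch" schematypens="{SCHEMATRON_NS}"?>
<root/>"""

# Pseudo-attributes inside a processing instruction: name="value" or name='value'.
PSEUDO_ATTR = re.compile(r"""([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def xml_models(xml_text):
    """Return one dict of pseudo-attributes per xml-model instruction in the prolog."""
    doc = minidom.parseString(xml_text)
    models = []
    for node in doc.childNodes:
        is_pi = node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        if is_pi and node.target == "xml-model":
            attrs = {name: dq or sq for name, dq, sq in PSEUDO_ATTR.findall(node.data)}
            models.append(attrs)
    return models


def schematron_hrefs(xml_text):
    """Return the href of every xml-model instruction marked as Schematron."""
    return [
        m["href"]
        for m in xml_models(xml_text)
        if m.get("schematypens") == SCHEMATRON_NS and "href" in m
    ]


print(schematron_hrefs(DOC))
```

A real implementation would probably also have to decide what to do when `schematypens` is missing, for example by falling back to the file extension; that choice is left open here.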

I started a POC at https://github.com/datho7561/lemminx-schematron based on SchXslt. The issue, as Svante mentioned, is that Schematron is based on XSLT, so we need a library to handle XSLT. My POC uses Saxon-HE for the XSLT processing. The issue with this is that Saxon is very large. The main alternative I've looked into, Xalan, has some bugs that prevent it from working with SchXslt. I tried building Xalan from sources, but I ran into some issues. Since Schematron is not a priority for us at the moment, I don't have time to work on the POC.

Oobiewan commented 1 year ago

@datho7561 I see, that's a pity, and I'm sorry I don't have the skills to help. Thanks a lot for explaining. @svanteschubert Thanks for the reply. I see that you did not talk about Schematron, but this issue is about Schematron support according to the title and @datho7561. Now I have the feeling that the issue reporter didn't really know the difference between Schematron and validating against a schema. I just thought it might help everyone involved if we clarify the target here.

pgundlach commented 1 year ago

@Oobiewan I am very sure that the issue reporter knows the difference between Schema and Schematron as he is an expert in the field (although I know that even experts have bad days).

wendellpiez commented 1 year ago

Observations, hopefully clarifying:

  1. As was noted, Schematron can be used as a standalone validation technology alongside or even instead of other schemas including RNG, XSD and DTD. For example, some systems may read <?xml-model ... ?> from documents to discover appropriate Schematron to apply in addition to or in lieu of other validation. This is useful but does not describe every application of Schematron. Schematron can also be embedded inside XSD or RNG (and cf XSD 1.1 assert as mentioned by @datho7561), providing Schematron (rules-based) checking along with the grammar-based checking described by @svanteschubert.
  2. The XSLT dependency also mentioned by @datho7561 only applies in certain cases (albeit on this planet, most/all), depending in part on a setting (/schema/@queryBinding) inside the Schematron. Only xslt1 Schematrons will work under Xalan (an XSLT 1.0 processor) even given an XSLT 1.0 Schematron transpiler. These days there are also use cases for xslt2 or better, which entails an XSLT 2.0 processor or better (hence Saxon).
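The `queryBinding` check from point 2 can be sketched in a few lines of Python. This is only an illustration, not part of any existing tool: the helper names are hypothetical, the sample schema is made up, and the sketch assumes that a missing `queryBinding` attribute means the XSLT 1.0 default.

```python
import xml.etree.ElementTree as ET

SCH_NS = "http://purl.oclc.org/dsdl/schematron"

# Made-up schema used only to exercise the helpers below.
SCHEMA = f"""\
<schema xmlns="{SCH_NS}" queryBinding="xslt2">
  <pattern>
    <rule context="invoice">
      <assert test="line">An invoice needs at least one line.</assert>
    </rule>
  </pattern>
</schema>"""


def query_binding(schema_text):
    """Return the lower-cased /schema/@queryBinding of a Schematron schema."""
    root = ET.fromstring(schema_text)
    if root.tag != f"{{{SCH_NS}}}schema":
        raise ValueError("root element is not an ISO Schematron <schema>")
    # Assumption: an absent attribute is treated as the XSLT 1.0 default.
    return root.get("queryBinding", "xslt").lower()


def runs_on_xslt1_processor(schema_text):
    """True if the declared binding only asks for XSLT 1.0 (e.g. Xalan)."""
    return query_binding(schema_text) in ("xslt", "xslt1")


print(query_binding(SCHEMA), runs_on_xslt1_processor(SCHEMA))
```

A check like this would let an implementation decide up front whether a given schema can be handled by an XSLT 1.0 processor or needs an XSLT 2.0 (or later) processor such as Saxon.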

(Incidentally, https://github.com/schxslt/schxslt/tree/master/core/src/main/resources/xslt does show 1.0 code so it is possible this is a supported use case....)

Thus, the question (IMV) of whether/when to support external Schematron (as a useful validation option) vs embedded Schematron (for schemas in RNG and XSD that embed Schematron rules) can be separated from the question of which version(s) of Schematron to support, with which dependencies. For less effort, support only Schematrons external to other schemas, not embedded and not in/as XSD 1.1 assert (which is similar but also its own thing with its own spec, indeed requiring XPath 2.0). Whether to limit to xslt1 or support xslt2 or better then becomes a question of what the installation can do.

It's very nice to see people using Schematron (the language where you get to write your own error messages).

angelozerr commented 1 year ago

Thanks @wendellpiez for your great feedback. Unfortunately we have no time at the moment to work on this support; if somebody wants to work on it, I will be happy to help them.

If we see that more and more people want this support, we could work on it.

Any contributions are welcome! We have no time to study Schematron libraries, but if someone can provide us with a POC which handles Schematron validation, I could give some help.