Open woutdenolf opened 9 months ago
This was created in https://github.com/nexusformat/definitions/pull/592 to make the unit examples machine readable. Unfortunately this is not a valid XML schema: https://www.w3schools.com/xml/el_documentation.asp
Thanks. Mea culpa. I'll look at it later today.
At issue here is the use of xs:element element(s) within an xs:documentation element. The error is due to the presence of text content in the xs:element element: https://github.com/nexusformat/definitions/blob/4c09c7718c41dc90eb996475efdf1c0d30fb1d5d/nxdlTypes.xsd#L87-L95
The intent of PR #592: "makes the examples machine-readable"
The examples should be re-factored to make the XML Schema file valid, yet allow the units to be machine-readable.
@woutdenolf: Why are you convinced the XML Schema file nxdlTypes.xsd is not a valid XML Schema file?
The XML Schema files (nxdl.xsd and nxdlTypes.xsd) are checked (by the lxml package) each time they are loaded in the current unit testing:
https://github.com/nexusformat/definitions/blob/4c09c7718c41dc90eb996475efdf1c0d30fb1d5d/dev_tools/tests/test_nxdl.py#L30
https://github.com/nexusformat/definitions/blob/4c09c7718c41dc90eb996475efdf1c0d30fb1d5d/dev_tools/nxdl/syntax.py#L23
https://github.com/nexusformat/definitions/blob/4c09c7718c41dc90eb996475efdf1c0d30fb1d5d/dev_tools/nxdl/syntax.py#L12
https://github.com/nexusformat/definitions/blob/4c09c7718c41dc90eb996475efdf1c0d30fb1d5d/dev_tools/globals/directories.py#L40C30-L40C44
https://github.com/nexusformat/definitions/blob/4c09c7718c41dc90eb996475efdf1c0d30fb1d5d/dev_tools/globals/directories.py#L116
When I load this file into an XML Schema editor in a new installation of the Eclipse IDE (eclipse.org), the validator for such XML Schemas does not show any errors on these xs:element elements within an xs:documentation element. (Errors would be flagged by a red icon; all the icons are green.)
The W3Schools page you referenced states that an xs:documentation element can contain "Any well-formed XML content".
On the xmlschema page on PyPI, there is an example using XSD 1.1. It shows the exact same error as you described above. According to the docs, you are using the XSD 1.0 support.
Would it help to add an additional attribute, type="xs:string", to each of the xs:element elements?
<xs:element name="example">m^2</xs:element>
<xs:element name="example">barns</xs:element>
would change to
<xs:element name="example" type="xs:string">m^2</xs:element>
<xs:element name="example" type="xs:string">barns</xs:element>
Adding type="xs:string" does not help:
Traceback (most recent call last):
  File "/home/denolf/virtualenvs/nexus/lib/python3.10/site-packages/xmlschema/validators/schemas.py", line 1197, in _parse_inclusions
    self.include_schema(location, self.base_url)
  File "/home/denolf/virtualenvs/nexus/lib/python3.10/site-packages/xmlschema/validators/schemas.py", line 1264, in include_schema
    schema = type(self)(
  File "/home/denolf/virtualenvs/nexus/lib/python3.10/site-packages/xmlschema/validators/schemas.py", line 482, in __init__
    self.parse_error(e.reason or e, elem=e.elem)
  File "/home/denolf/virtualenvs/nexus/lib/python3.10/site-packages/xmlschema/validators/xsdbase.py", line 196, in parse_error
    raise error
xmlschema.validators.exceptions.XMLSchemaParseError: character data between child elements not allowed:
Schema component:
  <xs:element xmlns:xs="http://www.w3.org/2001/XMLSchema" name="example" type="xs:string">rad</xs:element>
Path: /xs:schema/xs:simpleType[2]/xs:annotation/xs:documentation/xs:element
Schema URL: file:///tmp/nexus_definitions/nxdlTypes.xsd
Origin URL: file:///tmp/nexus_definitions/nxdl.xsd
Also, using XMLSchema11 does not change anything:
import xmlschema
xmlschema.XMLSchema11("nxdl.xsd")
I believe we are using "1.0" anyway:
<?xml version="1.0" encoding="UTF-8"?>
Why are you convinced the XML Schema file nxdlTypes.xsd is not a valid XML Schema file?
"Convinced" is a strong word. The validator I used said "character data between child elements not allowed", so I thought there was an issue. But maybe there is an issue with the validator. I'll investigate.
Any well-formed XML content
I did not understand what that meant.
What about xmlschema.XMLSchema("nxdl.xsd", validation="lax")? The default is strict. I'm thinking the xmlschema package is different (and more strict) than the other two tools I showed above.
Well-formed XML content adheres to the XML standard:
- XML documents must have a root element
- XML elements must have a closing tag
- XML tags are case sensitive
- XML elements must be properly nested
- XML attribute values must be quoted
Valid XML means additional rules must be met. Valid XML must adhere to the DTD or schema. In this case, it is the schema for XML Schema files.
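To make the distinction concrete, here is a minimal sketch using only Python's standard library. The helper name is_well_formed is made up for this illustration. It only tests well-formedness (does the text parse as XML at all?) and says nothing about validity against the schema for XML Schema files, which is the question in this issue.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML, regardless of any schema."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


XS = 'xmlns:xs="http://www.w3.org/2001/XMLSchema"'

# Well-formed, even though the XML Schema rules may still reject it.
print(is_well_formed(f'<xs:element {XS} name="example">m^2</xs:element>'))

# Not well-formed: the closing tag differs in case from the opening tag.
print(is_well_formed(f'<xs:element {XS} name="example">m^2</xs:Element>'))

# Not well-formed: the attribute value is not quoted.
print(is_well_formed(f'<xs:element {XS} name=example/>'))
```

The first snippet passes this check, which is consistent with tools that only look at well-formedness inside xs:documentation not reporting a problem.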
With validation="lax" the exception is indeed not raised. I asked the xmlschema developers what rule "character data between child elements not allowed" refers to.
Thanks for looking into this @prjemian !
Their example suggested the root cause to me. In their example, which produced the same error message, the problem arose from an xs:element whose type was not specified.
As I understand it, type defaults to xs:string. Perhaps the strict mode (in xmlschema) does not assume this default? Defining the type did not solve the problem, but dropping the mode to lax ignores it.
It turns out to be an error in the validator (mixed vs. element-only). Sorry for the noise!
Thanks for testing our foundation structures!
In fact there is a problem as pointed out here: https://github.com/sissaschool/xmlschema/issues/390#issuecomment-1986452747
It is not xs:documentation which is wrong (it can have elements) but xs:element itself (it cannot have text).
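A quick way to locate the offending declarations is to scan the schema for xs:element elements that carry character data. The sketch below uses only the standard library. The helper name elements_with_text, the type name NX_AREA, and the sample document are assumptions modeled on the snippets quoted in this thread, not a copy of nxdlTypes.xsd.

```python
import xml.etree.ElementTree as ET

XS_NS = "http://www.w3.org/2001/XMLSchema"


def elements_with_text(xsd_text: str) -> list[tuple[str, str]]:
    """Return (name, text) for every xs:element that contains character data."""
    root = ET.fromstring(xsd_text)
    found = []
    for elem in root.iter(f"{{{XS_NS}}}element"):
        text = (elem.text or "").strip()
        if text:
            found.append((elem.get("name", ""), text))
    return found


# Hypothetical sample in the style of the current unit examples.
SAMPLE = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="NX_AREA">
    <xs:annotation>
      <xs:documentation>
        units of area
        <xs:element name="example">m^2</xs:element>
        <xs:element name="example">barns</xs:element>
      </xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
</xs:schema>
"""

print(elements_with_text(SAMPLE))
```

This is only a detection aid; it does not replace validation with libxml2 or xmlschema.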
A suggestion at today's telco was to define an example element for this use. The example element would have a namespace specific to this XML Schema (and not visible externally, such as from .nxdl.xml files).
Keep in mind the goal is to provide machine readable example(s) for each of the unit types. The Python code that creates documentation for each of the NXDL classes is the first consumer of these examples. The NeXusOntology is another potential consumer of these examples.
Any suggested remedy for this issue should be tested with both the libxml2 validation tools and the xmlschema package. There seem to be differences in the validation rules of these two XML tool chains.
Just to add some further thoughts:
The stated goal here is to provide information that is machine actionable; i.e., not for human consumption. This is rather at odds with the definition of <xs:documentation>, which is defined as content for human consumption.
XML Schema also supports <xs:appinfo> elements (as a child of <xs:annotation>). Unlike <xs:documentation>, the <xs:appinfo> element is intended for applications to process. Therefore, I'm wondering whether <xs:appinfo> might be a better fit.
I'm also wondering if these "examples" are really doing more work than providing examples of valid units. Perhaps these are really intended to provide canonical representations of specific units. For example, providing the example "angstrom" suggests that the unit attribute is written angstrom (and not angstroms, ångström or Å). Similarly, adding an example could define whether 10^-9 meters is written nanometer or nanometre.
Such a list of canonical forms would be an open enumeration (other units are still accepted). We would want to document the canonical form for certain well-known units. Also, using a non-canonical form might trigger a warning by a validator (as hinted by the pull-request).
Such a list of canonical forms might be defined through an <xs:appinfo>; for example:
<xs:simpleType name="NX_WAVELENGTH">
  <xs:annotation>
    <xs:documentation>units of wavelength</xs:documentation>
    <xs:appinfo>
      <nxdl-def:values>
        <nxdl-def:value>angstrom</nxdl-def:value>
        <nxdl-def:value>nanometre</nxdl-def:value>
      </nxdl-def:values>
    </xs:appinfo>
  </xs:annotation>
  <xs:restriction base="xs:string"/>
</xs:simpleType>
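A consumer such as the documentation builder could read values stored this way with a few lines of standard-library Python. This is only a sketch: the function name canonical_units and the namespace URI urn:example:nxdl-def are placeholders invented here, since no URI for the nxdl-def prefix has been proposed in this thread.

```python
import xml.etree.ElementTree as ET

# Prefixes used in the search paths below; they need not match the document's.
NS = {
    "xs": "http://www.w3.org/2001/XMLSchema",
    "nxd": "urn:example:nxdl-def",  # hypothetical namespace URI
}

SAMPLE = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:nxdl-def="urn:example:nxdl-def">
  <xs:simpleType name="NX_WAVELENGTH">
    <xs:annotation>
      <xs:documentation>units of wavelength</xs:documentation>
      <xs:appinfo>
        <nxdl-def:values>
          <nxdl-def:value>angstrom</nxdl-def:value>
          <nxdl-def:value>nanometre</nxdl-def:value>
        </nxdl-def:values>
      </xs:appinfo>
    </xs:annotation>
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
</xs:schema>
"""


def canonical_units(xsd_text: str) -> dict[str, list[str]]:
    """Map each named simpleType to the unit values listed in its xs:appinfo."""
    root = ET.fromstring(xsd_text)
    result = {}
    for simple_type in root.findall("xs:simpleType", NS):
        values = simple_type.findall(
            "xs:annotation/xs:appinfo/nxd:values/nxd:value", NS
        )
        result[simple_type.get("name")] = [
            v.text.strip() for v in values if v.text
        ]
    return result


print(canonical_units(SAMPLE))
```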
However, since this is really just an enumeration, we could define the values using standard XML Schema terms; for example:
<xs:simpleType name="NX_WAVELENGTH">
  <xs:annotation>
    <xs:documentation>units of wavelength</xs:documentation>
  </xs:annotation>
  <xs:union>
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value='angstrom'>
          <xs:annotation>
            <xs:documentation>The non-SI unit equivalent to 10^-10 meters</xs:documentation>
          </xs:annotation>
        </xs:enumeration>
        <xs:enumeration value='nanometre'>
          <xs:annotation>
            <xs:documentation>The SI unit equivalent to 10^-9 meters</xs:documentation>
          </xs:annotation>
        </xs:enumeration>
      </xs:restriction>
    </xs:simpleType>
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:annotation>
          <xs:documentation>Non-canonical unit. Please register with NIAC.</xs:documentation>
          <xs:appinfo>
            <nxdl-def:warn>Non canonical unit</nxdl-def:warn>
          </xs:appinfo>
        </xs:annotation>
      </xs:restriction>
    </xs:simpleType>
  </xs:union>
</xs:simpleType>
The documentation could then auto-generate the list of canonical representations for units (along with human-readable descriptions). A validator might also be able to issue a warning if a non-canonical form is used.
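As a rough illustration of both consumers, the sketch below collects the enumerated forms with their descriptions and flags anything else as non-canonical. It uses only the standard library. The names canonical_forms and check_unit are hypothetical, and the sample omits the nxdl-def warning element because its namespace is not yet defined.

```python
import xml.etree.ElementTree as ET

NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

SAMPLE = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="NX_WAVELENGTH">
    <xs:union>
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="angstrom">
            <xs:annotation>
              <xs:documentation>The non-SI unit equivalent to 10^-10 meters</xs:documentation>
            </xs:annotation>
          </xs:enumeration>
          <xs:enumeration value="nanometre">
            <xs:annotation>
              <xs:documentation>The SI unit equivalent to 10^-9 meters</xs:documentation>
            </xs:annotation>
          </xs:enumeration>
        </xs:restriction>
      </xs:simpleType>
      <xs:simpleType>
        <xs:restriction base="xs:string"/>
      </xs:simpleType>
    </xs:union>
  </xs:simpleType>
</xs:schema>
"""


def canonical_forms(xsd_text: str, type_name: str) -> dict[str, str]:
    """Map each enumerated value of the named simpleType to its description."""
    root = ET.fromstring(xsd_text)
    forms = {}
    for simple_type in root.findall("xs:simpleType", NS):
        if simple_type.get("name") != type_name:
            continue
        for enum in simple_type.iterfind(".//xs:enumeration", NS):
            doc = enum.find("xs:annotation/xs:documentation", NS)
            forms[enum.get("value")] = (doc.text or "").strip() if doc is not None else ""
    return forms


def check_unit(unit: str, forms: dict[str, str]):
    """Return a warning message for a non-canonical unit, or None if canonical."""
    if unit in forms:
        return None
    return f"Non-canonical unit {unit!r}; known forms: {', '.join(sorted(forms))}"


forms = canonical_forms(SAMPLE, "NX_WAVELENGTH")
print(forms)
print(check_unit("angstrom", forms))
print(check_unit("angstroms", forms))
```

Because the union keeps the open xs:string branch, a unit such as angstroms would still be schema-valid; the warning is advisory only.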