nexusformat / definitions

Definitions of the NeXus Standard File Structure and Contents
https://manual.nexusformat.org/

nxdlTypes: invalid XML schema #1368

Open woutdenolf opened 4 months ago

woutdenolf commented 4 months ago
import xmlschema
xmlschema.XMLSchema("nxdl.xsd")
Traceback (most recent call last):
  File "/home/denolf/virtualenvs/nexus/lib/python3.10/site-packages/xmlschema/validators/schemas.py", line 1197, in _parse_inclusions
    self.include_schema(location, self.base_url)
  File "/home/denolf/virtualenvs/nexus/lib/python3.10/site-packages/xmlschema/validators/schemas.py", line 1264, in include_schema
    schema = type(self)(
  File "/home/denolf/virtualenvs/nexus/lib/python3.10/site-packages/xmlschema/validators/schemas.py", line 482, in __init__
    self.parse_error(e.reason or e, elem=e.elem)
  File "/home/denolf/virtualenvs/nexus/lib/python3.10/site-packages/xmlschema/validators/xsdbase.py", line 196, in parse_error
    raise error
xmlschema.validators.exceptions.XMLSchemaParseError: character data between child elements not allowed:

Schema component:

  <xs:element xmlns:xs="http://www.w3.org/2001/XMLSchema" name="example">m^2</xs:element>

Path: /xs:schema/xs:simpleType[4]/xs:annotation/xs:documentation/xs:element[1]

Schema URL: file:///tmp/nexus_definitions/nxdlTypes.xsd

Origin URL: file:///tmp/nexus_definitions/nxdl.xsd
woutdenolf commented 4 months ago

This markup was introduced in https://github.com/nexusformat/definitions/pull/592 to make the unit examples machine-readable. Unfortunately, the result does not appear to be a valid XML Schema: https://www.w3schools.com/xml/el_documentation.asp

prjemian commented 4 months ago

Thanks. Mea culpa. I'll look at it later today.


prjemian commented 4 months ago

At issue here is the use of xs:element element(s) within an xs:documentation element. The error is due to the presence of text content in the xs:element element: https://github.com/nexusformat/definitions/blob/4c09c7718c41dc90eb996475efdf1c0d30fb1d5d/nxdlTypes.xsd#L87-L95

The intent of PR #592: "makes the examples machine-readable".

The examples should be refactored so that the XML Schema file is valid while the units remain machine-readable.

prjemian commented 4 months ago

@woutdenolf : Why are you convinced the XML Schema file nxdlTypes.xsd is not a valid XML Schema file?

The XML Schema files (nxdl.xsd and nxdlTypes.xsd) are checked (by the lxml package) each time they are loaded in the current unit testing: https://github.com/nexusformat/definitions/blob/4c09c7718c41dc90eb996475efdf1c0d30fb1d5d/dev_tools/tests/test_nxdl.py#L30 https://github.com/nexusformat/definitions/blob/4c09c7718c41dc90eb996475efdf1c0d30fb1d5d/dev_tools/nxdl/syntax.py#L23 https://github.com/nexusformat/definitions/blob/4c09c7718c41dc90eb996475efdf1c0d30fb1d5d/dev_tools/nxdl/syntax.py#L12 https://github.com/nexusformat/definitions/blob/4c09c7718c41dc90eb996475efdf1c0d30fb1d5d/dev_tools/globals/directories.py#L40C30-L40C44 https://github.com/nexusformat/definitions/blob/4c09c7718c41dc90eb996475efdf1c0d30fb1d5d/dev_tools/globals/directories.py#L116

When I load this file into an XML Schema editor in a new installation of the Eclipse IDE (eclipse.org), the validator for such XML Schemas does not show any errors on these xs:element elements within an xs:documentation element. (Errors would be flagged by a red icon; all the icons are green.)

The W3Schools page you referenced states that an xs:documentation element can contain

Any well-formed XML content

prjemian commented 4 months ago

On the xmlschema page on PyPI, there is an example using XSD 1.1. It shows the exact same error as you described above. According to the docs, you are using the XSD 1.0 support.

Would it help to add the attribute type="xs:string" to each of the xs:element elements?

                <xs:element name="example">m^2</xs:element>
                <xs:element name="example">barns</xs:element>

would change to

                <xs:element name="example" type="xs:string">m^2</xs:element>
                <xs:element name="example" type="xs:string">barns</xs:element>
woutdenolf commented 4 months ago

Adding type="xs:string" does not help

Traceback (most recent call last):
  File "/home/denolf/virtualenvs/nexus/lib/python3.10/site-packages/xmlschema/validators/schemas.py", line 1197, in _parse_inclusions
    self.include_schema(location, self.base_url)
  File "/home/denolf/virtualenvs/nexus/lib/python3.10/site-packages/xmlschema/validators/schemas.py", line 1264, in include_schema
    schema = type(self)(
  File "/home/denolf/virtualenvs/nexus/lib/python3.10/site-packages/xmlschema/validators/schemas.py", line 482, in __init__
    self.parse_error(e.reason or e, elem=e.elem)
  File "/home/denolf/virtualenvs/nexus/lib/python3.10/site-packages/xmlschema/validators/xsdbase.py", line 196, in parse_error
    raise error
xmlschema.validators.exceptions.XMLSchemaParseError: character data between child elements not allowed:

Schema component:

  <xs:element xmlns:xs="http://www.w3.org/2001/XMLSchema" name="example" type="xs:string">rad</xs:element>

Path: /xs:schema/xs:simpleType[2]/xs:annotation/xs:documentation/xs:element

Schema URL: file:///tmp/nexus_definitions/nxdlTypes.xsd

Origin URL: file:///tmp/nexus_definitions/nxdl.xsd

Also using XMLSchema11 does not change anything

import xmlschema
xmlschema.XMLSchema11("nxdl.xsd")

I believe we are using "1.0" anyway

<?xml version="1.0" encoding="UTF-8"?>

"Why are you convinced the XML Schema file nxdlTypes.xsd is not a valid XML Schema file?"

"Convinced" is a strong word. The validator I used said "character data between child elements not allowed", so I thought there was an issue. But maybe there is an issue with the validator. I'll investigate.

"Any well-formed XML content"

I did not understand what that meant.

prjemian commented 4 months ago

What about xmlschema.XMLSchema("nxdl.xsd", validation="lax")? The default is strict. I'm thinking the xmlschema package is different from (and stricter than) the other two tools I showed above.

Well-formed XML content adheres to the XML standard:

  • XML documents must have a root element
  • XML elements must have a closing tag
  • XML tags are case sensitive
  • XML elements must be properly nested
  • XML attribute values must be quoted

Valid XML means additional rules must be met. Valid XML must adhere to the DTD or schema. In this case, it is the schema for XML Schema files.
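
The distinction can be illustrated with the Python standard library, whose parser checks well-formedness only and knows nothing about the schema for XML Schema files. This is a minimal sketch: the fragment is an abbreviated stand-in for the markup in nxdlTypes.xsd (the type name NX_AREA is assumed for illustration), and is_well_formed is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the markup under discussion, not the real file.
FRAGMENT = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="NX_AREA">
    <xs:annotation>
      <xs:documentation>
        units of area
        <xs:element name="example">m^2</xs:element>
      </xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
</xs:schema>
"""


def is_well_formed(text):
    """Return True if text parses as XML; no schema rules are checked."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(FRAGMENT))          # the fragment parses
print(is_well_formed("<a><b></a></b>"))  # improperly nested tags do not
```

A parser accepting the fragment therefore says nothing about whether it is a valid XML Schema; that needs a validating tool.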

woutdenolf commented 4 months ago

With validation="lax" the exception is indeed not raised. I asked the xmlschema developers which rule "character data between child elements not allowed" refers to.

Thanks for looking into this @prjemian !

prjemian commented 4 months ago

Their example suggested the root cause to me. In their example, which produced the same error message, the problem was produced from the xs:element when the type was not specified.

As I understand it, type defaults to xs:string. Perhaps the strict mode (in xmlschema) does not assume this default? Defining the type did not solve the problem, but dropping the mode to lax suppresses the error.

woutdenolf commented 4 months ago

It turns out to be an error in the validator (mixed vs. element-only). Sorry for the noise!

prjemian commented 4 months ago

Thanks for testing our foundation structures!

woutdenolf commented 4 months ago

In fact there is a problem as pointed out here: https://github.com/sissaschool/xmlschema/issues/390#issuecomment-1986452747

The problem is not xs:documentation (it can contain elements) but xs:element itself (it cannot contain text).
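
A quick way to locate the offending declarations is to scan the schema for xs:element elements that carry character data. The following is a rough sketch using only the standard library; SAMPLE is an abbreviated stand-in for nxdlTypes.xsd and elements_with_text is a hypothetical helper, not part of the repository's dev_tools.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Abbreviated stand-in for nxdlTypes.xsd (type name assumed for illustration).
SAMPLE = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="NX_AREA">
    <xs:annotation>
      <xs:documentation>
        units of area
        <xs:element name="example">m^2</xs:element>
        <xs:element name="example">barns</xs:element>
      </xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
</xs:schema>
"""


def elements_with_text(xsd_text):
    """List (name, text) for every xs:element that has leading character data."""
    root = ET.fromstring(xsd_text)
    found = []
    for elem in root.iter(f"{{{XS}}}element"):
        # elem.text is the character data before the first child, if any.
        text = (elem.text or "").strip()
        if text:
            found.append((elem.get("name"), text))
    return found


for name, text in elements_with_text(SAMPLE):
    print(f"xs:element name={name!r} contains text {text!r}")
```

The check only inspects text before the first child of each xs:element, which is enough for the declarations shown in this thread.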

prjemian commented 3 months ago

A suggestion at today's telco was to define an example element for this use. The example element would have a namespace specific to this XML Schema (and not visible externally, such as from .nxdl.xml files).

Keep in mind the goal is to provide machine readable example(s) for each of the unit types. The Python code that creates documentation for each of the NXDL classes is the first consumer of these examples. The NeXusOntology is another potential consumer of these examples.
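
As a sketch of what such a consumer might look like if a dedicated example element were adopted, the code below collects the examples per unit type. Everything specific here is an assumption: the namespace URI urn:example:nxdl-examples, the ex:example element name, and the unit_examples helper are hypothetical, and the type names are used for illustration only.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
# Hypothetical namespace for a dedicated example element; not defined by NeXus.
EX = "urn:example:nxdl-examples"

SAMPLE = f"""\
<xs:schema xmlns:xs="{XS}" xmlns:ex="{EX}">
  <xs:simpleType name="NX_ANGLE">
    <xs:annotation>
      <xs:documentation>units of angle <ex:example>rad</ex:example></xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
  <xs:simpleType name="NX_AREA">
    <xs:annotation>
      <xs:documentation>
        units of area
        <ex:example>m^2</ex:example>
        <ex:example>barns</ex:example>
      </xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
</xs:schema>
"""


def unit_examples(xsd_text):
    """Map each simpleType name to the example units found in its documentation."""
    root = ET.fromstring(xsd_text)
    path = f"{{{XS}}}annotation/{{{XS}}}documentation/{{{EX}}}example"
    examples = {}
    for simple_type in root.findall(f"{{{XS}}}simpleType"):
        values = [e.text.strip() for e in simple_type.findall(path) if e.text]
        if values:
            examples[simple_type.get("name")] = values
    return examples


print(unit_examples(SAMPLE))
```

Because the example element lives outside the XML Schema namespace, a schema validator has no reason to apply the rules for xs:element to it.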

prjemian commented 3 months ago

Whatever remedies are suggested for this issue should be tested with both the libxml2 validation tools and the xmlschema package. There seem to be differences in the validation rules of these two XML tool chains.

paulmillar commented 3 months ago

Just to add some further thoughts:

The stated goal here is to provide information that is machine actionable; i.e., not for human consumption. This is rather at odds with the definition of <xs:documentation>, which is defined as content for human consumption.

XML Schema also supports <xs:appinfo> elements (as a child of <xs:annotation>). Unlike <xs:documentation>, the <xs:appinfo> element is intended for applications to process. Therefore, I'm wondering whether <xs:appinfo> might be a better fit.

I'm also wondering if these "examples" are really doing more work than providing examples of valid units. Perhaps these are really intended to provide canonical representations of specific units. For example, providing the example "angstrom" suggests that the unit attribute is written angstrom (and not angstroms, ångström or Å). Similarly, adding an example could define whether 10^-9 meters is written nanometer or nanometre.

Such a list of canonical forms would be an open enumeration (other units are still accepted). We would want to document the canonical form for certain well-known units. Also, using a non-canonical form might trigger a warning by a validator (as hinted by the pull request).

Such a list of canonical forms might be defined through an <xs:appinfo>; for example:

<xs:simpleType name="NX_WAVELENGTH">
    <xs:annotation>
        <xs:documentation>units of wavelength</xs:documentation>
        <xs:appinfo>
            <nxdl-def:values>
                <nxdl-def:value>angstrom</nxdl-def:value>
                <nxdl-def:value>nanometre</nxdl-def:value>
            </nxdl-def:values>
        </xs:appinfo>
    </xs:annotation>
    <xs:restriction base="xs:string"/>
</xs:simpleType>
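
An application could read such an xs:appinfo list with a few lines of Python. This is only a sketch: the nxdl-def prefix is not bound to a namespace anywhere in this thread, so the URI urn:example:nxdl-def is a placeholder, and canonical_units is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
# Placeholder URI; the real namespace for nxdl-def is not given in the thread.
NXDL_DEF = "urn:example:nxdl-def"

SAMPLE = f"""\
<xs:schema xmlns:xs="{XS}" xmlns:nxdl-def="{NXDL_DEF}">
  <xs:simpleType name="NX_WAVELENGTH">
    <xs:annotation>
      <xs:documentation>units of wavelength</xs:documentation>
      <xs:appinfo>
        <nxdl-def:values>
          <nxdl-def:value>angstrom</nxdl-def:value>
          <nxdl-def:value>nanometre</nxdl-def:value>
        </nxdl-def:values>
      </xs:appinfo>
    </xs:annotation>
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
</xs:schema>
"""


def canonical_units(xsd_text, type_name):
    """Return the values listed in xs:appinfo for the named simpleType."""
    root = ET.fromstring(xsd_text)
    ns = {"xs": XS, "d": NXDL_DEF}
    path = (
        f"xs:simpleType[@name='{type_name}']"
        "/xs:annotation/xs:appinfo/d:values/d:value"
    )
    return [v.text.strip() for v in root.findall(path, ns) if v.text]


print(canonical_units(SAMPLE, "NX_WAVELENGTH"))
```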

However, since this is really just an enumeration, we could define the values using standard XML Schema terms; for example:

<xs:simpleType name="NX_WAVELENGTH">
    <xs:annotation>
        <xs:documentation>units of wavelength</xs:documentation>
    </xs:annotation>

    <xs:union>
        <xs:simpleType>
            <xs:restriction base="xs:string">
                <xs:enumeration value='angstrom'>
                    <xs:annotation>
                        <xs:documentation>The non-SI unit equivalent to 10^-10 meters</xs:documentation>
                    </xs:annotation>
                </xs:enumeration>
                <xs:enumeration value='nanometre'>
                    <xs:annotation>
                        <xs:documentation>The SI unit equivalent to 10^-9 meters</xs:documentation>
                    </xs:annotation>
                </xs:enumeration>
            </xs:restriction>
        </xs:simpleType>

        <xs:simpleType>
            <xs:restriction base="xs:string">
                <xs:annotation>
                    <xs:documentation>Non-canonical unit. Please register with NIAC.</xs:documentation>
                    <xs:appinfo>
                        <nxdl-def:warn>Non canonical unit</nxdl-def:warn>
                    </xs:appinfo>
                </xs:annotation>
            </xs:restriction>
        </xs:simpleType>
    </xs:union>
</xs:simpleType>

The documentation could then auto-generate the list of canonical representations for units (along with human-readable descriptions). A validator could also issue a warning if a non-canonical form is used.
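
The warning idea might look roughly like this in Python. It is a sketch under assumptions: the schema text is an abbreviated version of the union proposal (without the nxdl-def:warn element), and canonical_forms and check_unit are hypothetical helpers rather than anything a NeXus validator currently provides.

```python
import warnings
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Abbreviated version of the union-based proposal.
SAMPLE = f"""\
<xs:schema xmlns:xs="{XS}">
  <xs:simpleType name="NX_WAVELENGTH">
    <xs:union>
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="angstrom"/>
          <xs:enumeration value="nanometre"/>
        </xs:restriction>
      </xs:simpleType>
      <xs:simpleType>
        <xs:restriction base="xs:string"/>
      </xs:simpleType>
    </xs:union>
  </xs:simpleType>
</xs:schema>
"""


def canonical_forms(xsd_text, type_name):
    """Collect the xs:enumeration values declared for the named simpleType."""
    root = ET.fromstring(xsd_text)
    simple_type = root.find(f"xs:simpleType[@name='{type_name}']", {"xs": XS})
    if simple_type is None:
        return set()
    return {e.get("value") for e in simple_type.iter(f"{{{XS}}}enumeration")}


def check_unit(unit, canonical):
    """Return True for a canonical unit; otherwise warn and return False."""
    if unit in canonical:
        return True
    warnings.warn(f"Non-canonical unit: {unit!r}", stacklevel=2)
    return False


canonical = canonical_forms(SAMPLE, "NX_WAVELENGTH")
check_unit("angstrom", canonical)   # accepted silently
check_unit("angstroms", canonical)  # still accepted by the schema, but warns
```

The second member of the union keeps the enumeration open, so the warning is advisory and a file using another unit string would still validate.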