qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/

Multiple Schemas #451

Open michaelhkay opened 1 year ago

michaelhkay commented 1 year ago

There are many situations in which a single transformation wants to deal with multiple schemas: for example when transforming from v1 of some industry standard to v2 of the same standard, or when processing a collection of input documents each of which references its own schema using xsi:schemaLocation.

This is currently possible only if the schemas are compatible (that is, if the union of the schemas is itself a valid schema). And even where it is possible, validation against the union of S1 and S2 may produce a different outcome from validation against S2, for example because a strict wildcard allows content that S2 would not allow. Substitution groups are a particular problem: if v1 and v2 have elements with different substitution group membership, then validating against the union of v1 and v2 allows the union of the substitution groups, which means that you haven't actually verified that the result document is valid against v2.
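
The strict-wildcard case can be illustrated with a toy model. This is only a sketch: a schema is reduced to the set of its global element names, and the element names (`order`, `legacy-note`, `comment`) are invented for illustration rather than taken from any real vocabulary.

```python
# Toy model: a schema is just the set of globally declared element names.
# A strict wildcard accepts a child element only if the schema used for
# validation contains a global declaration for that element.

def strict_wildcard_accepts(schema_elements: set, child_name: str) -> bool:
    """Return True if a strict wildcard would accept an element of this name."""
    return child_name in schema_elements

# Hypothetical schemas: S1 declares a legacy element that S2 has dropped.
s1 = {"order", "legacy-note"}
s2 = {"order", "comment"}
union = s1 | s2

# The same child element, assessed against the union and against S2 alone.
accepted_by_union = strict_wildcard_accepts(union, "legacy-note")
accepted_by_s2 = strict_wildcard_accepts(s2, "legacy-note")
```

In this model the union accepts `legacy-note` while S2 alone rejects it, so a successful validation against the union says nothing definite about validity against S2.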

The problem is confounded by considerations that are outside the scope of the spec. What happens when you run two different stylesheets against the same source document? If the source document has been validated against S1, this means that both stylesheets must use schemas that are supersets of S1. The way this requirement is managed in Saxon is to introduce the concept of a Configuration in which transformations run; a Configuration has a single schema, and all source documents and stylesheets within the Configuration must use compatible subsets of this schema. A source document validated using one Configuration cannot be used in a different Configuration, because the type annotations would be meaningless against a different schema.

My proposal is to introduce the idea of a named schema (that is, a named collection of schema components). When we do xsl:import-schema, we can give the imported schema a name, and there is no requirement that the components in this schema should be compatible with the components in any other schema. When we refer to a schema type (for example in $s cast as QName) we should be able to qualify the type name with a schema name (we can postpone discussions of syntax, let's say cast as my:part-number§v1 for now). When we request validation, we should be able to nominate the schema to be used for validation, for example <xsl:element name="e" validation="strict" schema="v2">.

The trickiest part is handling source documents, mainly because validation of source documents (especially those read using doc() or collection()) is at present almost entirely implementation-defined. I believe that we need explicit options to request validation of source documents against a specific schema. There should also be an option to validate a document against the schema identified in its own xsi:schemaLocation, in which case there should be no requirement that that schema is compatible with any schema known statically to the stylesheet.

michaelhkay commented 1 year ago

Let's try a more detailed sketch of the proposal.

The static context is enhanced so it contains one unnamed schema and any number of named schemas. The schema name is an NCName.

The import schema declaration in XQuery and XSLT is enhanced so that you can import a schema and name it at the same time. Also you can simply import a schema by name without specifying a namespace or location hint; this works on the basis that you must have previously made the schema known to the XSLT/XQuery processor using some external API (e.g. by loading it into an XML database). If not preloaded in this way, schema names are local to a query or stylesheet; two different queries or stylesheets can use the same name to refer to different schemas, or different names to refer to the same schema.

In the SequenceType/ItemType syntax, any reference to a schema component (schema element or attribute name, schema type name) may be qualified by the schema name; I'm inclined to use the syntax SS/TT where SS is the schema name and TT is the type name.

The built-in types such as xs:integer are present in every schema and their names do not need to be qualified. Every other type belongs exclusively to one schema, even if multiple schemas are derived from the same source documents.
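
A minimal sketch of how such a static context might resolve type references, assuming the tentative SS/TT syntax above. The class and method names (`StaticContext`, `resolve`) and the sample schema contents are hypothetical, and only two built-in types are listed to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

# Built-in types are present in every schema, so they never need qualifying.
BUILT_IN = {"xs:integer", "xs:string"}

@dataclass
class StaticContext:
    unnamed: Set[str] = field(default_factory=set)             # types in the unnamed schema
    named: Dict[str, Set[str]] = field(default_factory=dict)   # schema name -> its types

    def resolve(self, ref: str) -> Tuple[Optional[str], str]:
        """Resolve 'SS/TT' or 'TT' to a (schema name, type name) pair."""
        schema_name, sep, type_name = ref.rpartition("/")
        if type_name in BUILT_IN:
            return (None, type_name)
        if not sep:
            # Unqualified reference: look in the unnamed schema only.
            if type_name in self.unnamed:
                return (None, type_name)
            raise LookupError(f"unknown type {type_name}")
        if type_name not in self.named.get(schema_name, set()):
            raise LookupError(f"unknown type {type_name} in schema {schema_name}")
        return (schema_name, type_name)

ctx = StaticContext(
    unnamed={"my:order"},
    named={"v1": {"my:part-number"}, "v2": {"my:part-number"}},
)
```

Here `v1/my:part-number` and `v2/my:part-number` resolve to distinct pairs even though the type name is the same, which reflects the rule that every non-built-in type belongs to exactly one schema.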

Where XSLT or XQuery syntax is used to invoke strict or lax validation of an instance document, the syntax is enhanced to allow a schema to be named. An option such as schema=#local is provided to indicate that the document should be validated against a schema identified using xsi:schemaLocation, which will be built as a free-standing schema and not interfere with any other schemas in use. This will result in the document having type annotations referring to types that are not in the static context and therefore cannot be referenced by name.

Functions like doc() and collection() are augmented with options to request validation against a specific schema (or a local schema).

michaelhkay commented 1 year ago

Note that XSLT currently says:

The schema components imported into different [packages] within a [stylesheet] must be consistent. Specifically, it is not permitted to use the same name in the same XSD symbol space to refer to different schema components within different packages; and the union of the schema components imported into the packages of a stylesheet must constitute a valid schema (as well as the set of schema components imported into each package forming a valid schema in its own right).

This definition is inadequate. Suppose package P imports namespace N, while package Q imports N and M. And suppose that M contains an element declaration F that is declared to be within the substitution group of an element E defined in N. Then in package Q, F is substitutable for E, while in package P it is not, which means that an element validated in package Q against a type T may be invalid against type T in package P. If a function in P is declared to expect an argument of type element(*, T), then it is wrong to assume that an element validated against type T in package Q can be passed across this interface.
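
The P/Q scenario can be modelled in a few lines. This is a rough sketch under simplifying assumptions: each schema is reduced to a map from a head element to the members of its substitution group, and the helper names (`merge`, `substitutable`) are invented for the example.

```python
def merge(*schemas: dict) -> dict:
    """Combine the substitution groups contributed by several imported schemas."""
    merged = {}
    for schema in schemas:
        for head, members in schema.items():
            merged.setdefault(head, set()).update(members)
    return merged

def substitutable(subst_groups: dict, head: str, candidate: str) -> bool:
    """Return True if candidate may appear where head is expected."""
    return candidate == head or candidate in subst_groups.get(head, set())

# Namespace N declares E; namespace M adds F to E's substitution group.
schema_n = {"E": set()}
schema_m = {"E": {"F"}}

package_p = merge(schema_n)            # P imports N only
package_q = merge(schema_n, schema_m)  # Q imports N and M

f_ok_in_q = substitutable(package_q, "E", "F")
f_ok_in_p = substitutable(package_p, "E", "F")
```

In this model F is accepted in place of E in package Q but not in package P, so content validated in Q can fail the same check in P.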

(Saxon currently deals with this by using the union of all these schemas at run-time. But this isn't right either, because an element validated against this union schema may be invalid against a subset of the schema.)

One solution is to impose stronger constraints on the consistency of the schemas imported by the packages making up a stylesheet. As far as I'm aware the cases where the validity of an element against types in schema S is affected by adding components from another schema T include:

A rather heavy-handed way forward might be to define schemas as incompatible if they are affected by these issues.

A less draconian solution might be to say that a function expecting an instance of element(*, T) has to satisfy itself that the supplied element is valid against type T as defined in the schema of the containing package; the fact that the element was validated against type T in some other schema is not by itself proof of this. This may involve revalidation. But this raises questions about the type annotation of the revalidated node.

Currently validating a node involves copying it, to create a different node with different identity. Perhaps the proposal for issue #596 (pinned values) allows us to contemplate the idea of having two "annotated nodes" that share the same "node identity" but have different (or multiple) type annotations?

michaelhkay commented 1 year ago

There might be a better approach to this: when a stylesheet declares a function parameter of type element(*, T), the compiler cannot assume that the supplied element is valid against T as defined in that package's imported schema; it can only assume that it is valid against T in some schema that includes a "compatible" definition of the type T. The definitions of type T in two different schemas are not rendered incompatible by virtue of having different substitution groups, different types derived by extension, or additional declarations that might satisfy strict or lax wildcards. This means, for example, that the XSLT compiler cannot assume that a path such as element(*, T)//X will select nothing based on the imported schema for that package; it must take into account that a descendant named X might be permitted in some different but compatible schema.

Note that this doesn't just affect stylesheets with multiple packages, it affects any situation where the schema used to validate a source document differs in any way (including, for example, the use of xsi:schemaLocation) from the schema imported into a query or stylesheet.

michaelhkay commented 1 year ago

So sketching this out:

Two schemas [sets of schema components] A and B are compatible if for every QName that identifies global element or attribute declarations or global schema types existing in both A and B, the definitions of that component in the two schemas are compatible. For two schema components to be compatible, the properties of the schema components must be the same. Note that it is NOT required that every valid instance of a type T when assessed using schema A is also a valid instance of type T when assessed using schema B. For example, the effects of validating against type T may vary depending on substitution group membership, types derived by extension, or element declarations that satisfy lax or strict wildcards.
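
One way to read this definition is sketched below. The representation is an assumption made for the example: a schema is a dictionary keyed by (symbol space, QName), each value is a dictionary of component properties, and the property names are invented. Substitution group membership is not recorded on the head element, so adding a member in one schema does not change the head's properties.

```python
def compatible(schema_a: dict, schema_b: dict) -> bool:
    """Two schemas are compatible if every component present in both has the same properties."""
    shared = schema_a.keys() & schema_b.keys()
    return all(schema_a[key] == schema_b[key] for key in shared)

# Hypothetical component properties, for illustration only.
type_t = {"base": "xs:anyType", "content": ("a", "b")}
elem_e = {"type": "T", "abstract": False}

schema_a = {("type", "T"): type_t, ("element", "E"): elem_e}

# Schema B adds element F, a member of E's substitution group. E itself is unchanged.
schema_b = dict(schema_a)
schema_b[("element", "F")] = {"type": "T", "abstract": False, "substitutes": "E"}

# Schema C changes the content model of T, so T differs from the T in schema A.
schema_c = {("type", "T"): {"base": "xs:anyType", "content": ("a", "b", "c")}}

a_b_compatible = compatible(schema_a, schema_b)
a_c_compatible = compatible(schema_a, schema_c)
```

Schemas A and B come out as compatible even though validation against them can differ, which is exactly the case the definition is meant to permit. Schemas A and C do not, because the properties of T differ.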

When a stylesheet or query uses an item type reference such as element(*, T) or schema-element(E), it cannot be assumed that the instances of that type have been validated using the schema defined by the static context of that item type reference; only that they have been validated using a schema that is compatible with that one.

michaelhkay commented 1 year ago

Note that XQuery (in §2.3.5) defines stronger constraints for cross-module schema consistency:

For a given query, define a participating ISSD as the [in-scope schema definitions] of a module that is used in evaluating the query. If two participating ISSDs contain a definition for the same schema type, element name, or attribute name, the definitions must be equivalent in both ISSDs. In this context, equivalence means that validating an instance against type T in one ISSD will always have the same effect as validating the same instance against type T in the other ISSD (that is, it will produce the same PSVI, insofar as the PSVI is used during subsequent processing). This means, for example, that the membership of the substitution group of an element declaration in one ISSD must be the same as that of the corresponding element declaration in the other ISSD; that the set of types derived by extension from a given type must be the same; and that in the presence of a strict or lax wildcard, the set of global element (or attribute) declarations capable of matching the wildcard must be the same.

In practice, it is very hard to satisfy these constraints unless all modules use exactly the same schema (and unless validation of instance documents also uses that schema). (In Saxon, all modules do instance validation against the union of all the imported schemas, though the scope of names used in each module is confined to the schema components imported by the specific module.)

I propose to loosen the constraints as described in previous comments. Something like:

For a given query, define a participating ISSD as the [in-scope schema definitions] of a module that is used in evaluating the query. If two participating ISSDs contain a definition for the same schema type name, element name, or attribute name, the definitions must be compatible in both ISSDs. In this context, "compatible" means that the corresponding schema component has the same properties, and that the schema components that it transitively refers to have the same properties. Note that compatibility does not imply that validating an instance against type T in one ISSD will always have the same effect as validating the same instance against type T in the other ISSD. For example, the membership of the substitution group of an element declaration in one ISSD may be different from that of the corresponding element declaration in the other ISSD; the set of types derived by extension from a given type may differ; and in the presence of a strict or lax wildcard, the set of global element (or attribute) declarations capable of matching the wildcard may differ.

A processor can safely assume that if two schemas contain a type T that is derived from the same <xs:complexType> element in the same schema document, without any use of xs:redefine or xs:override, then the two definitions of T are compatible. Moreover, a processor may assume the converse: that if the two types are not derived from the same element in the same document, then they are not compatible. However, a processor MAY at its discretion perform a more careful analysis to establish component compatibility where this condition is not satisfied.
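
The conservative rule could be sketched as a provenance comparison. Everything here is an assumption made for illustration: the `Provenance` record, its fields, and the example URIs do not correspond to any existing processor API.

```python
from typing import NamedTuple

class Provenance(NamedTuple):
    document_uri: str   # the schema document the type was read from
    element_id: str     # hypothetical identifier of the <xs:complexType> element
    redefined: bool     # True if xs:redefine or xs:override was involved

def assume_compatible(a: Provenance, b: Provenance) -> bool:
    """Conservative check: same source element, same document, no redefinition."""
    if a.redefined or b.redefined:
        return False
    return (a.document_uri, a.element_id) == (b.document_uri, b.element_id)

t_in_s1 = Provenance("http://example.com/parts.xsd", "complexType[3]", False)
t_in_s2 = Provenance("http://example.com/parts.xsd", "complexType[3]", False)
t_copy = Provenance("http://example.com/parts-copy.xsd", "complexType[3]", False)
t_redefined = Provenance("http://example.com/parts.xsd", "complexType[3]", True)

same_source = assume_compatible(t_in_s1, t_in_s2)
different_document = assume_compatible(t_in_s1, t_copy)
with_redefine = assume_compatible(t_in_s1, t_redefined)
```

A `False` result only means the cheap test did not establish compatibility. A processor that chooses to do the more careful analysis would fall back to comparing component properties at that point.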

This means, for example, that in a module containing a public function that expects an argument of type schema-element(E), the module must accept an element node that has been validated against the element declaration E using any schema that contains this declaration. Unless the schema declaration of E prohibits substitution, any element defined in any compatible schema as being in the substitution group of E must be accepted, whether or not that module's ISSD includes that element in the substitution group.

There's still a bit of a loose end here. To check whether an element F is a valid instance of schema-element(E), are we expected to look at the schema against which F was validated, to see whether F was included in the substitution group of E in that schema? If so, that has implications on the data model. We're a bit vague as to exactly what information is available from the type annotation on a node. Is it a schema type, or just a type name, or is it a (schema type, schema) pair?
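
To make the question concrete, here is a sketch of one possible answer, in which a node remembers the schema that validated it. This is not a proposal for the data model; the class names, the `validated_by` field, and the `consult_validating_schema` switch are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Schema:
    name: str
    subst_groups: dict  # head element name -> set of member element names

@dataclass
class ElementNode:
    name: str
    validated_by: Schema  # the schema half of a (schema type, schema) pair

def is_schema_element(node: ElementNode, head: str, module_schema: Schema,
                      consult_validating_schema: bool) -> bool:
    """Test a node against schema-element(head) using one schema or the other."""
    schema = node.validated_by if consult_validating_schema else module_schema
    return node.name == head or node.name in schema.subst_groups.get(head, set())

# The module's schema knows only E; the validating schema puts F in E's group.
module_schema = Schema("module", {"E": set()})
validating_schema = Schema("source", {"E": {"F"}})
f_node = ElementNode("F", validating_schema)

with_validating_schema = is_schema_element(f_node, "E", module_schema, True)
with_module_schema_only = is_schema_element(f_node, "E", module_schema, False)
```

The two results differ in this model, which is why the choice matters: if the annotation carries only a type name, the module has no way of discovering that F was substitutable for E when the node was validated.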

michaelhkay commented 1 year ago

I wrote: "As far as I'm aware the cases where the validity of an element against types in schema S is affected by adding components from another schema T include:..."

For the record I found another case: the outcome of validating an element that uses type alternatives (conditional type assignment) may depend on whether attributes of an ancestor element are declared to be inheritable, which may vary from one schema to another.

ChristianGruen commented 11 months ago

Partially resolved (#635); “PR Pending” removed.