oasis-tcs / lexidma

OASIS Lexicographic Infrastructure Data Model and API (LEXIDMA) TC: A repository designed for use in development of TC chartered work products and test suites. https://github.com/oasis-tcs/lexidma

validation artifacts for XML and JSON #113

Closed mjakubicek closed 5 months ago

mjakubicek commented 5 months ago

Comments on variants of schemas:

Each schema (XML and JSON) has two variants: one for documents implementing the Crosslingual Module (and possibly some other modules) and one for documents not implementing it (but possibly implementing some other modules).

Comments on fixes made in the attached example XML files (compared to what is in the repository at https://github.com/oasis-tcs/lexidma/tree/master/dmlex-v1.0/specification/examples/examples/source):

three sources of issues:

  1. lexicographicResource/title as element vs. as attribute
  2. attribute langCode missing in documents with entry as top-level document (where it is required)
  3. order of types among children different from order in the serialization (the reasoning is that if element order is often significant within the same element type, the order in which the types are listed in the serialization should be preserved as well; it would not be possible to implement an arbitrary order of types in which siblings of the same type still stick together)

Following issues have been fixed in the provided files:

0.xml.xml contains lexicographicResource/title as element instead of attribute as in the XML Serialization

5.xml.xml reverses the order of partOfSpeechTag and inflectedFormTag, in contrast to the XML Serialization

7.xml.xml contains lexicographicResource/title as element instead of attribute as in the XML Serialization

8.xml.xml all instances of headwordTranslation are missing the (in this case required) attribute "langCode"

9.xml.xml all instances of headwordTranslation and headwordExplanation are missing the (in this case required) attribute "langCode"

10.xml.xml contains lexicographicResource/title as element instead of attribute as in the XML Serialization

14.xml.xml translationLanguage and entry are in inverse order when compared to the XML Serialization

20.xml.xml headwordTranslation is missing the (in this case required) attribute "langCode"

21.xml.xml headwordTranslation and exampleTranslation are missing the (in this case required) attribute "langCode"

21.xml.xml order of headwordTranslation and example differs from order in the XML Serialization

24.xml.xml order of etymonLanguage and etymonType differs from order in the XML Serialization


New comments on the PDF document (dmlex-v1.0-csd02.pdf), ordered by chapter numbers:

Some properties in the specification use the data type "string", e.g. 4.3.5 memberType has property "role" with data type "string" (unlike the following properties, which have data type "non-empty string")

4.3.5 there is no constraint establishing that "max" must be greater than or equal to "min" – should it be added?

5.2.2.1 missing "transcriptionSchemeTags" in the list of "Members if implementing the Controlled Values Module"

in multiple places, where the specification/serialization says "number", apparently an integer is meant (such as "min" and "max" in memberType, or "startIndex" and "endIndex" in the various markers, and probably also obverseListingOrder)

5.1.2.25 (already reported:) relationType.@scopeRestriction should be OPTIONAL (like in JSONSchema and the model), not REQUIRED
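
The two memberType remarks above (that "max" should not be less than "min", and that "number" apparently means an integer) could be checked along these lines. This is only a sketch in Python: the function name `check_member_type` is hypothetical, and treating "min" and "max" as non-negative integers is an assumption that the specification would have to confirm.

```python
import re
import xml.etree.ElementTree as ET


def check_member_type(elem):
    """Return a list of problems found on one memberType element.

    Assumption (not confirmed by the spec): min and max, when present,
    are non-negative integers written with ASCII digits.
    """
    problems = []
    bounds = {}
    for name in ("min", "max"):
        raw = elem.get(name)
        if raw is None:
            continue  # both attributes are treated as optional here
        if re.fullmatch(r"[0-9]+", raw) is None:
            problems.append(f"{name}={raw!r} is not a non-negative integer")
        else:
            bounds[name] = int(raw)
    # The ordering constraint only makes sense if both bounds parsed.
    if len(bounds) == 2 and bounds["max"] < bounds["min"]:
        problems.append(
            f"max ({bounds['max']}) is less than min ({bounds['min']})"
        )
    return problems
```

Calling it on `ET.fromstring('<memberType type="sense" min="3" max="1"/>')` would report the min/max conflict, while a value such as `min="1.5"` would be reported as not being an integer.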


Old (already reported) comments on the PDF document (dmlex-v1.0-csd02.pdf), ordered by chapter numbers:

2. posses [typo]

2. fargemnt [typo]

  1. Components fargemnt [typo]

3.1 uri REQUIRED (zero or one). [required element cannot occur zero times]

4.2.9 missing labelTypeTag among "property of" sameAs

4.3.5 In memberType, property "type" is defined as UNIQUE, which contradicts existing examples (such as 12.xml) and does not seem to make sense (e.g. why would it not be possible to define a relation between two entries or two senses?). In the proposed schema, the corresponding uniqueness constraint is present, but commented out (inactive) and marked as "possibly erroneous".

4.3.6 & A.1.17 & A.1.19 remainders of probably deprecated property memberRole (not defined anywhere)

5.1.2.1 transcriptionSchemeTag is not listed as possible child of lexicographicResource

5.1.2.15 XML element: is missing among child elements

5.4. relationhips [typo]

5.4. reational [typo]

5.4.3.1 includng all [typo]

jmccrae commented 5 months ago

A couple of issues:

blahma commented 5 months ago

Hello, this is Marek Blahuš from Lexical Computing. I have authored both the schemas (XML and JSON). Thank you for your feedback.

  • The spec does not state that XML elements are required to appear in any particular order. I would recommend using <xs:all> in place of <xs:sequence> throughout and reverting examples 5, 14, 21 and 24.

The reasons why I thought it makes more sense for the XML elements to appear in a fixed order:

  1. It feels strange if order matters among elements of the same type (listingOrder is implicit from the XML serialization) but those elements can be freely intertwined with elements of other types. I believe that children of an element should be either order-sensitive or order-insensitive, but not something in between.
  2. Indeed, in XML Schema 1.0, no element may appear as a child of <xs:all> more than once, i.e. the permissible values of minOccurs and maxOccurs for the child elements are 0 and 1. Most children of <lexicographicResource> are allowed to appear multiple times. I can see this constraint has been relaxed in XML Schema 1.1, on which I had to start relying during the schema design process anyway, so with this in mind, it is indeed possible to relax the prescribed order of elements, even if they occur multiple times.
  3. Using the <xs:all> model group is discouraged by XML Schema best practices: "When should I use <all> model group? Never. The <all> model groups' limited applicability and unexpected extension semantics should be avoided. Use a <sequence> instead." But perhaps this criticism is limited to XML Schema 1.0 only.
  4. Another motivation for using sequence was the intention to write a single schema with "switches" representing individual modules, which could be activated by the user at will during validation. In the current absence of such a mechanism, the decision was made to create two separate versions of the schema, which happen to suffice, because the modules define almost exclusively optional extensions to existing elements (the Crosslingual Module with translationLanguage being an exception). Even now, however, the present grouping in sequences, sometimes intentionally redundant, still makes orientation easier, e.g. if someone wants to strip the schema down for their own use which accepts only a subset of the available modules.
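
To make the difference between the ordering options in the list above concrete, here is a rough Python sketch (not part of the schemas). Both function names are hypothetical, and the expected type order passed to `follows_type_order` is an illustrative assumption, not the order taken from the XML Serialization.

```python
import xml.etree.ElementTree as ET


def types_are_contiguous(parent):
    """True if siblings of the same element type form one unbroken run."""
    seen = []
    for child in parent:
        if seen and seen[-1] == child.tag:
            continue  # still inside the current run of this type
        if child.tag in seen:
            return False  # this type already had a run earlier
        seen.append(child.tag)
    return True


def follows_type_order(parent, expected_order):
    """True if child types respect expected_order (xs:sequence-like)."""
    rank = {tag: i for i, tag in enumerate(expected_order)}
    positions = []
    for child in parent:
        if child.tag not in rank:
            return False  # unknown child type
        positions.append(rank[child.tag])
    return positions == sorted(positions)


# Children grouped by type, but in the opposite order to the assumed one.
doc = ET.fromstring(
    "<lexicographicResource>"
    "<entry/><entry/><translationLanguage/>"
    "</lexicographicResource>"
)
ASSUMED_ORDER = ["translationLanguage", "entry"]  # assumption for illustration
```

With this input, `types_are_contiguous(doc)` holds while `follows_type_order(doc, ASSUMED_ORDER)` does not, which is the situation a fixed sequence rejects and a relaxed model group would accept. A document with entry, translationLanguage, entry interleaved would fail both checks.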

To conclude, I will provide an alternative version of the XML Schema(s) with <xs:sequence> replaced with <xs:all> where applicable.

  • The specification clearly states that langCode may be omitted on headwordTranslation and exampleTranslation if it can be inferred from above. These schemas make it mandatory in all cases. This needs to be fixed (I am not sure if the condition in the spec is actually implementable in these schema languages). We should revert examples 8, 9, 14, 20 and 21. We should probably also have an example that is complete (i.e., starts with a <lexicographicResource>) to test this better.

I have double-checked and attribute langCode is required on lexicographicResource and optional on headwordTranslation, headwordExplanation and exampleTranslation in both the XML and JSON Schemas, exactly as mandated by the specification. For the latter three elements in XML Schema, an <xs:assert> makes sure that if the attribute is omitted, then there must be exactly one translationLanguage child of lexicographicResource. A similar check is implemented in the JSON Schema, this time within the definition of lexicographicResource, where a rather deep validating hierarchy implements the constraint that langCode be required if and only if there is not exactly one translationLanguage below the lexicographicResource. In my opinion, therefore, the schemas correctly follow the specification in this regard.

The changes in examples 8, 9, 20 and 21 have been suggested in order to make them valid documents in accordance with the specification (which allows for <entry> to be the top-level element, but then there is no "above" from which langCode could be inferred and therefore it must be explicitly stated wherever applicable). The main motivation behind modifying these examples was that they pass validation against the schema. If the examples in Appendix A were not necessarily meant to be full documents (there are, in my opinion, very few hints that would suggest it could be the case), then these changes might indeed be reverted; but in such a case those examples will not validate against the schema anymore. Even if the design of the serializations allows for one or multiple entry elements to form a valid document on their own, it should be noted that a pre-existing dependence on a lexicographicResource in the form of its translationLanguage might be an obstacle to the otherwise nice idea of simply dumping a subset of a lexicographicResource's entry elements on their own.

Note that the change in example 14 is not related to the required/optional langCode issue, but rather to the fixed/arbitrary element order issue as discussed above (to which some more of the proposed example changes, not explicitly mentioned in the list quoted above, are related).
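
For readers who want to see the langCode rule outside of schema syntax, the condition described above can be approximated in a few lines of Python. This is a sketch under assumptions, not the logic of the actual schemas: the function name `missing_lang_codes` is hypothetical, and the elements are assumed to carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# Elements on which langCode may be omitted only when it can be inferred.
TRANSLATION_ELEMENTS = (
    "headwordTranslation",
    "headwordExplanation",
    "exampleTranslation",
)


def missing_lang_codes(root):
    """Return tags of translation elements whose langCode cannot be inferred.

    langCode is treated as inferable only when the document root is a
    lexicographicResource with exactly one translationLanguage child.
    With <entry> as the top-level element there is nothing to infer from.
    """
    inferable = (
        root.tag == "lexicographicResource"
        and len(root.findall("translationLanguage")) == 1
    )
    if inferable:
        return []
    return [
        el.tag
        for el in root.iter()
        if el.tag in TRANSLATION_ELEMENTS and el.get("langCode") is None
    ]
```

A document whose root is `<entry>` and which contains a `headwordTranslation` without langCode is flagged, whereas the same entry wrapped in a `lexicographicResource` with a single `translationLanguage` is not.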

jmccrae commented 5 months ago

These probably need to be implemented as new issues

mjakubicek commented 5 months ago

@DavidFatDavidF, can you please explain in more detail why the <xs:import> should be necessary? We tried to track it down but found no evidence that the normal way (xmlns:xs="http://www.w3.org/2001/XMLSchema") is deprecated or otherwise not recommended.

DavidFatDavidF commented 5 months ago

Make it reviewable. There will be more changes due to #72