relaton / relaton-py

Python library for Relaton

XML Schema validation fails after adding <refcontent> under <reference> element #35

Closed stefanomunarini closed 2 years ago

stefanomunarini commented 2 years ago

After updating the reference serializer (PR) to include <refcontent> as part of the <reference> element, the XML output stopped validating against the official RNC schema.

For reference, the error raised by the validator is:

lxml.etree.DocumentInvalid: Element 'reference', attribute 'refcontent': The attribute 'refcontent' is not allowed.
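Note that the message above names an attribute called refcontent rather than a child element. The thread does not say which shape the serializer actually produced, so the following is only a minimal sketch of the two possibilities, using the standard library's xml.etree.ElementTree instead of the project's own serializer. The anchor value and text content are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shape A: refcontent as a child element of <reference>.
# The "EXAMPLE" anchor and the text are hypothetical placeholders.
as_element = ET.Element("reference", {"anchor": "EXAMPLE"})
ET.SubElement(as_element, "refcontent").text = "Example content"

# Shape B: refcontent as an attribute on <reference>.
# A validator would report this as an attribute, not as an element.
as_attribute = ET.Element(
    "reference", {"anchor": "EXAMPLE", "refcontent": "Example content"}
)

# Print both serializations so the difference is visible.
print(ET.tostring(as_element, encoding="unicode"))
print(ET.tostring(as_attribute, encoding="unicode"))
```

Comparing the two printed strings with the actual serializer output is one quick way to tell whether a schema complaint concerns an element or an attribute.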

Related: https://github.com/ietf-ribose/bibxml-service/issues/155

ronaldtse commented 2 years ago

This is directly related to the comment made here: https://github.com/ietf-ribose/bibxml-service/issues/228#issuecomment-1177699864

According to the latest RFCXML v3, <refcontent> is allowed in <reference>: https://authors.ietf.org/rfcxml-vocabulary#refcontent

In addition, the RNC Schema supports <refcontent> in <reference>:

It seems that this is an instance of the validation process being broken?

stefanomunarini commented 2 years ago

> It seems that this is an instance of the validation process being broken?

What do you mean by this, @ronaldtse?

ronaldtse commented 2 years ago

The <refcontent> element is valid under <reference>, so validation should not fail. If it does fail, doesn't that mean the validation itself is broken?

strogonoff commented 2 years ago

@ronaldtse I think the implication here is that the RNC schema we were given to validate against may be wrong. The error is pretty clear: the schema we have, contrary to the comment that prompted us to use <refcontent>, does not allow <refcontent>. Perhaps IETF has a more up-to-date schema?

strogonoff commented 2 years ago

@stefanomunarini or maybe we are using an outdated schema in tests?

strogonoff commented 2 years ago

@stefanomunarini There may also have been issues during the conversion of RNC to XSD; maybe this is what @ronaldtse meant.

Our XSD schema—I think it’s a result of semi-automatic RNC conversion—has refcontent here: https://github.com/relaton/relaton-py/blob/e44562f1b43268369ff973b197117e613f130a8e/relaton/tests/static/schemas/v3.xsd#L1362-L1373

Why does it fail? It looks like it allows an unbounded number of refcontent elements…

strogonoff commented 2 years ago

I think converting this:

https://github.com/ietf-tools/xml2rfc/blob/3d7e63462ea05b24c1a0d85a8edf1c4d9b8b4a3a/xml2rfc/data/v3.rnc#L1064

into this:

https://github.com/relaton/relaton-py/blob/e44562f1b43268369ff973b197117e613f130a8e/relaton/tests/static/schemas/v3.xsd#L1367-L1372

might be the source of this error.

I think (foo | bar)* in RNC means an inclusive OR (foo, bar, or both, in any order and any number of times), but the xs:choice we validate against seems to imply “foo XOR bar” (one or the other, but not both).

EDIT: However, because the xs:choice appears inside an xs:sequence, it should actually be valid?
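One rough way to sanity-check the reasoning above is to model a content model as a regular expression over the sequence of child element names. This is only an analogy built on Python's re module, not how lxml or any schema validator works internally. The names foo and bar are the placeholders from the comment, and the helper functions are hypothetical. The sketch contrasts (foo | bar)* with a single xs:choice that keeps the default minOccurs="1" and maxOccurs="1". An xs:choice with minOccurs="0" and maxOccurs="unbounded" would behave like the first pattern.

```python
import re


def child_names(children):
    """Join child element names into one string, each name followed by a space."""
    return "".join(name + " " for name in children)


# RNC-style (foo | bar)*: zero or more children, each foo or bar, in any order.
ZERO_OR_MORE_CHOICE = re.compile(r"(?:(?:foo|bar) )*")

# A single xs:choice with default occurrence limits: exactly one foo or one bar.
SINGLE_CHOICE = re.compile(r"(?:foo|bar) ")


def matches(pattern, children):
    """Return True if the whole child sequence fits the content model."""
    return pattern.fullmatch(child_names(children)) is not None


# A mixed, repeated sequence fits the starred choice but not the single choice.
mixed = ["foo", "bar", "foo"]
print(matches(ZERO_OR_MORE_CHOICE, mixed))
print(matches(SINGLE_CHOICE, mixed))
```

Under this model the "XOR" reading only applies when the choice may occur once. Whether the converted XSD sets the occurrence limits on its xs:choice is what would need checking against the linked lines.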

@stefanomunarini maybe you could look into how it’s supposed to work, whether our XSD matches the source RNC, and whether relaton-py’s version of lxml has any current bugs regarding schema validation…

stefanomunarini commented 2 years ago

It was indeed an instance of the validation process being broken. This commit fixes it: https://github.com/relaton/relaton-py/pull/33/commits/eaf8273efa642d7c67cd3b61c8f56e7922f6312f

ronaldtse commented 2 years ago

Thanks @stefanomunarini !