Closed: kylegibson closed this issue 9 years ago
Based on some quick research, I don't think this is a bug in PyXB. PyXB is intended to operate on XML documents that are validated against XML schemas. XAML is an XML-based language that uses different validation semantics, in particular allowing individual documents to change which namespaces are validated. PyXB is not an XAML processor and won't ignore those namespaces, so you will get validation errors if they are referenced but cannot be validated.
You might first convert the document to DOM format, then run a preprocessing step that removes elements and attributes that have a prefix that appears in an {http://schemas.openxmlformats.org/markup-compatibility/2006}Ignorable attribute. PyXB should be able to process what's left.
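A minimal sketch of that preprocessing step, using only the standard library's xml.dom.minidom. The names strip_ignorable and _prune are made up for illustration, and the sketch assumes the Ignorable attribute appears only on the root element and that the listed prefixes are declared there. It does not handle the rest of the markup-compatibility rules.

```python
from xml.dom import minidom

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"


def strip_ignorable(xml_text):
    """Return xml_text without content from ignorable namespaces (sketch)."""
    doc = minidom.parseString(xml_text)
    root = doc.documentElement
    # The Ignorable attribute holds a whitespace-separated list of prefixes.
    prefixes = root.getAttributeNS(MC_NS, "Ignorable").split()
    # Assumption: each prefix is declared on the root element itself.
    ignorable = {root.getAttribute("xmlns:" + p) for p in prefixes}
    ignorable.discard("")  # prefixes with no declaration on the root
    ignorable.add(MC_NS)   # also drop the Ignorable attribute itself
    _prune(root, ignorable)
    return doc.toxml()


def _prune(elem, ignorable):
    """Remove attributes and child elements whose namespace is ignorable."""
    attrs = elem.attributes
    # Snapshot the attribute nodes first so removal doesn't disturb iteration.
    for attr in [attrs.item(i) for i in range(attrs.length)]:
        if attr.namespaceURI in ignorable:
            elem.removeAttributeNode(attr)
    for child in list(elem.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            if child.namespaceURI in ignorable:
                elem.removeChild(child)
            else:
                _prune(child, ignorable)
```

The cleaned text could then be handed to the generated binding module, for instance through its CreateFromDocument function. The namespace declarations themselves are left in place, which should be harmless.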
That the exception this produces doesn't have a nice text representation is a reasonable complaint, though. I've added that as issue #31.
Hi Peter,
I appreciate the thorough response. It looks like I can just extend the default SAX handler used by PyXB to filter out these elements and attributes. I don't quite have a working implementation yet, but I'll post it when I'm done.
Thanks, -Kyle
Hi Peter,
I have a SAX handler that overrides the default PyXB SAX handler to strip out the ignorable attributes. This appears to cause PyXB to raise ContentNondeterminismExceededError: Nondeterminism exceeded validating. The code in Configuration.candidateTransitions and AutomatonConfiguration.step is fairly complex, so I am struggling to resolve the issue. If there are any pointers, advice or references you could share, I would appreciate it.
I prefer to avoid having to pre-process the XML before passing it to PyXB.
Thanks, -Kyle
"Override" or "extend"? You probably shouldn't discard PyXB's SAX handler in favor of your own, but you could subclass it and overload some of the methods to strip out the attributes (and elements) that are in ignorable namespaces.
It may also simply be that the documents you're using are nondeterministic and exceed the configured threshold. You could try increasing PermittedNondeterminism slowly to see if there's a reasonable threshold that makes it pass. Be aware that the larger the value you use, the more memory PyXB may require to validate the document, and the longer it will take.
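The subclassing approach mentioned above could look roughly like the following sketch. It uses only the standard library's xml.sax. RecordingHandler is a hypothetical stand-in for the real base class (pyxb.binding.saxer.PyXBSAXHandler in this thread), and StripIgnorableHandler and parse are illustrative names. The sketch treats a namespace as ignorable for the rest of the document once it has been listed, which is simpler than the actual scoping rules.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces
from xml.sax.xmlreader import AttributesNSImpl

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"


class RecordingHandler(ContentHandler):
    """Hypothetical stand-in for the real base handler: records start events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElementNS(self, name, qname, attrs):
        self.events.append((name, list(attrs.keys())))


class StripIgnorableHandler(RecordingHandler):
    """Hides ignorable attributes and elements from the base handler."""

    def __init__(self):
        super().__init__()
        self._prefixes = {}        # prefix -> URI (simplified: never popped)
        self._ignorable = {MC_NS}  # namespace URIs to hide
        self._skip_depth = 0       # > 0 while inside a dropped element

    def startPrefixMapping(self, prefix, uri):
        self._prefixes[prefix] = uri
        super().startPrefixMapping(prefix, uri)

    def startElementNS(self, name, qname, attrs):
        if self._skip_depth:
            self._skip_depth += 1  # nested inside a dropped element
            return
        # Register any namespaces this element declares as ignorable.
        for prefix in attrs.get((MC_NS, "Ignorable"), "").split():
            if prefix in self._prefixes:
                self._ignorable.add(self._prefixes[prefix])
        if name[0] in self._ignorable:
            self._skip_depth = 1   # drop this element and its subtree
            return
        kept = {k: v for k, v in attrs.items() if k[0] not in self._ignorable}
        qnames = {k: attrs.getQNameByName(k) for k in kept}
        super().startElementNS(name, qname, AttributesNSImpl(kept, qnames))

    def endElementNS(self, name, qname):
        if self._skip_depth:
            self._skip_depth -= 1
            return
        super().endElementNS(name, qname)

    def characters(self, content):
        if not self._skip_depth:
            super().characters(content)


def parse(xml_text, handler):
    """Run a namespace-aware SAX parse of xml_text through handler."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return handler
```

Filtering in the handler avoids a separate pass over the document. The base handler only ever sees the events that survive the filter.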
Sorry, I mean extend. I'm subclassing pyxb.binding.saxer.PyXBSAXHandler and overloading the startElementNS method. That part seems to be doing exactly what I want.
My hypothesis was that the ContentNondeterminismExceededError exception was being caused by my SAX handler. To test that, I manually removed all of the ignorable attributes from the XML document and attempted to load it using PyXB. I got the same ContentNondeterminismExceededError exception. I increased PermittedNondeterminism to 1024, and the exception still occurs. I've encountered this issue on all but one of my sample documents so far. That particular sample is a very simple document containing only a single word. It's not yet clear to me what's special about my other samples that is causing this problem.
I also don't understand the purpose of the determinism check. It's not clear to me what non-determinism means in this context, or why/whether it's a problem. If there are any references or advice you could share, I would appreciate it.
Thanks, -Kyle
Try this Stack Overflow question, this PyXB test case, and possibly the technical references in the PyXB FAC documentation. More generally, a Google search for "nondeterminism in xml" might be fruitful, or for the more common topic of nondeterministic finite automata.
PyXB "resolves" nondeterminism by executing multiple candidate parses in parallel until only one succeeds or the number of potential candidates exceeds the limit. In grossly nondeterministic languages this can happen with pretty small documents.
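To make the idea concrete, here is a toy model of that strategy. It is not PyXB's actual code. The validate function, the NondeterminismExceeded exception and the permitted parameter are all illustrative names, with permitted playing the role that PermittedNondeterminism plays in the discussion above. Each candidate parse is represented as the path of automaton states it has taken so far.

```python
class NondeterminismExceeded(Exception):
    """Raised when too many candidate parses are alive at once (toy model)."""


def validate(transitions, start, accepting, symbols, permitted=20):
    """Advance every candidate parse in lockstep over symbols.

    transitions maps (state, symbol) to the set of possible next states.
    Returns the list of candidate paths that end in an accepting state,
    or None if no candidate can consume some symbol.
    """
    candidates = [(start,)]
    for symbol in symbols:
        advanced = []
        for path in candidates:
            # Each possible next state forks the candidate.
            for nxt in sorted(transitions.get((path[-1], symbol), ())):
                advanced.append(path + (nxt,))
        if not advanced:
            return None
        if len(advanced) > permitted:
            raise NondeterminismExceeded(
                "%d candidates exceed limit %d" % (len(advanced), permitted))
        candidates = advanced
    return [path for path in candidates if path[-1] in accepting]


# Content model resembling (a*, a*): state 0 is the first group, state 1
# the second. On each "a" a parse in state 0 may stay or move on to state 1.
AMBIGUOUS = {(0, "a"): {0, 1}, (1, "a"): {1}}
```

With this content model, n consecutive "a" symbols leave n + 1 candidates alive, because nothing in the input ever tells the validator where the first group ended. A real schema with several such overlapping choices grows much faster, which is consistent with small documents exceeding the limit.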
Thanks so much for your help Peter, it is greatly appreciated.
After some reading and testing, it appears that I will not be able to utilize PyXB generated bindings to open and interact with ECMA-376 (v2008 transitional) documents due to this issue with nondeterminism.
For example, the following document requires a PermittedNondeterminism of 12288, and takes about 5 seconds on my system (quad core, 16 GB RAM) to process:
<?xml version="1.0"?>
<w:document xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
<w:body>
<w:p>
<w:pPr>
<w:pStyle w:val="Normal"/>
<w:rPr/>
</w:pPr>
<w:ins w:id="1" w:author="Foo" w:date="2013-01-29T14:31:00Z">
<w:r>
<w:rPr>
<w:b/>
</w:rPr>
<w:t>This is an insertion</w:t>
</w:r>
</w:ins>
<w:ins w:id="2" w:author="Foo" w:date="2013-01-29T14:31:00Z">
<w:r>
<w:rPr/>
<w:t xml:space="preserve">. </w:t>
</w:r>
</w:ins>
<w:r>
<w:rPr>
<w:b/>
</w:rPr>
<w:t xml:space="preserve">This is </w:t>
</w:r>
<w:del w:id="3" w:author="Foo" w:date="2013-02-05T18:50:00Z">
<w:r>
<w:rPr>
<w:b/>
</w:rPr>
<w:delText>the</w:delText>
</w:r>
</w:del>
<w:r>
<w:rPr>
<w:b/>
</w:rPr>
<w:t xml:space="preserve"> end</w:t>
</w:r>
<w:r>
<w:rPr/>
<w:t xml:space="preserve"> of the</w:t>
</w:r>
<w:ins w:id="4" w:author="Foo" w:date="2013-01-29T14:31:00Z">
<w:r>
<w:rPr/>
<w:t xml:space="preserve"> inserted</w:t>
</w:r>
</w:ins>
<w:r>
<w:rPr/>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:commentRangeStart w:id="0"/>
<w:r>
<w:rPr/>
<w:t>paragraph</w:t>
</w:r>
<w:commentRangeEnd w:id="0"/>
<w:r>
<w:rPr/>
</w:r>
<w:r>
<w:rPr/>
<w:commentReference w:id="0"/>
</w:r>
<w:r>
<w:rPr/>
<w:t>.</w:t>
</w:r>
</w:p>
</w:body>
</w:document>
PyXB includes the ECMA-376 generating script, and while it can generate the bindings without issue, actually using them in practice doesn't appear reliable. Is that your experience with ECMA-376?
I have no personal experience using the ECMA-376 bindings; they were added primarily as an example after another user had problems generating them. To my knowledge that user was able to accomplish hir task with them, but may have been using a namespace that wasn't as generic.
Using Word 2013, I created a very simple .docx. I extracted the .docx, and attempted to load word/document.xml using the binding classes I generated using pyxbgen from the transitional schema files. This causes pyxb to raise UnrecognizedAttributeError, apparently due to the mc:Ignorable attribute on the document element:
I'm able to avoid this exception and load the document by removing the mc:Ignorable attribute, because the namespaces defined in the ignorable are not being used in this particular document. I do have other documents which reference these other ignorable namespaces, which causes pyxb to fail.
Traceback:
The traceback doesn't actually indicate the attribute name, which is itself unfortunate. attr_en is passed in the exception, but it's not evaluated unless you convert it to a string. I used an ipdb.set_trace (apparent in the above traceback) to reveal the attribute name.