"mc:Ignorable" causes UnrecognizedAttributeError in EMCA-376 document

kylegibson commented 9 years ago

Using Word 2013, I created a very simple .docx. I extracted the .docx, and attempted to load word/document.xml using the binding classes I generated using pyxbgen from the transitional schema files. This causes pyxb to raise UnrecognizedAttributeError apparently due to the mc:Ignorable attribute on the document element:

<w:document xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"  xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"  xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" mc:Ignorable="w14 w15 wp14">

I can avoid this exception and load the document by removing the mc:Ignorable attribute, because the namespaces it lists are not used in this particular document. I do have other documents that use these ignorable namespaces, and those cause pyxb to fail.

Traceback:

----> 1 result = w.CreateFromDocument(data)

/home/kyle/dev/pydocx/transitional/w.pyc in CreateFromDocument(xml_text, default_namespace, location_base)
     60     if isinstance(xmld, _six.text_type):
     61         xmld = xmld.encode(pyxb._InputEncoding)
---> 62     saxer.parse(io.BytesIO(xmld))
     63     instance = handler.rootObject()
     64     return instance

/home/kyle/.pyenv/versions/2.7.6/lib/python2.7/xml/sax/expatreader.pyc in parse(self, source)
    105         self.reset()
    106         self._cont_handler.setDocumentLocator(ExpatLocator(self))
--> 107         xmlreader.IncrementalParser.parse(self, source)
    108
    109     def prepareParser(self, source):

/home/kyle/.pyenv/versions/2.7.6/lib/python2.7/xml/sax/xmlreader.pyc in parse(self, source)
    121         buffer = file.read(self._bufsize)
    122         while buffer != "":
--> 123             self.feed(buffer)
    124             buffer = file.read(self._bufsize)
    125         self.close()

/home/kyle/.pyenv/versions/2.7.6/lib/python2.7/xml/sax/expatreader.pyc in feed(self, data, isFinal)
    208             # document. When feeding chunks, they are not normally final -
    209             # except when invoked from close.
--> 210             self._parser.Parse(data, isFinal)
    211         except expat.error, e:
    212             exc = SAXParseException(expat.ErrorString(e.code), e, self)

/home/kyle/.pyenv/versions/2.7.6/lib/python2.7/xml/sax/expatreader.pyc in start_element_ns(self, name, attrs)
    339
    340         self._cont_handler.startElementNS(pair, None,
--> 341                                           AttributesNSImpl(newattrs, qnames))
    342
    343     def end_element_ns(self, name):

/home/kyle/.pyenv/versions/2.7.6/lib/python2.7/site-packages/pyxb/binding/saxer.pyc in startElementNS(self, name, qname, attrs)
    368         # Process the element start.  This may or may not return a
    369         # binding object.
--> 370         binding_object = this_state.startBindingElement(type_class, new_object_factory, element_decl, attrs)
    371
    372         # If the top-level element has complex content, this sets the

/home/kyle/.pyenv/versions/2.7.6/lib/python2.7/site-packages/pyxb/binding/saxer.pyc in startBindingElement(self, type_class, new_object_factory, element_decl, attrs)
    205             try:
    206                 pyxb.namespace.NamespaceContext.PushContext(self.namespaceContext())
--> 207                 self.__constructElement(new_object_factory, attrs)
    208             finally:
    209                 pyxb.namespace.NamespaceContext.PopContext()

/home/kyle/.pyenv/versions/2.7.6/lib/python2.7/site-packages/pyxb/binding/saxer.pyc in __constructElement(self, new_object_factory, attrs, content)
    133             # The binding instance may be a simple type that does not support
    134             # attributes; the following raises an exception in that case.
--> 135             self.__bindingInstance._setAttribute(attr_en, attrs.getValue(attr_name))
    136
    137         return self.__bindingInstance

/home/kyle/.pyenv/versions/2.7.6/lib/python2.7/site-packages/pyxb/binding/basis.pyc in _setAttribute(self, attr_en, value_lex)
   2236             if self._AttributeWildcard is None:
   2237                 import ipdb; ipdb.set_trace()
-> 2238                 raise pyxb.UnrecognizedAttributeError(type(self), attr_en, self)
   2239             self.__wildcardAttributeMap[attr_en] = value_lex
   2240         else:

UnrecognizedAttributeError: (<class '_nsgroup_.CT_Document'>, <pyxb.namespace.ExpandedName object at 0x2020bd0>, <_nsgroup_.CT_Document object at 0x2020450>, pyxb.utils.utility.Location(None, 2, 0))

The traceback doesn't actually show the attribute name, which is unfortunate in itself. attr_en is passed to the exception, but it is not rendered readably unless you convert it to a string. I used an ipdb.set_trace (visible in the traceback above) to reveal the attribute name.

pabigot commented 9 years ago

Based on some quick research, I don't think this is a bug in PyXB. PyXB is intended to operate on XML documents that are validated against XML schemas. XAML is an XML-based language that uses different validation semantics, in particular allowing individual documents to change which namespaces are validated. PyXB is not a XAML processor and won't ignore those namespaces, so you will get validation errors if they are referenced but not validatable.

You might first convert the document to DOM format, then run a preprocessing step that removes elements and attributes that have a prefix that appears in an {http://schemas.openxmlformats.org/markup-compatibility/2006}Ignorable attribute. PyXB should be able to process what's left.
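The preprocessing step described above could look roughly like the following. This is only a sketch, written for Python 3 with the standard library's xml.dom.minidom. The names strip_ignorable and _strip are made up for illustration. It resolves the prefixes listed in mc:Ignorable to namespace URIs rather than comparing raw prefixes, and it ignores the other markup-compatibility constructs (mc:ProcessContent, mc:AlternateContent), so treat it as a starting point rather than a complete implementation.

```python
from xml.dom import Node, XMLNS_NAMESPACE, minidom

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"


def strip_ignorable(xml_text):
    """Return the document with content in mc:Ignorable namespaces removed."""
    doc = minidom.parseString(xml_text)
    _strip(doc.documentElement, {}, frozenset())
    return doc.toxml()


def _strip(element, prefixes, ignorable):
    # Track the xmlns:prefix declarations that are in scope at this element.
    prefixes = dict(prefixes)
    for attr in list(element.attributes.values()):
        if attr.namespaceURI == XMLNS_NAMESPACE and attr.prefix == "xmlns":
            prefixes[attr.localName] = attr.value

    # mc:Ignorable is a whitespace-separated list of prefixes; it applies to
    # this element and everything below it.
    if element.hasAttributeNS(MC_NS, "Ignorable"):
        listed = element.getAttributeNS(MC_NS, "Ignorable").split()
        ignorable = ignorable | {prefixes[p] for p in listed if p in prefixes}
        element.removeAttributeNS(MC_NS, "Ignorable")

    # Drop attributes that live in an ignorable namespace.
    for attr in list(element.attributes.values()):
        if attr.namespaceURI in ignorable:
            element.removeAttributeNode(attr)

    # Drop ignorable child elements (with their content); recurse into the rest.
    for child in list(element.childNodes):
        if child.nodeType != Node.ELEMENT_NODE:
            continue
        if child.namespaceURI in ignorable:
            element.removeChild(child)
        else:
            _strip(child, prefixes, ignorable)
```

The string returned by strip_ignorable(data) would then be what gets passed to CreateFromDocument in place of the original text.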

That the exception this produces doesn't have a nice text representation is a reasonable complaint, though. I've added that as issue #31.

kylegibson commented 9 years ago

Hi Peter,

I appreciate the thorough response. It looks like I can just extend the default SAX handler used by PyXB to filter out these elements and attributes. I don't quite have a working implementation yet, but I'll post it when I'm done.

Thanks, -Kyle

kylegibson commented 9 years ago

Hi Peter,

I have a SAX handler that overrides the default PyXB SAX handler to strip out the ignorable attributes. This appears to cause PyXB to raise ContentNondeterminismExceededError: Nondeterminism exceeded validating. The code in Configuration.candidateTransitions and AutomatonConfiguration.step is fairly complex, so I am struggling to resolve the issue. If there are any pointers, advice or references you could share, I would appreciate it.

I prefer to avoid having to pre-process the XML before passing it to PyXB.

Thanks, -Kyle

pabigot commented 9 years ago

"Override" or "extend"? You probably shouldn't discard PyXB's SAX handler in favor of your own, but you could subclass it and overload some of the methods to strip out the attributes (and elements) that are in ignorable namespaces.

It may also simply be that the documents you're using are nondeterministic and exceed the configured threshold. You could try increasing PermittedNondeterminism slowly to see if there's a reasonable threshold that makes it pass. Be aware that the larger the value you use, the more memory PyXB may require to validate the document, and the longer it will take.

kylegibson commented 9 years ago

Sorry, I mean extend. I'm subclassing pyxb.binding.saxer.PyXBSAXHandler and overloading the startElementNS method. That part seems to be doing exactly what I want.
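For readers who want to try the same approach, here is a rough sketch of the filtering logic as a mixin, written for Python 3 against the standard library's xml.sax only. It is not the handler from this thread. The names IgnorableFilterMixin, Recorder, FilteringRecorder and parse are invented for illustration, and whether pyxb.binding.saxer.PyXBSAXHandler cooperates with a mixin like this is an untested assumption. The demo uses a plain ContentHandler subclass in its place.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces
from xml.sax.xmlreader import AttributesNSImpl

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"


class IgnorableFilterMixin:
    """List this class before the real namespace-aware handler in the bases."""

    def startDocument(self):
        self._uris = {}                  # prefix -> stack of namespace URIs
        self._ignorable = [frozenset()]  # ignorable namespaces per open element
        self._skipping = 0               # depth inside a dropped subtree
        super().startDocument()

    def startPrefixMapping(self, prefix, uri):
        self._uris.setdefault(prefix, []).append(uri)
        super().startPrefixMapping(prefix, uri)

    def endPrefixMapping(self, prefix):
        self._uris[prefix].pop()
        super().endPrefixMapping(prefix)

    def startElementNS(self, name, qname, attrs):
        if self._skipping:
            self._skipping += 1
            return
        # Extend the inherited ignorable set with any prefixes listed here.
        ignorable = self._ignorable[-1]
        for prefix in attrs.get((MC_NS, "Ignorable"), "").split():
            if self._uris.get(prefix):
                ignorable = ignorable | {self._uris[prefix][-1]}
        if name[0] in ignorable:
            self._skipping = 1  # drop this element and everything inside it
            return
        self._ignorable.append(ignorable)
        # Rebuild the attribute set without mc:Ignorable and ignorable attributes.
        kept = {key: value for key, value in attrs.items()
                if key[0] not in ignorable and key != (MC_NS, "Ignorable")}
        qnames = {key: attrs.getQNameByName(key) for key in kept}
        super().startElementNS(name, qname, AttributesNSImpl(kept, qnames))

    def endElementNS(self, name, qname):
        if self._skipping:
            self._skipping -= 1
            return
        self._ignorable.pop()
        super().endElementNS(name, qname)

    def characters(self, content):
        if not self._skipping:
            super().characters(content)


class Recorder(ContentHandler):
    """Stand-in for the real handler: records element and attribute names."""

    def startDocument(self):
        self.events = []

    def startElementNS(self, name, qname, attrs):
        self.events.append((name[1], sorted(key[1] for key in attrs.keys())))


class FilteringRecorder(IgnorableFilterMixin, Recorder):
    pass


def parse(xml_bytes, handler):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler
```

The mixin has to come first in the base-class list so that its startElementNS runs before the wrapped handler's. Prefix-mapping events for dropped elements are still forwarded, which is harmless for the recorder but may or may not matter for a real handler.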

My hypothesis was that the ContentNondeterminismExceededError exception was being caused by my SAX handler. To test that, I manually removed all of the ignorable attributes from the XML document, and attempted to load it using PyXB. I got the same ContentNondeterminismExceededError exception. I increased the PermittedNondeterminism to 1024. Exception still occurs. I've encountered this issue on almost all of my sample documents thus far except one. That particular sample is a very simple document, only containing a single word. It's not yet clear to me what's special about my other samples that is causing this problem.

I also don't understand the purpose of the determinism check. It's not clear to me what non-determinism means in this context, or whether and why it's a problem. If there are any references or advice you could share, I would appreciate it.

Thanks, -Kyle

pabigot commented 9 years ago

Try this Stack Overflow question, this PyXB test case, and possibly the technical references in the PyXB FAC documentation. More generally, a Google search for "nondeterminism in xml" might be fruitful, or one for the more common term "nondeterministic finite automata".

PyXB "resolves" nondeterminism by executing multiple candidate parses in parallel until only one succeeds or the number of potential candidates exceeds the limit. In grossly nondeterministic languages this can happen with pretty small documents.
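To make the idea concrete, here is a toy illustration in Python of running candidate parses side by side with a cap on how many may be alive at once. It is not PyXB's algorithm or code. The function accepts, the exception NondeterminismExceeded, the table AMBIGUOUS and the default limit of 16 are all invented for this sketch. Each candidate is kept separately rather than merged, loosely mimicking how distinct configurations stay distinct.

```python
class NondeterminismExceeded(Exception):
    """Raised when too many candidate parses are alive at once."""


def accepts(transitions, start, accepting, symbols, permitted=16):
    """Simulate a nondeterministic automaton over `symbols`.

    `transitions` maps (state, symbol) to a list of possible next states.
    Every alternative is followed; the run fails loudly if the number of
    live candidates ever exceeds `permitted`.
    """
    candidates = [start]
    for symbol in symbols:
        candidates = [
            nxt
            for state in candidates
            for nxt in transitions.get((state, symbol), [])
        ]
        if not candidates:
            return False  # no candidate can consume this symbol
        if len(candidates) > permitted:
            raise NondeterminismExceeded(
                "%d candidates exceed limit %d" % (len(candidates), permitted))
    return any(state in accepting for state in candidates)


# An ambiguous content model: every "a" can be matched in two ways, so the
# number of candidates doubles with each symbol consumed.
AMBIGUOUS = {
    (0, "a"): [1, 2],
    (1, "a"): [1, 2],
    (2, "a"): [1, 2],
}
```

With this table, four "a" symbols leave 16 candidates and pass, while a fifth leaves 32 and trips the default limit. That is why even a short input can fail when the grammar is ambiguous enough, and why raising the limit costs memory and time.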

kylegibson commented 9 years ago

Thanks so much for your help Peter, it is greatly appreciated.

After some reading and testing, it appears that I will not be able to utilize PyXB generated bindings to open and interact with ECMA-376 (v2008 transitional) documents due to this issue with nondeterminism.

For example, the following document requires a PermittedNondeterminism of 12288, and takes about 5 seconds to process on my system (quad core, 16 GB RAM):

<?xml version="1.0"?>
<w:document xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
  <w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Normal"/>
        <w:rPr/>
      </w:pPr>
      <w:ins w:id="1" w:author="Foo" w:date="2013-01-29T14:31:00Z">
        <w:r>
          <w:rPr>
            <w:b/>
          </w:rPr>
          <w:t>This is an insertion</w:t>
        </w:r>
      </w:ins>
      <w:ins w:id="2" w:author="Foo" w:date="2013-01-29T14:31:00Z">
        <w:r>
          <w:rPr/>
          <w:t xml:space="preserve">. </w:t>
        </w:r>
      </w:ins>
      <w:r>
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t xml:space="preserve">This is </w:t>
      </w:r>
      <w:del w:id="3" w:author="Foo" w:date="2013-02-05T18:50:00Z">
        <w:r>
          <w:rPr>
            <w:b/>
          </w:rPr>
          <w:delText>the</w:delText>
        </w:r>
      </w:del>
      <w:r>
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t xml:space="preserve"> end</w:t>
      </w:r>
      <w:r>
        <w:rPr/>
        <w:t xml:space="preserve"> of the</w:t>
      </w:r>
      <w:ins w:id="4" w:author="Foo" w:date="2013-01-29T14:31:00Z">
        <w:r>
          <w:rPr/>
          <w:t xml:space="preserve"> inserted</w:t>
        </w:r>
      </w:ins>
      <w:r>
        <w:rPr/>
        <w:t xml:space="preserve"> </w:t>
      </w:r>
      <w:commentRangeStart w:id="0"/>
      <w:r>
        <w:rPr/>
        <w:t>paragraph</w:t>
      </w:r>
      <w:commentRangeEnd w:id="0"/>
      <w:r>
        <w:rPr/>
      </w:r>
      <w:r>
        <w:rPr/>
        <w:commentReference w:id="0"/>
      </w:r>
      <w:r>
        <w:rPr/>
        <w:t>.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>

PyXB includes the ECMA-376 generating script, and while it can generate the bindings without issue, actually using them in practice doesn't appear to be reliable. Is that your experience with ECMA-376?

pabigot commented 9 years ago

I have no personal experience using the ECMA-376 bindings; they were added primarily as an example after another user had problems generating them. To my knowledge that user was able to accomplish hir task with them, but may have been using a namespace that wasn't as generic.

pabigot / pyxb

"mc:Ignorable" causes UnrecognizedAttributeError in EMCA-376 document #30