Handling poorly formed XML

Hi Steve, I'm working on a data exchange project, where a group of developers of similar software are trying to work out a way to move data between programs. We're passing DOCx files as part of this. Unfortunately, some of the files I need to import don't follow the DOCx format specification very well. So far, I've handled EMF images, improper paragraph formatting parameter values ("exactly" instead of "exact") and improper color specifications ( instead of using RGB values) and a few other minor issues. But there's one issue I haven't been able to crack. The documents I need to read specify line breaks incorrectly:

      <w:r>
        <w:t xml:space="preserve">
          It's eleven thirty.<w:br/><w:br/>
          It's eleven thirty.<w:br/><w:br/>
          (into phone) It's eleven thirty.<w:br/><w:br/>
          Diet Coke break.<w:br/><w:br/>
          Diet Coke break.<w:br/><w:br/>
          Diet Coke break.<w:br/><w:br/>
          (Music begins)<w:br/><w:br/>
        </w:t>
      </w:r>

instead of:

  <w:p w:rsidR="00575306" w:rsidRDefault="001D5A36">
    <w:r>
      <w:t>It's eleven thirty.</w:t>
    </w:r>
    <w:r>
      <w:br/>
    </w:r>
    <w:r>
      <w:br/>
      <w:t>It's eleven thirty.</w:t>
    </w:r>
    <w:r>
      <w:br/>
    </w:r>
    <w:r>
      <w:br/>
      <w:t>
        (into phone) It's eleven thirty.
      </w:t>
    </w:r>
    <w:r>
      <w:br/>
    </w:r>
    <w:r>
      <w:br/>
      <w:t>Diet Coke break.</w:t>
    </w:r> (etc)

My imports bring in only the first text block from each run, that is, the text that comes before the first "<w:br/>" tag, and skip the rest of the text. Effectively, the run ends at the improper tag. (The file loads in Word 2010 correctly.) The author of the program outputting the offending files talks about using OpenOffice standards, blames his DOCx module, and promises he'll have his programmer look at it, but I've concluded that my best course of action is to figure out how to correctly import these incorrect files. Despite feeling reasonably comfortable working with python-docx code, I can't seem to crack where and how to intervene regarding these extra line break tags. I'd appreciate any suggestions you can throw my way.

David Woods

What approach are you taking? The three I can think of are:

1. Write "loose" functions (often called "workaround functions" in python-docx context) that you can use instead of python-docx methods, like paragraph_text(paragraph) to "replace" paragraph.text.
2. Modify the base python-docx code and use your own fork.
3. Possibly code your own Paragraph subclass and monkey-patch it in so that it gets constructed instead of Paragraph in the right places. A variant of this that might work and is easy to monkey-patch is to provide a replacement CT_Paragraph class that just replaces a couple of "finding" methods.

All three of those would work. #1 is probably the least invasive and has the quickest time-to-value. #2 makes it hard to upgrade as new features come online. #3 is a little fancier than #1 and is probably how I would go in this sort of situation (but I have a lot of "internals" knowledge).

In any case, you'd want to intervene in and around the Paragraph object and its oxml units, perhaps here: https://github.com/python-openxml/python-docx/blob/master/docx/text/paragraph.py#L116

I'd say you want to start by identifying which parts of the python-docx interface you want to behave differently (I would keep this as narrow as possible). If you can get by with Paragraph.text, then I'd just write a replacement for that method. I'd be inclined to make free use of XPath to find the elements you want, in their document order, and then extract the text from them.

A Paragraph.text replacement would be a nice clean scope that you could implement as paragraph_text(paragraph) or one of the other choices. The oxml monkey-patching method would be more difficult and inherently more brittle because you'd essentially be trying to fool the existing python-docx code, which is subject to change (though that's probably unlikely).
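A rough sketch of what such a paragraph_text() could look like, working directly on the w:p element. It uses a plain recursive walk rather than XPath so that it runs against the standard library's ElementTree; lxml elements expose the same .tag, .text and .tail attributes, so the same function should apply to them. The names W_NS, walk and sample are only illustrative, and the handling is limited to w:t and w:br.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_T = "{%s}t" % W_NS
W_BR = "{%s}br" % W_NS


def paragraph_text(p_elm):
    """Text of a w:p element, tolerating w:br elements nested inside w:t."""
    parts = []

    def walk(elm, inside_t):
        if elm.tag == W_BR:
            parts.append("\n")
        elif elm.tag == W_T:
            parts.append(elm.text or "")
            for child in elm:
                walk(child, True)
        else:
            for child in elm:
                walk(child, inside_t)
        # Text that follows a child of w:t is stored on that child's .tail.
        if inside_t:
            parts.append(elm.tail or "")

    walk(p_elm, False)
    return "".join(parts)


# Illustrative input: one run with breaks inside w:t, one well-formed run.
sample = (
    '<w:p xmlns:w="%s"><w:r><w:t xml:space="preserve">'
    "It's eleven thirty.<w:br/><w:br/>"
    "(into phone) It's eleven thirty.<w:br/><w:br/>"
    "</w:t></w:r>"
    "<w:r><w:br/><w:t>Diet Coke break.</w:t></w:r></w:p>" % W_NS
)
text = paragraph_text(ET.fromstring(sample))
```

With python-docx you would presumably call it as paragraph_text(paragraph._p). Note that _p is a private attribute, so relying on it is an assumption about the library's internals.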

HOWEVER: now that I've looked at the example more closely, it looks like you can accommodate the single problem of br's in a t element by simply patching CT_Text here: https://github.com/python-openxml/python-docx/blob/master/docx/oxml/text/run.py#L107, to override its text property and do the right thing with embedded breaks. Not sure how easy that will be because those represent "mixed" content (elements interspersed with element text, not just child elements), like HTML <br> elements. You might need to make a separate parser for that or I'm not sure what. I'd study the lxml docs on that count.
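For what it's worth, here is a small sketch of how mixed content shows up in the element tree. It uses the standard library's ElementTree so it runs anywhere; lxml follows the same .text/.tail model. The helper name text_with_breaks is hypothetical, and how to wire it into CT_Text is left out because that depends on python-docx internals.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_BR = "{%s}br" % W_NS

t = ET.fromstring(
    '<w:t xmlns:w="%s" xml:space="preserve">'
    "It's eleven thirty.<w:br/><w:br/>Diet Coke break.<w:br/></w:t>" % W_NS
)

# .text holds only the text before the first child element, which would
# explain an import that stops at the first break.
first_block = t.text

# Text after each child element is stored on that child's .tail.
tails = [child.tail for child in t]


def text_with_breaks(t_elm):
    """Text of a w:t element, with each embedded w:br turned into a newline."""
    parts = [t_elm.text or ""]
    for child in t_elm:
        if child.tag == W_BR:
            parts.append("\n")
        parts.append(child.tail or "")
    return "".join(parts)


full_text = text_with_breaks(t)
```

So no separate parser may be needed: the text after each break is already in the tree, just on the tail of the br element rather than on the t element itself.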

python-openxml / python-docx

Handling poorly formed XML #625