python-openxml / python-docx

Create and modify Word documents with Python
MIT License
4.63k stars 1.13k forks source link

Enhancement: OPC OOXML support #892

Open c-bik opened 4 years ago

c-bik commented 4 years ago

Hi,

First off thanks a lot for this awesome library. It does a great job with most of the normal cases.

Recently, however, I came across a case (Word.Body. getOoxml() from Microsoft's Office-JS) where I couldn't find a way to parse/load that into a docx object directly.

I am receiving an XML (MIME application/xml) of the following structure:

<?xml version="1.0" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage">
    <pkg:part pkg:name="/_rels/.rels" pkg:contentType="application/vnd.openxmlformats-package.relationships+xml" pkg:padding="512">
        <pkg:xmlData>
            <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
                <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
            </Relationships>
        </pkg:xmlData>
    </pkg:part>
    <pkg:part pkg:name="/word/_rels/document.xml.rels" pkg:contentType="application/vnd.openxmlformats-package.relationships+xml" pkg:padding="256">
        <pkg:xmlData>
            <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
                <Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer1.xml"/>
                ...
                <Relationship Id="rId9" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer2.xml"/>
            </Relationships>
        </pkg:xmlData>
    </pkg:part>
    <pkg:part pkg:name="/word/document.xml" pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml">
        <pkg:xmlData>
            <w:document ...>
                <w:body>
                    <w:p w:rsidR="009606CA" w:rsidRDefault="009606CA">
                        ...
                    </w:p>
                    ...
                </w:body>
            </w:document>
        </pkg:xmlData>
    </pkg:part>
    <pkg:part pkg:name="/word/footer2.xml" pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml">
        <pkg:xmlData>
            <w:ftr ...>
                ...
            </w:ftr>
        </pkg:xmlData>
    </pkg:part>
    ...
    <pkg:part pkg:name="/word/media/image1.jpg" pkg:contentType="image/jpeg" pkg:compression="store">
        <pkg:binaryData>... base64 encoded image binary...
        </pkg:binaryData>
    </pkg:part>
    ...
    <pkg:part pkg:name="/word/webSettings.xml" pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml">
    </pkg:part>
</pkg:package>

It looked like a serialised from of the docx package, though I couldn't find any schema (XSD etc) for <pkg:package>, so I attempted to implement support directly into python-docx and as a result got that working for me.

The work-in-progress diff of the implementation: https://github.com/python-openxml/python-docx/compare/master...c-bik:load-from-opc-ooxml-support

Usage:

from docx import Document

if __name__ == '__main__':
    ooxml: bytes = b'<pkg:package>...</pkg:package>'
    document = Document(ooxml)
    print([p.text for p in document.paragraphs])

Now I am wondering how should it really be done so I can potentially pull-request such a feature? If this feature is useful/interesting for a future python_docx release I can then invest some time to work on it to turn it into an acceptable PR.

Looking forward.

Best, Bikram

scanny commented 3 years ago

@c-bik Thanks for reaching out. I don't expect we would want to add this into python-docx directly, but perhaps it is an opportunity for a companion package like python-docx-templates that builds on python-docx as a dependency.

c-bik commented 3 years ago

@scanny Thanks for the response. It feels like this feature alone is too small to be part of another companion package rather than a standalone package of its own. Do you think I should PR to python-docx-templates in stead?

scanny commented 3 years ago

I don't think it fits with python-docx-templates, I was just using that as an example of a companion package. I'm not really sure what the use case for this would be, maybe you can explain that a bit more.

c-bik commented 3 years ago

I don't think it fits with python-docx-templates, I was just using that as an example of a companion package.

Ah. I got you now. So something like python-openxml/python-docx-package?

I'm not really sure what the use case for this would be, maybe you can explain that a bit more.

I don't suppose there would be a very significant/noticeable use case coming out of this. Perhaps a small expansion of docx loading scope.

Currently docx.Document(docx=None) can load from any path or file-like objects which seamlessly integrates with any HTTP file upload endpoint where the content is expected to be binary docx payload (zipped folder structure) etc.

However, Microsoft Add-ins for Word provides an API to get the package level single XML for entire word document (as you can see an example in OP above, <pkg:package><pkg:part>...). At the moment there is no suitable way to plug such an XML directly (such from an HTTP POST body) into python-docx.

In such a case, what we can do now is parse the parts into some in-memory folder structure as expected in docx (open xml) format, zip it and then pass that as a stream into python-docx. Since python-docx is going to parse / process it again, in the PR I'm proposing, I'm trying to eliminate all that by directly accepting (also) a byte stream of such an XML through the same docx.Document(docx=None) interface and then proceeding with the parsing a bit differently.

I must also mention that, apart from the MS/TS API I mentioned above, I don't know of any other source of such an XML.