python-openxml / python-docx

Create and modify Word documents with Python
MIT License

extension: support .docm (macro-enabled) Word files #284

Open JosephHardy91 opened 8 years ago

JosephHardy91 commented 8 years ago

This module is great, but it doesn't currently work with docm files or macros.

Add in a way to both import macros into the VBProject and save docm files.

Currently, the only way to do this is pywin32, and if you both import a macro and save as docm for 5 files, it takes about 30 seconds to 1 minute, which is obviously unacceptable.

Your module is perfectly positioned to take care of this problem and I'd hope to see an improvement sometime in the future.

Thanks!

scanny commented 8 years ago

Do you have an example of a .docm file you can share? (You can just drag and drop it in the comment box in response.)

I had understood that macros were stored in a binary format, so I'm not sure how you'd be able to generate them in Python, even if you could "import" them.

But I'll take a look at the file structure and see what we see.

ethack commented 8 years ago

I would like to see this implemented as well. I've attached a sample document containing a Hello World macro.

Since Github wouldn't let me upload the document as-is, I changed the extension to ".zip".

MacroTest.docm.zip

For reference, here is the macro I put in the document:

Sub AutoOpen()
    MsgBox ("Hello, World!")
End Sub

scanny commented 8 years ago

Thanks @ethack, just the ticket :)

Here's what the innards look like:

$ unzip -l MacroTest.docm
Archive:  MacroTest.docm
  Length     Date   Time    Name
 --------    ----   ----    ----
     2485  01-01-80 00:00   [Content_Types].xml
      590  01-01-80 00:00   _rels/.rels
     1976  01-01-80 00:00   word/_rels/document.xml.rels
     1977  01-01-80 00:00   word/document.xml
     1295  01-01-80 00:00   word/header3.xml
     1295  01-01-80 00:00   word/footer2.xml
     1295  01-01-80 00:00   word/footer1.xml
     1295  01-01-80 00:00   word/header2.xml
     1295  01-01-80 00:00   word/header1.xml
     1675  01-01-80 00:00   word/endnotes.xml
     1681  01-01-80 00:00   word/footnotes.xml
     1295  01-01-80 00:00   word/footer3.xml
     6795  01-01-80 00:00   word/theme/theme1.xml
      277  01-01-80 00:00   word/_rels/vbaProject.bin.rels
     9728  01-01-80 00:00   word/vbaProject.bin
     2699  01-01-80 00:00   word/settings.xml
     1367  01-01-80 00:00   word/vbaData.xml
      497  01-01-80 00:00   word/webSettings.xml
    29856  01-01-80 00:00   word/styles.xml
      712  01-01-80 00:00   docProps/app.xml
      727  01-01-80 00:00   docProps/core.xml
     1261  01-01-80 00:00   word/fontTable.xml
 --------                   -------
    72073                   22 files

Note the item word/vbaProject.bin. I've had a quick look, and no surprise, but it's definitely not plain text.
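A similar check can be done from Python with the standard-library `zipfile` module. This is only a minimal sketch: the helper name `list_vba_parts` is made up for illustration, and matching on "vba" in the member name is a heuristic taken from the listing above, not a rule from the file-format specification.

```python
import zipfile


def list_vba_parts(package):
    """Return member names in an OPC package that look VBA-related.

    `package` is a path or a file-like object. The "vba" substring test is
    an assumption based on the listing above, not on the format spec.
    """
    with zipfile.ZipFile(package) as zf:
        return sorted(name for name in zf.namelist() if "vba" in name.lower())
```

Calling `list_vba_parts("MacroTest.docm")` on the sample file should pick out the `vbaProject.bin` part and its two companion files.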

The other two files of interest are vbaProject.bin.rels:

$ opc browse MacroTest.docm vbaProject.bin.rels
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="x" Type="http://schemas.microsoft.com/office/2006/relationships/wordVbaData" Target="vbaData.xml"/>
</Relationships>

... and vbaData.xml:

$ opc browse MacroTest.docm word/vbaData.xml
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<wne:vbaSuppData
    xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:w10="urn:schemas-microsoft-com:office:word"
    xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
    xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
    xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
    xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
    xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
    xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
    xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
    xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
    mc:Ignorable="w14 w15 wp14"
    >
  <wne:mcds>
    <wne:mcd wne:macroName="PROJECT.THISDOCUMENT.AUTOOPEN" wne:name="Project.ThisDocument.AutoOpen" wne:bEncrypt="00" wne:cmg="56"/>
  </wne:mcds>
</wne:vbaSuppData>

The relationships one would be straightforward, and the data one as well I suppose, assuming there's a schema or other documentation out there, or someone willing to do the reverse engineering to determine what goes in there.
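Reading the data part back is at least simple. As a rough sketch, the macro names can be pulled out of `vbaData.xml` with the standard library. The function name `macro_names` is hypothetical, and the code assumes only the `wne:mcd` / `wne:name` structure visible in the sample above, not any documented schema.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:wne declaration in the sample above.
WNE = "http://schemas.microsoft.com/office/word/2006/wordml"


def macro_names(vba_data_xml):
    """Return the wne:name attribute of every wne:mcd element.

    `vba_data_xml` is the raw bytes (or text) of word/vbaData.xml.
    """
    root = ET.fromstring(vba_data_xml)
    return [mcd.get(f"{{{WNE}}}name") for mcd in root.iter(f"{{{WNE}}}mcd")]
```

Writing such a part from scratch would still need someone to work out what the other attributes (`wne:bEncrypt`, `wne:cmg`) mean.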

But I don't know how we'd do the binary bit. I would expect it's proprietary and undocumented, but haven't really looked into it.

Do you have any idea of what functionality might be useful and how it might be implemented?

ethack commented 8 years ago

I'm still doing research into this, but thought I'd share what I have so far. It appears the format is documented (how well, I have yet to determine) and some Python tools exist to parse it already.

https://msdn.microsoft.com/en-us/library/cc313094%28v=office.12%29.aspx

http://www.decalage.info/vba_tools

The functionality I'd be looking for specifically would be to pass a string that contains the raw VBA source code and have a function that inserts it into a Word document. I've long been wanting a way to programmatically generate VBA source code and insert it into a document. Personally, this is useful to me if I want to generate a bunch of documents that are similar and only differ with say a single variable in the macro code.
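The source-generation half of that is plain string templating. A minimal sketch, assuming the Hello World macro from earlier as the template (the names `VBA_TEMPLATE` and `render_macro` are made up for illustration, and nothing here inserts the result into a document):

```python
# Template based on the AutoOpen macro posted earlier in this thread.
VBA_TEMPLATE = '''Sub AutoOpen()
    MsgBox ("{message}")
End Sub
'''


def render_macro(message):
    """Return VBA source with `message` substituted into the MsgBox call.

    VBA escapes a double quote inside a string literal by doubling it.
    """
    return VBA_TEMPLATE.format(message=message.replace('"', '""'))
```

Each generated document would then call `render_macro` with its own value. Getting the resulting text into `vbaProject.bin` is the part that still needs a tool.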

scanny commented 8 years ago

This is very interesting @ethack :)

From time to time the only way to discover how the MS API for Word (or PowerPoint for python-pptx) behaves is to experiment. And this is a bit of a pain given what it takes to do that when you're not running natively on Windows. We often need to do this while designing a new API element.

So it would be awesome to be able to take some plain-text code and be able to produce a test docx that runs that code once opened in Windows.

Based on what you sent, I'm thinking what we would need in python-docx (and could probably have in python-pptx as well), is a way to take an arbitrary "black-box" VBA binary and lodge it in the right place in the package, perhaps adding the right relationships and possibly the vbaData.xml. The VBA black-box could be written with one of the tools you pointed toward.
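As a very rough proof-of-concept of the "lodge it in the package" idea, using only the standard library and outside python-docx: the sketch below copies a .docx package, adds the binary as `word/vbaProject.bin`, switches the main part's content type, and adds a relationship from the document part. The function name, the relationship id `rIdVba`, and the choice to skip `vbaData.xml` are all assumptions, and whether Word accepts the result has not been verified here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
DOCX_MAIN = ("application/vnd.openxmlformats-officedocument."
             "wordprocessingml.document.main+xml")
DOCM_MAIN = "application/vnd.ms-word.document.macroEnabled.main+xml"
VBA_CT = "application/vnd.ms-office.vbaProject"
VBA_REL = "http://schemas.microsoft.com/office/2006/relationships/vbaProject"


def _to_xml(root, default_ns):
    # Re-register before each call so the default namespace has no prefix.
    ET.register_namespace("", default_ns)
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


def add_vba_project(docx_bytes, vba_bin):
    """Return new package bytes with `vba_bin` added as word/vbaProject.bin.

    Hypothetical helper. Assumes the package has no existing Default entry
    for ".bin" and no relationship with the id "rIdVba".
    """
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as out:
        for name in src.namelist():
            data = src.read(name)
            if name == "[Content_Types].xml":
                root = ET.fromstring(data)
                # Mark the main document part as macro-enabled.
                for override in root.findall(f"{{{CT_NS}}}Override"):
                    if override.get("ContentType") == DOCX_MAIN:
                        override.set("ContentType", DOCM_MAIN)
                # Declare the content type of the binary part.
                ET.SubElement(root, f"{{{CT_NS}}}Default",
                              Extension="bin", ContentType=VBA_CT)
                data = _to_xml(root, CT_NS)
            elif name == "word/_rels/document.xml.rels":
                root = ET.fromstring(data)
                # Link the document part to the VBA binary.
                ET.SubElement(root, f"{{{REL_NS}}}Relationship", Id="rIdVba",
                              Type=VBA_REL, Target="vbaProject.bin")
                data = _to_xml(root, REL_NS)
            out.writestr(name, data)
        out.writestr("word/vbaProject.bin", vba_bin)
    return out_buf.getvalue()
```

The binary is treated as an opaque blob throughout, which matches the black-box approach: nothing here needs to understand its contents.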

Let us know as you come up with more in your research. If we could get some kind of filthy hack or even hand-assembled file that proves the concept we'll definitely be interested in designing API support for this. We'd probably need a contributor to do the needful, so you might consider if you're willing to play that role for this one, let me know if you are :)

scanny commented 8 years ago

Hi Guys, just a request for some input if you will.

Adding support so you can open and save the .docm format (but not access or manipulate the macro-related parts) would be a relatively modest undertaking.

How much benefit do you see in that?

The likelihood that would be implemented in the "sooner" time frame is a lot higher I expect.

ethack commented 8 years ago

Personally, I don't have a need for that use case.

scanny commented 8 years ago

ok, thanks @ethack :)

JosephHardy91 commented 8 years ago

Yes, this would be helpful to me. My macro is one-size-fits-all, so as long as I have a template file (which I do), opening a docm and saving it with macro contents intact would work perfectly well in my case. And there is also the requirement that it be editable like any docx file.

Thanks

scanny commented 8 years ago

okay, good to know, thanks @blindidiot91 :)

I'll look to scope this one out.

mustard007 commented 7 years ago

Hi! I'm interested in that too.

Thanks!

ozgurpolat commented 7 years ago

Hi,

I am wondering if there is any progress/decision made on this topic. I really need this extension for the project I am working on. We have some template files and they have some macros in them; if we save them as docx then we lose the macros and they are no longer useful to us. It would be very helpful to modify the document (without modifying the macros) as scanny describes. I found this link https://github.com/python-openxml/python-docx/issues/212 where it is hinted that it might be possible to add this feature to a local copy of python-docx, but I am not sure how I can implement this. Any help will be greatly appreciated. Thanks in advance.

scanny commented 7 years ago

Where do you run into a problem? Be specific and include actual error messages with trace back if any.

cucrisis commented 7 years ago

Hi, was this implemented ?

scanny commented 7 years ago

Not yet. Generally each case is closed as it's implemented. No one has stepped up to sponsor this one or contribute it yet.

ozgurpolat commented 7 years ago

Hi scanny,

I am quite happy to volunteer and help you implement this feature because I need it for the project I am working on. But I am an engineer and not a professional programmer, so I have to sit down and learn docx api.

So far I tried this:

  1. I had to find out where docx module was so I imported imp and searched for docx:

         >>> import imp
         >>> imp.find_module("docx")
         (None, "/home/ozgur/miniconda3/envs/dlnd/lib/python3.6/site-packages/docx", ('', '', 5))

  2. Then I looked for 'constants.py' in opc folder, and added the following content type (as described here: https://blogs.msdn.microsoft.com/vsofficedeveloper/2008/05/08/office-2007-file-format-mime-types-for-http-content-streaming-2/): DOCM = ( 'application/vnd.ms-word.document.macroEnabled.12.main+xml' )
  3. Then I went into '__init__.py' in the docx folder and commented out: 'PartFactory.part_type_for[CT.WML_DOCUMENT_MAIN] = DocumentPart'. I added the following line instead (at this point I have no idea what I am doing, I am just trying things out to see what error messages I get): 'PartFactory.part_type_for[CT.DOCM] = DocumentPart'
  4. And then I went into 'api.py' and replaced this line: if document_part.content_type != CT.WML_DOCUMENT_MAIN: with this one: if document_part.content_type != CT.DOCM:

Then I went into python and typed and received the following:

    >>> from docx import Document
    >>> document = Document()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/ozgur/miniconda3/envs/dlnd/lib/python3.6/site-packages/docx/api.py", line 28, in Document
        raise ValueError(tmpl % (docx, document_part.content_type))
    ValueError: file '/home/ozgur/miniconda3/envs/dlnd/lib/python3.6/site-packages/docx/templates/default.docx' is not a Word file, content type is 'application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml'

At this point I realized that there is a 'default.docx' file, which I would also have to replace with a default.docm, and then I gave up. The whole thing doesn't seem to be as simple as '3vocati' describes here: https://github.com/python-openxml/python-docx/issues/212. I realize that I would have to go deeper and really understand the docx API to implement this feature.

scanny commented 7 years ago

@ozgurpolat The place to start is where the error comes from: https://github.com/python-openxml/python-docx/blob/master/docx/api.py#L26

    if document_part.content_type != CT.WML_DOCUMENT_MAIN:
        tmpl = "file '%s' is not a Word file, content type is '%s'"
        raise ValueError(tmpl % (docx, document_part.content_type))

If you change this to:

    openable_content_types = (
        CT.WML_DOCUMENT_MAIN,
        CT.WML_DOCUMENT_MACRO_ENABLED_MAIN
    )
    if document_part.content_type not in openable_content_types:
        tmpl = "file '%s' is not a Word file, content type is '%s'"
        raise ValueError(tmpl % (docx, document_part.content_type))

... you'll get past the first error. The next error is something like CT.WML_DOCUMENT_MACRO_ENABLED_MAIN undefined. So you go into constants.py and add a member to that enumeration that maps to the content type string.

Finally, you need to add a mapping here so a part of this content type gets instantiated using the right (DocumentPart) class. https://github.com/python-openxml/python-docx/blob/master/docx/__init__.py#L29
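How the three changes fit together can be shown with a toy model that runs without python-docx installed. Everything below is a stand-in written for illustration: the two classes, the `part_type_for` dict and the `check_openable` function only mimic the shape of the real code, and the macro-enabled content type string is an assumption.

```python
# 1. constants.py: the new member sits next to the existing one.
WML_DOCUMENT_MAIN = (
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
)
WML_DOCUMENT_MACRO_ENABLED_MAIN = (
    "application/vnd.ms-word.document.macroEnabled.main+xml"
)


class Part:
    """Stand-in for the generic part class."""


class DocumentPart(Part):
    """Stand-in for the class that handles the main document part."""


# 2. __init__.py: both content types map to the same part class.
part_type_for = {
    WML_DOCUMENT_MAIN: DocumentPart,
    WML_DOCUMENT_MACRO_ENABLED_MAIN: DocumentPart,
}

# 3. api.py: accept either content type instead of exactly one.
openable_content_types = (WML_DOCUMENT_MAIN, WML_DOCUMENT_MACRO_ENABLED_MAIN)


def check_openable(path, content_type):
    """Raise the same ValueError as api.py for unsupported content types."""
    if content_type not in openable_content_types:
        tmpl = "file '%s' is not a Word file, content type is '%s'"
        raise ValueError(tmpl % (path, content_type))
    return part_type_for.get(content_type, Part)
```

Without step 2 the file would pass the check but its main part would be built from the generic class, which is why the mapping is needed as well as the relaxed test.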

ozgurpolat commented 7 years ago

@scanny

Thank you very much, your help is highly appreciated, it worked.

I made the changes to api.py described above and,

I added the following to constants.py: WML_DOCUMENT_MACRO_ENABLED_MAIN = ( 'application/vnd.ms-word.document.macroEnabled.main+xml' )

and the following to __init__.py: PartFactory.part_type_for[CT.WML_DOCUMENT_MACRO_ENABLED_MAIN] = DocumentPart

scanny commented 7 years ago

Glad you got it working @ozgurpolat :)

oezgan commented 7 years ago

@ozgurpolat & @scanny Thank you very much 👍

mustard007 commented 7 years ago

Thanks both of you !

So now, we can work with .docm files ?

cucrisis commented 6 years ago

Thanks awesome work

snbby commented 6 years ago

Hi,

According to the comments, this issue seems to be resolved. But there is still an error when I try to open a .docm file:

from docx import Document

document = Document('macro_file.docm')
document.add_heading('Heading, level 1', level=1)
document.save('another_file.docm')

raises

ValueError: file 'macro_file.docm' is not a Word file, content type is 'application/vnd.ms-word.document.macroEnabled.main+xml'

Have I missed something? Or was .docm support not implemented?
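For anyone wanting to confirm what content type a given file declares, independent of python-docx, here is a minimal standard-library sketch. The function name is hypothetical, and it assumes the main part lives at `/word/document.xml` and is declared with an Override entry, which is common but not guaranteed.

```python
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def main_document_content_type(package, partname="/word/document.xml"):
    """Return the content type declared for `partname`, or None.

    `package` is a path or file-like object. Only Override entries in
    [Content_Types].xml are consulted.
    """
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    for override in root.iter(f"{{{CT_NS}}}Override"):
        if override.get("PartName") == partname:
            return override.get("ContentType")
    return None
```

If this returns the macro-enabled content type from the error message above, the file is fine and the check in the installed library is what rejects it.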

benzkji commented 5 years ago

@scanny I would benefit greatly from opening, changing and writing .docm files, without touching any macro code (as I understand it, writing or changing macros is the hard part...). You removed the shortlist label - are there any problems? As I understood, @ozgurpolat already made opening possible? What is still missing to have this feature released?

benzkji commented 5 years ago

Hi all. An update on this issue would be greatly appreciated. I would also be able to sponsor opening/editing/closing without touching the VBA part...if this is still a modest undertaking? @scanny @mustard007 @oezgan @ozgurpolat anyone?

oezgan commented 5 years ago

It has been almost two years; I really had to scratch my head to remember this thread. Sorry, no update from me. I think I converted the macro-enabled document into a non-macro Word document, since I was not interested in the macros. Beyond that, I am afraid I can't help.

benzkji commented 5 years ago

@scanny I'll try to prepare a PR, including @ozgurpolat's approach.

benzkji commented 5 years ago

but if you got no time, I'll let it be, and not fill up the PR list even more :|

benzkji commented 5 years ago

(as I'll find a solution this or that way ;-) thanks for the work, though, great library!