iulica / docx-mailmerge

Mail merge for Office Open XML (docx) files without the need for Microsoft Office Word.
MIT License

Document is gray when opening specifically for 2 page docs #4

Closed pmar0 closed 1 year ago

pmar0 commented 2 years ago

This issue is from the original docx-mailmerge, but it seems to still be present in this version

Expected Behavior

I make a 2 page document using merge_templates and it opens normally.

Current Behavior

The document generates fine, but it opens to a gray screen. I can get the content to appear by pressing Alt+F9 or clicking the screen a few times. I think Alt+F9 just forces a refresh rather than having any real connection to the issue. As far as I've found, this only happens with documents of exactly 2 pages. I've tested 1 page, 3 pages, and page counts in the 100s, and those all load normally... so weird, haha.

Possible Solution

I'm guessing it's some tweak that needs to happen with the rebuilding of the XML in the merge_templates function. I'm going to look into it a little bit myself.

Steps to Reproduce (for bugs)

1. Create any test data for merge_templates that has 2 items, so it results in 2 pages.
2. Open the 2 page doc and observe the lovely gray screen.
3. Press Alt+F9 to fix it, then press it again to turn the Alt+F9 field code display back off.
4. I used the nextPage_section separator and have now also tested with page_break, with the same result.

Context

Just making some labels and came across this issue. I checked that my added function wasn't causing it, and it wasn't; the current merge_templates has the same issue.

Your Environment

Python version: 3.9.6
docx-mailmerge version: 0.5.0
Microsoft 365 version: 2202

pmar0 commented 2 years ago

For extra info, here's some of the code I used to generate the test data:

import unittest
from os import path

from mailmerge import MailMerge
from tests.utils import EtreeMixin


class MergeTest(EtreeMixin, unittest.TestCase):
    def test_merge(self):
        document = MailMerge(path.join(path.dirname(__file__), 'test_salesignsv2.docx'))

        def test_array(size):
            # Build `size` rows, filling every merge field with 'test<i>'.
            fields = document.get_merge_fields()
            return [{field: f'test{i}' for field in fields} for i in range(size)]

        # Two rows produce the 2 page document that opens to a gray screen.
        tester = test_array(2)
        document.merge_templates(tester, 'nextPage_section')
        document.write('testOutput.docx')

Files used:

Input file: test_salesignsv2.docx
Output file: testOutput.docx

I know it came out a bit different on your end previously, and I assume the same issues will be apparent this time. Very odd issue.

Screenshot:

One last bit, here's a screenshot of what I see when opening the doc. I can fix this and everything will show correctly if I press Alt+F9 to toggle the display of fields, which I guess refreshes the doc:

[image: screenshot of the gray screen shown when opening the document]

iulica commented 2 years ago

Quote from the documentation:

When using this feature, make sure you don't use comments, footnotes, bookmarks, etc. This is because these elements have an id attribute, which must be unique. This library does not handle this, resulting in invalid documents.

Your input document contains pictures whose elements carry id="..." attributes:

    <wp:docPr id="19" name="Picture 19"/>
        <pic:cNvPr id="1" name=""/>
    <wp:docPr id="11" name="Picture 11"/>
        <pic:cNvPr id="3" name="infra_logo_rgb_0.png"/>
    <wp:docPr id="23" name="Picture 23"/>
        <pic:cNvPr id="1" name=""/>
    <wp:docPr id="24" name="Picture 24"/>
        <pic:cNvPr id="3" name="infra_logo_rgb_0.png"/>

If you remove the pictures, it should work fine. You could try including the pictures with a { INCLUDEPICTURE .. } field instead; maybe that works. You may ask why it works with more than 2 pages. Well, it doesn't: I created a 3 page document, and while it displays locally, the online version of Word does not show the pictures. That suggests the problem is with those pictures.

iulica commented 2 years ago

The original source docx already has duplicate id attributes on the pic:cNvPr elements (1, 3, 1, 3, ...), so I assume Word finds those less problematic than duplicate ids on wp:docPr. It does show that the original source XML is not a proper XML file, since id attributes are expected to have unique values. This particular issue could be solved, though, because these ids do not seem to be referenced anywhere else, so new id values could be generated for subsequent pages. I have therefore changed the label from wontfix to enhancement.
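As a rough illustration of generating new id values, here is a minimal sketch that renumbers every wp:docPr id in a document.xml string. It is not the library's implementation: the function name `renumber_docpr_ids` and the regex-based approach are assumptions made for this example, and it assumes the docPr ids are not referenced anywhere else in the package.

```python
import re
from itertools import count

# Captures the text before and after the numeric id of a wp:docPr start tag.
DOCPR_ID_ATTR = re.compile(r'(<wp:docPr\b[^>]*?\bid=")\d+(")')


def renumber_docpr_ids(document_xml, start=1):
    """Return the XML with sequential, unique ids on all wp:docPr elements.

    Hypothetical helper: only the id attribute is rewritten; the name
    attribute (e.g. "Picture 19") and pic:cNvPr ids are left untouched.
    """
    numbers = count(start)
    return DOCPR_ID_ATTR.sub(
        lambda m: f"{m.group(1)}{next(numbers)}{m.group(2)}",
        document_xml,
    )
```

A regex is used here only to keep the sketch short and to leave namespace prefixes exactly as they are; a real fix would more likely operate on the parsed element tree.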

iulica commented 2 years ago

Another possible solution, one that doesn't look like a hack and would work with any document, is to implement a new "separator", named "document", that outputs a separate Word document for each row of data: a copy of the original with that row's data filled in. Something like:

    ...
    document.merge_templates(rows, "new_document")
    document.write("output_{rowno}.docx")

pmar0 commented 2 years ago

Interesting... this document was provided to me by an organization who I assume created it from scratch, so I'm not sure how those IDs got in there like that. I'm going to play around with that a bit and see how it works. I'm also curious how a standard mail merge works around that, as it has never given me issues, so I'll check that as well. Maybe a possible solution is for me to just reinsert those pictures so they have properly unique IDs? Further, I wonder why the issue only pops up for 2 page documents, locally.

In terms of a new doc for every row, that definitely wouldn't be possible. I have thousands of rows that need to get merged into these labels, so that'd be a lot, haha.

The document I have is pretty old, so I'm going to request a new one from the company to see if that has since been updated. Otherwise, I'll mess around with the one I have.

pmar0 commented 2 years ago

So I checked the normally merged doc (done via a normal word mail merge) and it still has those duplicate IDs. Check this out, this is a completed doc from a standard mail merge that I assume will have the same issues that you mentioned: Signs_January2022.docx

pmar0 commented 2 years ago

See attached for a newer sign file... it looks slightly better in terms of duplicates (no longer 1,3,1,3...), as only the 1's are duplicated. Now that I'm thinking about it, I'm guessing the problem is less the duplicates within a page and more the fact that the duplicates span other pages within the same document... am I understanding that correctly? Although it's weird that a normal Word mail merge doesn't do anything about that and things seem to work okay.

May Template_0.docx

iulica commented 2 years ago

So I checked the normally merged doc (done via a normal word mail merge) and it still has those duplicate IDs. Check this out, this is a completed doc from a standard mail merge that I assume will have the same issues that you mentioned: Signs_January2022.docx

Actually no. The problem is with the wp:docPr elements, not the pic:cNvPr ones. The docx generated by Word's mail merge contains unique ids for wp:docPr, so Word evidently generates new elements with new ids and names when merging. For pic:cNvPr it doesn't.

          ....
          <wp:docPr id="90" name="Picture 90"/>
          <wp:docPr id="91" name="Picture 91"/>
          <wp:docPr id="92" name="Picture 92"/>
          <wp:docPr id="93" name="Picture 93"/>
          <wp:docPr id="94" name="Picture 94"/>

pmar0 commented 2 years ago

I see, I overlooked that part where that's the main issue vs the pic:cNvPr. I'll have to play around with this a bit to see if I can't find a fix, at least for a proof of concept and to ensure that's the issue.

pmar0 commented 2 years ago

I was looking for the duplicate docPr id's in the testOutput doc and I couldn't find them. I see that the first page is all out of order, but the next pages continue counting upward and don't seem to overlap at any point. Do you have a certain spot where you found duplicate id's?

Or wherever the issue is..reading back over things, I'm not clear on precisely where the issue lies. Is it because the images on the pages after one are trying to be duplicated and that's not working correctly?

iulica commented 2 years ago

I was looking for the duplicate docPr id's in the testOutput doc and I couldn't find them. I see that the first page is all out of order, but the next pages continue counting upward and don't seem to overlap at any point. Do you have a certain spot where you found duplicate id's?

     unzip -p testOutput.docx word/document.xml | xmllint --format - | grep -E " id=\"" | grep docPr | sort
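
Where unzip and xmllint are not available, roughly the same check can be sketched in Python with only the standard library. This is an illustration rather than part of docx-mailmerge; both function names are made up for this example.

```python
import re
import zipfile
from collections import Counter

# Matches the id attribute value of a wp:docPr start tag.
DOCPR_ID = re.compile(r'<wp:docPr\b[^>]*?\bid="([^"]*)"')


def find_duplicate_docpr_ids(document_xml):
    """Return {id: count} for every wp:docPr id that occurs more than once."""
    counts = Counter(DOCPR_ID.findall(document_xml))
    return {id_: n for id_, n in counts.items() if n > 1}


def duplicate_docpr_ids_in_docx(docx_path):
    """Read word/document.xml straight from the docx archive and check it."""
    with zipfile.ZipFile(docx_path) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    return find_duplicate_docpr_ids(xml)
```

Calling `duplicate_docpr_ids_in_docx("testOutput.docx")` returns an empty dict when every docPr id is unique. Like the shell command, it inspects the archive directly instead of relying on Word's own "save as XML".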

Or wherever the issue is..reading back over things, I'm not clear on precisely where the issue lies. Is it because the images on the pages after one are trying to be duplicated and that's not working correctly?

Yes, exactly. I have a commit with the necessary refactoring for the NEXT fields support.

pmar0 commented 2 years ago

     unzip -p testOutput.docx word/document.xml | xmllint --format - | grep -E " id=\"" | grep docPr | sort

I see... odd. I suppose saving documents as XML files via Word doesn't really produce a proper XML doc for reference. Doing it that way, I see no duplicates, but when I use your command to extract the XML from the archive and read it, I can find the issue.

Yes, exactly. I have a commit with the necessary refactoring for the NEXT fields support.

Ah, awesome! That won't really do anything for the images, though -- right? Or I suppose support to renumber things could be integrated after that.

iulica commented 2 years ago

I see... odd. I suppose saving documents as XML files via Word doesn't really produce a proper XML doc for reference. Doing it that way, I see no duplicates, but when I use your command to extract the XML from the archive and read it, I can find the issue.

When saving the output from Word Mailmerge, Word will not create duplicate id for docPr elements. It works fine. The duplicates are only created using docx-mailmerge.

Ah, awesome! That won't really do anything for the images, though -- right? Or I suppose support to renumber things could be integrated after that.

Yes, exactly, it won't fix the duplicate id issue. But if you create more than 2 pages, you mentioned that word seems to ignore the problem and works fine (at least in windows). So we can leave this open until a solution is found for the duplicate id elements. Not a very high priority I would think. But definitely doable.

pmar0 commented 2 years ago

When saving the output from Word Mailmerge, Word will not create duplicate id for docPr elements. It works fine. The duplicates are only created using docx-mailmerge.

Yeah, I also looked at the file I made with docx-mailmerge. It just seems saving a file as XML via Microsoft word doesn't show a true XML output..probably because it regenerates it in the correct form, I'd guess.

Yes, exactly, it won't fix the duplicate id issue. But if you create more than 2 pages, you mentioned that word seems to ignore the problem and works fine (at least in windows). So we can leave this open until a solution is found for the duplicate id elements. Not a very high priority I would think. But definitely doable.

Yeah, although it's not ideal, as that means document generation will be totally hit or miss. We can leave this open for the issue, then.

iulica commented 2 years ago

When saving the output from Word Mailmerge, Word will not create duplicate id for docPr elements. It works fine. The duplicates are only created using docx-mailmerge.

Yeah, I also looked at the file I made with docx-mailmerge. It just seems saving a file as XML via Microsoft word doesn't show a true XML output..probably because it regenerates it in the correct form, I'd guess.

Ok, I get what you mean. I never saved the XML from Word. I always looked directly at the docx archive with that command.

pmar0 commented 2 years ago

Ok, I get what you mean. I never saved the XML from Word. I always looked directly at the docx archive with that command.

Yeah, that seems to be the best way to do it, as saving it as an XML via word seems to fix all of the problems, haha.

iulica commented 2 years ago

Can you check if the issue is fixed?

pmar0 commented 2 years ago

Looks like that did the job! At least in terms of fixing the pictures on the web version of Microsoft Word.

However, oddly enough, I just checked my local word with a 2 page document to be super sure and it STILL gives a grey screen when opening. That being said, the doc opens perfectly fine on the web version. Very odd! Mind you, I've messed with other 2 page documents and it's not my local word being weird.

Now that I think about it, could it be the way the sections are implemented? Not sure why it'd only pop up for two pages, but when looking over some XML docs, I realized that the section breaks aren't implemented the same as how a standard mail merge does it. They're added in an additional new section rather than being nested within the already existing one. Here's an example:

Normal mail merge

<w:p w14:paraId="5948741C" w14:textId="77777777" w:rsidR="00B13A6A" w:rsidRDefault="00B13A6A" w:rsidP="008532DF">
  <w:pPr>
    <w:ind w:left="258" w:right="258"/>
    <w:rPr>
      <w:rFonts w:ascii="Adobe Garamond Pro" w:hAnsi="Adobe Garamond Pro"/>
      <w:vanish/>
    </w:rPr>
    <w:sectPr w:rsidR="00B13A6A" w:rsidSect="00B13A6A">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="0" w:right="0" w:bottom="0" w:left="0" w:header="0" w:footer="0" w:gutter="0"/>
      <w:pgNumType w:start="1"/>
      <w:cols w:space="720"/>
    </w:sectPr>
  </w:pPr>
</w:p>

docx-mailmerge

<w:p w14:paraId="6F54B7A3" w14:textId="2B943C00" w:rsidR="008532DF" w:rsidRPr="008532DF" w:rsidRDefault="008532DF" w:rsidP="008532DF">
  <w:pPr>
    <w:ind w:left="258" w:right="258"/>
    <w:rPr>
      <w:rFonts w:ascii="Adobe Garamond Pro" w:hAnsi="Adobe Garamond Pro"/>
      <w:vanish/>
    </w:rPr>
  </w:pPr>
</w:p>
<w:p>
  <w:pPr>
    <w:sectPr w:rsidR="008532DF" w:rsidRPr="008532DF" w:rsidSect="00554A03">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="0" w:right="0" w:bottom="0" w:left="0" w:header="0" w:footer="0" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:type w:val="nextPage"/>
    </w:sectPr>
  </w:pPr>
</w:p>

iulica commented 2 years ago

Now that I think about it, could it be the way the sections are implemented? Not sure why it'd only pop up for two pages, but when looking over some XML docs, I realized that the section breaks aren't implemented the same as how a standard mail merge does it. They're added in an additional new section rather than being nested within the already existing one.

New paragraph you mean. Yes, that's how it's done but it shouldn't matter. Word adds the section break to the last existing paragraph, docx-mailmerge adds an empty paragraph at the end with the section break. What I noticed is, word adds a new empty paragraph at the end of the document, after the last section break.

pmar0 commented 2 years ago

New paragraph you mean. Yes, that's how it's done but it shouldn't matter. Word adds the section break to the last existing paragraph, docx-mailmerge adds an empty paragraph at the end with the section break. What I noticed is, word adds a new empty paragraph at the end of the document, after the last section break.

When I said "section" I was referring to the nextPage_section, nextColumn_section, etc. separators. I forget that when dealing with this stuff the word "section" is actually a very specific thing, haha.

Oh? That's interesting..maybe that could be it. Should be easy to add that within the write command, so there's no need to worry about finding the end within the merging functions. The write command actually also blanks out any remaining mergefields as it is, which is nice.

iulica commented 2 years ago

It may be easy, but I'm not sure it is the right way, to overcomplicate things. Since the online version works, it can be, that it is a specific bug in the version you have, and that the newer versions won't have the same problem.

Implementing this would also mean rewriting pretty much all of the original tests, as they will no longer work. If someone wants to take the time to change this working behaviour so that it matches Word perfectly, then it is fine by me, as long as the implementation is clean.

pmar0 commented 2 years ago

It may be easy, but I'm not sure it is the right way, to overcomplicate things. Since the online version works, it can be, that it is a specific bug in the version you have, and that the newer versions won't have the same problem.

Implementing this would also mean rewriting pretty much all of the original tests, as they will no longer work. If someone wants to take the time to change this working behaviour so that it matches Word perfectly, then it is fine by me, as long as the implementation is clean.

I see. But don't you think closer to standard mail merge is what we're looking to achieve? Regardless, I can see it making trouble for tests and the like..especially if it doesn't add much functionality.

I tested it with a standard mail merge of 2 pages and that also has the same issue, so I guess it's somehow an issue with the current version of Word 365. I even updated it the other day when I had the issue and it still persisted..very odd! So as you said, best to not worry about it.

We can probably close this down, but I'll leave it open so you can do whatever you'd like with it before closing.

iulica commented 2 years ago

It may be easy, but I'm not sure it is the right way, to overcomplicate things. Since the online version works, it can be, that it is a specific bug in the version you have, and that the newer versions won't have the same problem. Implementing this would also mean rewriting pretty much all of the original tests, as they will no longer work. If someone wants to take the time to change this working behaviour so that it matches Word perfectly, then it is fine by me, as long as the implementation is clean.

I see. But don't you think closer to standard mail merge is what we're looking to achieve? Regardless, I can see it making trouble for tests and the like..especially if it doesn't add much functionality.

Of course, however the goal is also not to clone Word in all ways. Unfortunately the OpenXML format is IMO not a very good one. It can be translated mostly like "whatever Word does is the standard". So it is not feasible or practical to try to mimic Word's behaviour in all small details.

I tested it with a standard mail merge of 2 pages and that also has the same issue, so I guess it's somehow an issue with the current version of Word 365. I even updated it the other day when I had the issue and it still persisted..very odd! So as you said, best to not worry about it.

We can probably close this down, but I'll leave it open so you can do whatever you'd like with it before closing.

I don't think we should close it, perhaps when more reports of it come around then there is a reason to give it a higher priority.

pmar0 commented 2 years ago

Of course, however the goal is also not to clone Word in all ways. Unfortunately the OpenXML format is IMO not a very good one. It can be translated mostly like "whatever Word does is the standard". So it is not feasible or practical to try to mimic Word's behaviour in all small details.

I understand. As long as it's not really causing any issues, then it should be fine. Just thought it might be worth trying to prevent any potential issues, but I'm not as knowledgeable on that as you.

I don't think we should close it, perhaps when more reports of it come around then there is a reason to give it a higher priority.

Okay, that makes sense. Even though it seems to be a bug with the current build of Word 365, it can always be good to have in case someone else is looking for the same issue.

Thanks again for all of the help..I've learned a lot in messing with this bug and checking out your code and fixes.

iulica commented 2 years ago

I implemented the NEXT field, you can check it out.

pmar0 commented 2 years ago

I implemented the NEXT field, you can check it out.

I pulled in the latest changes and did a merge templates and neither local nor online word would open it. I couldn't find anything obviously wrong with the XML doc, but here's everything attached, for reference.

Word output:

testOutputv2.docx

I can't attach the XML, as GitHub doesn't support it, but I know you can unzip the Word file from the command prompt.

I used the same input data as earlier in this thread, where I generate a few dictionaries with basic test data. It seems to all make it into the XML fine, and the obvious stuff seems to be in place. Is the data format I'm using correct? From the code I read, it looks like it utilizes the same format as the standard merge_templates, just having the NextRecord jump to the next "row" of data.

The only thing I could think of is if something in the table is incorrect. I checked the first couple entries against my fork from the original where I've been playing with NextRecord with a more basic setup to work towards this stuff and they seemed to match up fine, minus yours having some IDs on the runs and mine not; but having those IDs matches up to a normal mailmerge better, so it shouldn't be that.

Let me know what ya see.

iulica commented 2 years ago

Yeah, it was a problem with the Next field. Easy to fix, I have changed the test as well, cause it was too simple to catch this problem. Now it should work.

pmar0 commented 2 years ago

Yeah, it was a problem with the Next field. Easy to fix, I have changed the test as well, cause it was too simple to catch this problem. Now it should work.

Ah duh, now I see. Forgot the obvious check to see if the MergeField elements are still left in the doc, haha.

Looks like it's fully functional, now!

iulica commented 2 years ago

Yeah, the problem was with the last page, when there are no more rows of data. The remaining fields would be filled up at the end, with empty rows of data. But the remaining Next fields would just interrupt the process and so they would remain in the document.
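
To make the described behaviour concrete, here is a toy model of filling one page that contains NEXT fields. It is only a sketch under simplifying assumptions, not the library's code: the page is modelled as a flat list of tokens, and `NEXT` and `fill_page` are hypothetical names.

```python
# Sentinel standing in for a { NEXT } field in the template.
NEXT = object()


def fill_page(template, rows):
    """Fill a page of field names and NEXT markers from successive rows.

    Once the data runs out, remaining fields are filled from an empty row,
    and remaining NEXT markers are still consumed instead of being left
    behind in the output.
    """
    remaining = iter(rows)
    current = next(remaining, {})
    output = []
    for token in template:
        if token is NEXT:
            # Advance to the next row, or to an empty row when none are left.
            current = next(remaining, {})
            continue
        output.append(current.get(token, ""))
    return output
```

With the template `["name", NEXT, "name", NEXT, "name"]` and two rows of data, the third field comes out empty and no NEXT marker survives, which corresponds to the fixed behaviour on the last page.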

iulica commented 2 years ago

Looks like that did the job! At least in terms of fixing the pictures on the web version of Microsoft Word.

However, oddly enough, I just checked my local word with a 2 page document to be super sure and it STILL gives a grey screen when opening. That being said, the doc opens perfectly fine on the web version. Very odd! Mind you, I've messed with other 2 page documents and it's not my local word being weird.

If you can create a very simple document that still has this problem, preferably only with just one field, I could have a look at it. I need the source document, the output from word that works, the output of docx-mailmerge that has the problem and the sample python code so I can test it.

pmar0 commented 2 years ago

Yeah, the problem was with the last page, when there are no more rows of data. The remaining fields would be filled up at the end, with empty rows of data. But the remaining Next fields would just interrupt the process and so they would remain in the document.

Makes sense

pmar0 commented 2 years ago

Looks like that did the job! At least in terms of fixing the pictures on the web version of Microsoft Word. However, oddly enough, I just checked my local word with a 2 page document to be super sure and it STILL gives a grey screen when opening. That being said, the doc opens perfectly fine on the web version. Very odd! Mind you, I've messed with other 2 page documents and it's not my local word being weird.

If you can create a very simple document that still has this problem, preferably only with just one field, I could have a look at it. I need the source document, the output from word that works, the output of docx-mailmerge that has the problem and the sample python code so I can test it.

I'll try to make something in a bit..but I tested it after and found that a two page mail merge done through Microsoft Word even has the gray screen issue, so it wasn't specific to the library as much as the current build of Word 365 I have.