Closed pmar0 closed 1 year ago
For extra info, here's some of the code I used to generate the test info:
import unittest
from os import path
from mailmerge import MailMerge
from tests.utils import EtreeMixin, get_document_body_part

class MergeTest(EtreeMixin, unittest.TestCase):
    def test_merge(self):
        document = MailMerge(path.join(path.dirname(__file__), 'test_salesignsv2.docx'))

        # Build `size` rows, filling every merge field with a placeholder value
        def test_array(size):
            tester = []
            for i in range(0, size):
                test = {}
                for field in document.get_merge_fields():
                    test[field] = f'test{i}'
                tester.append(test)
            return tester

        tester = test_array(2)
        document.merge_templates(tester, 'nextPage_section')
        document.write('testOutput.docx')
Input file: test_salesignsv2.docx Output file: testOutput.docx
I know it came out a bit different on your end previously, and I assume the same issues will be apparent this time. Very odd issue.
One last bit, here's a screenshot of what I see when opening the doc (I can fix this and everything will show correctly if I press Alt+F9 to toggle display of fields, which I guess refreshes the doc):
Quote from the documentation:
When using this feature, make sure you don't use comments, footnotes, bookmarks, etc. This is because these elements have an id attribute, which must be unique. This library does not handle this, resulting in invalid documents.
Your input document contains those pictures that have elements with id="...".
<wp:docPr id="19" name="Picture 19"/>
<pic:cNvPr id="1" name=""/>
<wp:docPr id="11" name="Picture 11"/>
<pic:cNvPr id="3" name="infra_logo_rgb_0.png"/>
<wp:docPr id="23" name="Picture 23"/>
<pic:cNvPr id="1" name=""/>
<wp:docPr id="24" name="Picture 24"/>
<pic:cNvPr id="3" name="infra_logo_rgb_0.png"/>
If you remove the pictures, it should work fine. You could try including the pictures with { INCLUDEPICTURE .. } instead; maybe that works. You may ask why it works with more than 2 pages. Well, it doesn't: I created a 3-page document, and while it displays locally, the pictures are not shown in the online version. That indicates the problem is with those pictures.
The original source docx already has duplicate id attributes on the pic:cNvPr elements (1, 3, 1, 3, ...), so I assume those are less problematic for Word than duplicate ids on wp:docPr. But it shows that the original source XML is not a proper XML file, since in XML the id attribute should have unique values. This particular issue could be solved, as these ids do not seem to be used anywhere else: new id values could be generated for subsequent pages. So I changed it from wontfix to enhancement.
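A minimal sketch of that renumbering idea, assuming the wp:docPr ids really are not referenced anywhere else in the package. The function name `renumber_docpr_ids` is hypothetical and this is not the library's actual fix; it only shows that assigning fresh sequential ids to every wp:docPr element is a small operation on the parsed document XML.

```python
import xml.etree.ElementTree as ET

# Namespace URI behind the wp: prefix (DrawingML wordprocessingDrawing).
WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def renumber_docpr_ids(root, start=1):
    """Give every wp:docPr element under `root` a unique, sequential id.

    Hypothetical helper: only the id attribute is rewritten, the name
    attribute is left untouched. Returns the next unused id.
    """
    next_id = start
    for doc_pr in root.iter(f"{{{WP_NS}}}docPr"):
        doc_pr.set("id", str(next_id))
        next_id += 1
    return next_id


if __name__ == "__main__":
    sample = (
        f'<body xmlns:wp="{WP_NS}">'
        '<wp:docPr id="19" name="Picture 19"/>'
        '<wp:docPr id="19" name="Picture 19"/>'
        "</body>"
    )
    tree_root = ET.fromstring(sample)
    renumber_docpr_ids(tree_root)
    print([el.get("id") for el in tree_root.iter(f"{{{WP_NS}}}docPr")])
```

The pic:cNvPr ids are deliberately left alone here, since Word's own mail merge output keeps those duplicated as well.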
Another possible solution, that doesn't look like a hack and would work with any document, is to implement a new "separator", named "document", that will output a new Word document for each row of data, a copy of the original with the data from the row instead. Something like:
...
document.merge_templates(rows, "new_document")
document.write("output_{rowno}.docx")
Interesting..this document was provided to me by an organization who I assume created it from scratch, so I'm not sure how those IDs got in there like that. I'm going to play around with that a bit and see how that works. I'm also curious as to how a standard mail merge works around that, as that has never given me issues, so I'll check that as well. Maybe a possible solution is me just reinserting those pictures so they have properly unique IDs? Further, I wonder why that issue only pops up for 2 page documents, locally.
In terms of a new doc for every row, that definitely wouldn't be possible. I have thousands of rows that need to get merged into these labels, so that'd be a lot, haha.
The document I have is pretty old, so I'm going to request a new one from the company to see if that has since been updated. Otherwise, I'll mess around with the one I have.
So I checked the normally merged doc (done via a normal word mail merge) and it still has those duplicate IDs. Check this out, this is a completed doc from a standard mail merge that I assume will have the same issues that you mentioned: Signs_January2022.docx
See attached for a newer sign file..it looks slightly better in terms of duplicates (no longer 1,3,1,3...), as it just has the 1's being duped. Now that I'm thinking about it, I'm guessing the problem is less with the duplicates on that page and more with the fact that the duplicates span onto other pages within the same document..am I understanding that correctly? Although it's weird that a normal Word mail merge doesn't do anything about that and things seem to work okay. May Template_0.docx
Actually not, the problem is on the wp:docPr elements and not on the pic:cNvPr. The generated mailmerge docx contains unique ids for wp:docPr. So word just generates new elements with new names automatically when merging. For the pic it doesn't.
....
<wp:docPr id="90" name="Picture 90"/>
<wp:docPr id="91" name="Picture 91"/>
<wp:docPr id="92" name="Picture 92"/>
<wp:docPr id="93" name="Picture 93"/>
<wp:docPr id="94" name="Picture 94"/>
I see, I overlooked that part where that's the main issue vs the pic:cNvPr. I'll have to play around with this a bit to see if I can't find a fix, at least for a proof of concept and to ensure that's the issue.
I was looking for the duplicate docPr id's in the testOutput doc and I couldn't find them. I see that the first page is all out of order, but the next pages continue counting upward and don't seem to overlap at any point. Do you have a certain spot where you found duplicate id's?
Or wherever the issue is..reading back over things, I'm not clear on precisely where the issue lies. Is it because the images on the pages after one are trying to be duplicated and that's not working correctly?
unzip -p testOutput.docx word/document.xml | xmllint --format - | grep -E " id=\"" | grep docPr | sort
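For anyone without unzip or xmllint at hand, the same check can be sketched in Python using only the standard library. The function names below are hypothetical; the only assumptions about the file are the ones the shell pipeline above already makes, namely that the docx is a zip archive whose main part is word/document.xml.

```python
import xml.etree.ElementTree as ET
import zipfile
from collections import Counter

# Namespace URI behind the wp: prefix (DrawingML wordprocessingDrawing).
WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def duplicate_docpr_ids(document_xml):
    """Return the wp:docPr id values that occur more than once, sorted."""
    root = ET.fromstring(document_xml)
    counts = Counter(
        el.get("id")
        for el in root.iter(f"{{{WP_NS}}}docPr")
        if el.get("id") is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


def duplicate_docpr_ids_in_docx(docx_file):
    """Read word/document.xml straight from the docx archive and check it."""
    with zipfile.ZipFile(docx_file) as archive:
        return duplicate_docpr_ids(archive.read("word/document.xml"))


if __name__ == "__main__":
    # File name taken from the thread; replace with your own output file.
    print(duplicate_docpr_ids_in_docx("testOutput.docx"))
```

Reading the archive directly matters here: as noted later in the thread, re-saving the document as XML from Word appears to regenerate the ids and hides the duplicates.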
Yes, exactly. I have a commit with the necessary refactoring for the NEXT fields support.
I see...odd. I suppose saving a document as an XML file via Word doesn't really produce a faithful XML doc for reference. Doing it that way, I see no duplicates, but using your command to extract the XML from the docx and read it, I can find the issue.
Ah, awesome! That won't really do anything for the images, though -- right? Or I suppose support to renumber things could be integrated after that.
When saving the output from Word Mailmerge, Word will not create duplicate id for docPr elements. It works fine. The duplicates are only created using docx-mailmerge.
Yes, exactly, it won't fix the duplicate id issue. But if you create more than 2 pages, you mentioned that word seems to ignore the problem and works fine (at least in windows). So we can leave this open until a solution is found for the duplicate id elements. Not a very high priority I would think. But definitely doable.
Yeah, I also looked at the file I made with docx-mailmerge. It just seems saving a file as XML via Microsoft word doesn't show a true XML output..probably because it regenerates it in the correct form, I'd guess.
Yeah, although it's not ideal, as that means document generation will be totally hit or miss. We can leave this open for the issue, then.
Ok, I get what you mean. I never saved the XML from Word. I always looked directly at the docx archive with that command.
Yeah, that seems to be the best way to do it, as saving it as an XML via word seems to fix all of the problems, haha.
Can you check if the issue is fixed?
Looks like that did the job! At least in terms of fixing the pictures on the web version of Microsoft Word.
However, oddly enough, I just checked my local word with a 2 page document to be super sure and it STILL gives a grey screen when opening. That being said, the doc opens perfectly fine on the web version. Very odd! Mind you, I've messed with other 2 page documents and it's not my local word being weird.
Now that I think about it, could it be the way the sections are implemented? Not sure why it'd only pop up for two pages, but when looking over some XML docs, I realized that the section breaks aren't implemented the same as how a standard mail merge does it. They're added in an additional new section rather than being nested within the already existing one. Here's an example:
<w:p w14:paraId="5948741C" w14:textId="77777777" w:rsidR="00B13A6A" w:rsidRDefault="00B13A6A" w:rsidP="008532DF">
  <w:pPr>
    <w:ind w:left="258" w:right="258"/>
    <w:rPr>
      <w:rFonts w:ascii="Adobe Garamond Pro" w:hAnsi="Adobe Garamond Pro"/>
      <w:vanish/>
    </w:rPr>
    <w:sectPr w:rsidR="00B13A6A" w:rsidSect="00B13A6A">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="0" w:right="0" w:bottom="0" w:left="0" w:header="0" w:footer="0" w:gutter="0"/>
      <w:pgNumType w:start="1"/>
      <w:cols w:space="720"/>
    </w:sectPr>
  </w:pPr>
</w:p>
<w:p w14:paraId="6F54B7A3" w14:textId="2B943C00" w:rsidR="008532DF" w:rsidRPr="008532DF" w:rsidRDefault="008532DF" w:rsidP="008532DF">
  <w:pPr>
    <w:ind w:left="258" w:right="258"/>
    <w:rPr>
      <w:rFonts w:ascii="Adobe Garamond Pro" w:hAnsi="Adobe Garamond Pro"/>
      <w:vanish/>
    </w:rPr>
  </w:pPr>
</w:p>
<w:p>
  <w:pPr>
    <w:sectPr w:rsidR="008532DF" w:rsidRPr="008532DF" w:rsidSect="00554A03">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="0" w:right="0" w:bottom="0" w:left="0" w:header="0" w:footer="0" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:type w:val="nextPage"/>
    </w:sectPr>
  </w:pPr>
</w:p>
New paragraph you mean. Yes, that's how it's done but it shouldn't matter. Word adds the section break to the last existing paragraph, docx-mailmerge adds an empty paragraph at the end with the section break. What I noticed is, word adds a new empty paragraph at the end of the document, after the last section break.
When I said "section" I was referring to the nextPage_section, nextColumn_section, etc. separators. I forget that when dealing with this stuff the word "section" is actually a very specific thing, haha.
Oh? That's interesting..maybe that could be it. Should be easy to add that within the write command, so there's no need to worry about finding the end within the merging functions. The write command actually also blanks out any remaining mergefields as it is, which is nice.
It may be easy, but I'm not sure overcomplicating things is the right way to go. Since the online version works, it may be a bug specific to the version you have, and newer versions may not have the same problem.
Implementing this would also mean rewriting pretty much all of the original tests, as they would no longer work. If someone wants to take the time to change this working behaviour so that it matches Word perfectly, then that is fine by me, as long as the implementation is clean.
I see. But don't you think closer to standard mail merge is what we're looking to achieve? Regardless, I can see it making trouble for tests and the like..especially if it doesn't add much functionality.
I tested it with a standard mail merge of 2 pages and that also has the same issue, so I guess it's somehow an issue with the current version of Word 365. I even updated it the other day when I had the issue and it still persisted..very odd! So as you said, best to not worry about it.
We can probably close this down, but I'll leave it open so you can do whatever you'd like with it before closing.
Of course, but the goal is also not to clone Word in every way. Unfortunately the OpenXML format is IMO not a very good one. It mostly boils down to "whatever Word does is the standard", so it is not feasible or practical to try to mimic Word's behaviour in every small detail.
I don't think we should close it. If more reports of it come in, then there is a reason to give it a higher priority.
I understand. As long as it's not really causing any issues, then it should be fine. Just thought it might be worth trying to prevent any potential issues, but I'm not as knowledgeable on that as you.
Okay, that makes sense. Even though it seems to be a bug with the current build of Word 365, it can always be good to have in case someone else is looking for the same issue.
Thanks again for all of the help..I've learned a lot in messing with this bug and checking out your code and fixes.
I implemented the NEXT field, you can check it out.
I pulled in the latest changes and did a merge templates and neither local nor online word would open it. I couldn't find anything obviously wrong with the XML doc, but here's everything attached, for reference.
testOutputv2.docx I can't attach the XML, as github doesn't support it, but I know you can use command prompt to get the word file unzipped.
I used the same input data as earlier in this thread, where I generate a few dictionaries with basic test data. It seems to all make it into the XML fine, and the obvious stuff seems to be in place. Is the data format I'm using correct? From the code I read, it looks like it utilizes the same format as the standard merge_templates, just having the NextRecord jump to the next "row" of data.
The only thing I could think of is if something in the table is incorrect. I checked the first couple entries against my fork from the original where I've been playing with NextRecord with a more basic setup to work towards this stuff and they seemed to match up fine, minus yours having some IDs on the runs and mine not; but having those IDs matches up to a normal mailmerge better, so it shouldn't be that.
Let me know what ya see.
Yeah, it was a problem with the Next field. Easy to fix, I have changed the test as well, cause it was too simple to catch this problem. Now it should work.
Ah duh, now I see. Forgot the obvious check to see if the MergeField elements are still left in the doc, haha.
Looks like it's fully functional, now!
Yeah, the problem was with the last page, when there are no more rows of data. The remaining fields would be filled up at the end, with empty rows of data. But the remaining Next fields would just interrupt the process and so they would remain in the document.
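The padding idea described here can be illustrated with a small sketch. This is not the library's implementation; `pad_rows_per_page` and its parameters are hypothetical names, and it assumes a template page consumes a fixed number of records (one more than the number of NEXT fields on it). The last page is topped up with empty records so that every field and every NEXT field on it has something to consume.

```python
def pad_rows_per_page(rows, records_per_page, field_names):
    """Group rows into pages, padding the final page with empty records.

    Hypothetical helper: `records_per_page` is how many records one
    template page consumes, `field_names` lists the merge fields that
    an empty record should blank out.
    """
    if records_per_page < 1:
        raise ValueError("records_per_page must be at least 1")

    pages = []
    for start in range(0, len(rows), records_per_page):
        page = list(rows[start:start + records_per_page])
        # Top up a short final page so no field on it is left unmerged.
        while len(page) < records_per_page:
            page.append({name: "" for name in field_names})
        pages.append(page)
    return pages


if __name__ == "__main__":
    data = [{"name": f"test{i}"} for i in range(5)]
    for page in pad_rows_per_page(data, 2, ["name"]):
        print(page)
```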
If you can create a very simple document that still has the gray screen problem, preferably with just one field, I could have a look at it. I need the source document, the output from Word that works, the output of docx-mailmerge that has the problem, and the sample Python code so I can test it.
Makes sense
I'll try to make something in a bit..but I tested it after and found that a two page mail merge done through Microsoft Word even has the gray screen issue, so it wasn't specific to the library as much as the current build of Word 365 I have.
This issue is from the original docx-mailmerge, but it seems to still be present in this version
Expected Behavior
I make a 2 page document using merge_templates and it opens normally.
Current Behavior
Currently, I can generate the document fine and open it to a gray screen. I can then get the screen to come up by pressing Alt+F9 or clicking the screen a few times. I think the alt+f9 just gives it a refresh vs any actual correlation to the issue. This specifically happens with documents of 2 pages only, that I've found. I've tested 1 page, 3 pages, and pages in the 100s and that all loads normally..so weird, haha.
Possible Solution
I'm guessing it's some tweak that needs to happen with the rebuilding of the XML in the merge_templates function. I'm going to look into it a little bit myself.
Steps to Reproduce (for bugs)
1. Create any test data for a merge_templates that has 2 items, so it results in 2 pages
2. Open the 2 page doc and observe the lovely gray screen
3. Press Alt+F9 to fix it, then press again to turn the functionality of Alt+F9 off
I used the separator of nextPage_section and now also just tested with page_break and got the same result.
Context
Just making some labels and came across this issue. Checked to make sure my added function wasn't causing the issue and I found that it wasn't; the current merge_templates has the same issue.
Your Environment
Python version: 3.9.6
docx-mailmerge version: 0.5.0
Microsoft 365 version 2202