python-openxml / python-docx

Create and modify Word documents with Python
MIT License

Unable to get text of MergeFields (sometimes) #1370

Open FreeCoffee21 opened 3 months ago

FreeCoffee21 commented 3 months ago

I'm using python-docx to open a document template and replace placeholders with actual text. Screenshot of template1.docx (the file itself is attached at the bottom):

[Screenshot: 2024-04-08_14-33-12]

The following code sometimes worked, sometimes not:

from docx import Document

def replace_placeholders(document, data):
    def replace_in_paragraph(paragraph):
        if paragraph is not None:
            for field, value in data.items():
                if f"«{field}»" in paragraph.text:
                    paragraph.text = paragraph.text.replace(f"«{field}»", value)

    def replace_in_bullet_list(bullet_list):
        if bullet_list is not None:
            for paragraph in bullet_list:
                if paragraph.text is not None:
                    replace_in_paragraph(paragraph)

    # Replace placeholders in paragraphs
    for paragraph in document.paragraphs:
        replace_in_paragraph(paragraph)

    # Replace placeholders in tables in main document
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    replace_in_paragraph(paragraph)

    # Replace placeholders in bullet lists in main document
    for bullet_list in document.element.xpath("//w:p[w:pPr/w:numPr]"):
        replace_in_bullet_list(bullet_list)

    # Replace placeholders in headers (paragraphs and tables)
    for section in document.sections:
        header = section.header
        if header is not None:
            for paragraph in header.paragraphs:
                replace_in_paragraph(paragraph)
            for table in header.tables:
                for row in table.rows:
                    for cell in row.cells:
                        for paragraph in cell.paragraphs:
                            replace_in_paragraph(paragraph)

data = {
    "Name": "John Doe",
    "Age": "30",
    "Occupation": "Software Engineer",
    "Location": "New York",
    "Csharp": "2 years",
    "Java": "6 years",
    "Company": "Ferdy AB"
}

# Load the Word template
template = Document("template1.docx")

# Replace placeholders with actual data
replace_placeholders(template, data)

# Save the modified Word document
template.save("output1.docx")

The issue is caused by the template file (template1.docx, created and saved with Word Version 2401 from Microsoft 365). For reasons I don't understand, the placeholders, which are fields of type MergeField, are sometimes stored in `<w:fldChar>` tags and sometimes in `<w:fldSimple>` tags in document.xml. In the former case the above code works; in the latter case it does not. Below is an example using the 'Name' placeholder in the numbered list. In the first case the template contains only the placeholder; in the second case it contains the placeholder plus a space character.

Placeholder only (excerpt from document.xml):

      <w:r>
        <w:fldChar w:fldCharType="begin"/>
      </w:r>
      <w:r>
        <w:instrText xml:space="preserve"> MERGEFIELD  Name  \* MERGEFORMAT </w:instrText>
      </w:r>
      <w:r>
        <w:fldChar w:fldCharType="separate"/>
      </w:r>
      <w:r w:rsidR="000C7414">
        <w:rPr>
          <w:noProof/>
        </w:rPr>
        <w:t>«Name»</w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:noProof/>
        </w:rPr>
        <w:fldChar w:fldCharType="end"/>
      </w:r>

Placeholder plus subsequent space (excerpt from document.xml):

      <w:fldSimple w:instr=" MERGEFIELD  Name  \* MERGEFORMAT ">
        <w:r w:rsidR="000C7414">
          <w:rPr>
            <w:noProof/>
          </w:rPr>
          <w:t>«Name»</w:t>
        </w:r>
      </w:fldSimple>
      <w:r w:rsidR="00697A4F">
        <w:rPr>
          <w:noProof/>
        </w:rPr>
        <w:t xml:space="preserve"> </w:t>
      </w:r>
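
To check which of the two storage forms a given paragraph uses, something like the following might help. It is only a minimal sketch using the standard library: `find_merge_fields`, `W_NS` and `MERGEFIELD_RE` are hypothetical names, and it assumes the paragraph is available as a `<w:p>` XML string that declares the `w:` namespace.

```python
import re
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace bound to the "w:" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"

# Picks the field name out of an instruction such as
# " MERGEFIELD  Name  \* MERGEFORMAT ". Quoted names containing spaces
# are not handled by this simple pattern.
MERGEFIELD_RE = re.compile(r"MERGEFIELD\s+(\S+)")


def find_merge_fields(paragraph_xml):
    """Return (storage_form, field_name) pairs found in a <w:p> XML string."""
    root = ET.fromstring(paragraph_xml)
    found = []
    for element in root.iter():
        if element.tag == W + "fldSimple":
            # Simple field: the instruction is an attribute of the wrapper.
            instruction = element.get(W + "instr", "")
            form = "fldSimple"
        elif element.tag == W + "instrText":
            # Complex field: the instruction sits in a run between the
            # "begin" and "separate" fldChar markers.
            instruction = element.text or ""
            form = "fldChar"
        else:
            continue
        match = MERGEFIELD_RE.search(instruction)
        if match:
            found.append((form, match.group(1)))
    return found
```

Run against the two excerpts above (each wrapped in a `<w:p>` element), this should report the form `fldChar` for the first and `fldSimple` for the second.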

I don't know whether that's a bug or a feature. In any case, I have no idea how to work around this. I'm uploading the template that works with the code; if a space is added after the 'Name' placeholder, it no longer works.

template1.docx
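
One possible workaround, offered as an untested sketch and not as python-docx's intended approach: replace the text in every `<w:t>` element below the paragraph, since `paragraph.text` appears to skip runs nested inside `<w:fldSimple>`. The helper name `replace_in_text_nodes` is hypothetical. It assumes each placeholder sits inside a single `<w:t>` element, as in both excerpts above, and it relies on the same namespace-aware `xpath()` call the original code already uses on `document.element`.

```python
def replace_in_text_nodes(paragraph, data):
    """Replace «Field» placeholders in every <w:t> element of a paragraph.

    `paragraph` is expected to be a python-docx Paragraph; `data` maps
    field names to replacement strings. Returns the number of <w:t>
    elements that were changed.
    """
    replaced = 0
    # ".//w:t" matches text elements at any depth, including those in
    # runs wrapped by <w:fldSimple>.
    for t in paragraph._p.xpath(".//w:t"):
        text = t.text or ""
        new_text = text
        for field, value in data.items():
            new_text = new_text.replace(f"«{field}»", value)
        if new_text != text:
            t.text = new_text
            replaced += 1
    return replaced
```

Calling `replace_in_text_nodes(paragraph, data)` in place of `replace_in_paragraph(paragraph)` would leave the field markup itself untouched, so Word may still treat the replaced text as a merge field result and overwrite it if the fields are updated.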