python-openxml / python-docx

Create and modify Word documents with Python
MIT License

After searching and replacing text the text font size of the replaced text is set to 12pt #409

Open the-vampiire opened 7 years ago

the-vampiire commented 7 years ago

I am using docx and regex to find and replace certain keywords in a Word document. The find-and-replace itself works fine, but the replaced text sometimes ends up in a much smaller font than before. The behavior is inconsistent: in some sections the replaced text keeps the correct font size (matching the previous text).

I have not been able to find anything on this topic. Can anyone explain if this is expected behavior and if so how I can correct it?

I have tried inspecting the font size of the text before and after the substitution. According to the logs they are the same size but in the actual saved document they are not.

all fonts = Times New Roman

Original [docx] File:

Saved [docx] File:

Logs from Python:


  'after: \t\t        Pat  : 152400',

  'before:                                          courseCity, courseState : None',

  'after:                                          San Francisco, courseState : None',

  'before:                                          San Francisco, courseState : None',

  'after:                                          San Francisco, CA : None',

  'before:     \t\t\t       courseDate : 152400',

  'after:     \t\t\t       June 39, 2023 : 152400',

and the code itself (the log points are the two print calls in findAndReplace):


import json
import re
import sys
import time

import docx

#--- input from server ---#

task = sys.argv[1]

# parse as JSON in python to enable dict-like action on the student objects contained in the students array
students = json.loads(sys.argv[2])

#--- current date ---#

currentDate = time.strftime("%B %d, %Y")

## -------------- CORE FUNCTIONS ------------ ##

def findAndReplace(student, body):
    if 'currentDate' in body.text:
        find = re.compile("currentDate")
        body.text = find.sub(currentDate, body.text)
    for studentInfo in student:
        if studentInfo in body.text:
            find = re.compile(studentInfo)
            print('before: ' + str(body.text) + ' : ' + str(body.style.font.size))
            body.text = find.sub(student[studentInfo], body.text)
            print('after: ' + str(body.text) + ' : ' + str(body.style.font.size))

def newDoc(task):
    for student in students:

        # open a new template document for each student to prevent overlap
        document = docx.Document("./AutomationTemplates/" + task + "_template.docx")

        # check paragraph text
        for paragraph in document.paragraphs:
            findAndReplace(student, paragraph)

        # check table cells
        for table in document.tables:
            for row in table.rows:
                for cell in row.cells:
                    findAndReplace(student, cell)

        # save the document when finished
        document.save("./AutomationResults/"+ task + "/" + student["studentName"] + "_" + task + ".docx")

    print(task + "s created in /AutomationResults/" + task)

## ------ CALL THE NEW DOC PASSING THE AUTOMATION TASK -------- ##

newDoc(task)

the-vampiire commented 7 years ago

The inconsistency shows up when I compare this with another file I used for testing the automation. In that file, all of the replaced text kept the correct font size and family.

scanny commented 7 years ago

Using the Paragraph.text property to replace text is convenient, but a bit of a brute force method. All the text formatting is specified at the run level, and it all gets nuked when you assign to Paragraph.text because that call removes all the existing runs before adding a single new one containing the assigned text.
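
This also suggests why the logs in the question show the same size before and after: `body.style.font.size` reads the paragraph style, not the runs. A minimal sketch of a run-level check follows. `describe_runs` and `log_runs` are hypothetical helper names, and the code relies only on the `runs`, `text`, and `font.size` attributes that python-docx paragraphs and runs expose. A size of `None` means the run inherits its size from the style hierarchy.

```python
def describe_runs(paragraph):
    """Return (text, font size) for each run in a paragraph-like object.

    Works with any object exposing `.runs`, where each run has `.text`
    and `.font.size` (as python-docx Paragraph and Run objects do).
    A size of None means the run inherits its size from a style.
    """
    return [(run.text, run.font.size) for run in paragraph.runs]


def log_runs(label, paragraph):
    """Print one line per run so explicit run-level sizes are visible."""
    for index, (text, size) in enumerate(describe_runs(paragraph)):
        print(f"{label} run {index}: {text!r} size={size}")
```

If the template's runs carry explicit sizes, calling `log_runs("before", paragraph)` and `log_runs("after", paragraph)` around the `paragraph.text = ...` assignment would be expected to show several sized runs collapsing into a single run whose size is `None`.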

It's a pretty hard problem in the fully general case, but what I usually do that works well almost all the time is to remove all the runs in the paragraph except the first one, and set its text to what I want. Generally, the first run is formatted the way you want, in my experience at least.

def set_cell_text_while_retaining_text_formatting(table_cell, text):
    # ---python-docx cells expose paragraphs directly (text_frame is python-pptx API)---
    runs = table_cell.paragraphs[0].runs
    # ---replace text of first run with new cell value---
    runs[0].text = text
    # ---delete all remaining runs---
    for run in runs[1:]:
        r = run._element
        r.getparent().remove(r)

the-vampiire commented 7 years ago

Thank you @scanny, I will have to try this function. I am a bit confused, though: I came across this repo, which says it is now merged with python-openxml. If that is the case, then are these features available? I don't see anywhere in the documentation how to use them.

https://github.com/mikemaccana/python-docx

Editing documents

Thanks to the awesomeness of the lxml module, we can:

Search and replace

Extract plain text of document

Add and delete items anywhere within the document

Change document properties

Run xpath queries against particular locations in the document - useful for retrieving data from user-completed templates.

scanny commented 7 years ago

That repo is the legacy version of python-docx, version 0.2. It was rewritten from the ground up for various reasons; a big one was to make it object oriented. None of the original code survived and the API is completely different. There are one or two things it tried to do that we haven't implemented yet in this version, search and replace being one of them. However, I think you'll find it didn't really work in that earlier version, except for perhaps some very narrow use cases.

pylang commented 5 years ago

I had a similar issue when trying to substitute parts of text found via regex. My template document was font size 10, but somehow the replaced text got set to font size 11.

To preserve the original size, here is an adapted solution for paragraphs (not tables, as seen above). While iterating document.paragraphs, replace the paragraph.text = text assignment with a call to this function.

def set_text_preserving_text_formatting(paragraph, text):
    """Return None; remove all but the first run object from a paragraph."""
    # Replace the text of the paragraph's first run
    runs = paragraph.runs
    if not runs:
        return

    runs[0].text = text

    # Delete all remaining runs
    for run in runs[1:]:
        r = run._element
        r.getparent().remove(r)

Note, I have not tested this extensively. @scanny Thank you. Your code saved the day.
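
For the loop in the original question, the helper above could be wired in roughly as follows. This is a sketch rather than a tested recipe: `set_text_keeping_first_run` restates the helper so the block is self-contained, `replace_keywords` and `replace_in_document` are hypothetical names, and `replacements` stands in for the question's `student` dict. Keys are passed through `re.escape` on the assumption that they are literal placeholders rather than patterns. Table cells are handled through `cell.paragraphs`, since the runs live in the cell's paragraphs.

```python
import re


def set_text_keeping_first_run(paragraph, text):
    """Put `text` in the first run and remove all other runs."""
    runs = paragraph.runs
    if not runs:
        return
    runs[0].text = text
    for run in runs[1:]:
        r = run._element
        r.getparent().remove(r)


def replace_keywords(paragraph, replacements):
    """Substitute each literal key in `replacements` within one paragraph."""
    original = paragraph.text
    updated = original
    for key, value in replacements.items():
        # A function replacement avoids backslash processing in the value.
        updated = re.sub(re.escape(key), lambda _m, v=str(value): v, updated)
    # Leave unmatched paragraphs alone so their runs stay intact.
    if updated != original:
        set_text_keeping_first_run(paragraph, updated)


def replace_in_document(document, replacements):
    """Apply replace_keywords to body paragraphs and to table cells."""
    for paragraph in document.paragraphs:
        replace_keywords(paragraph, replacements)
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    replace_keywords(paragraph, replacements)
```

With python-docx, this would be called as `replace_in_document(docx.Document(path), student)` before saving. The whole paragraph still ends up with the first run's formatting, so mixed formatting inside a paragraph is lost.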

scanny commented 3 years ago

This function can reliably replace a certain word or phrase with another in a paragraph, retaining the formatting of the original word: https://github.com/python-openxml/python-docx/issues/30#issuecomment-879593691
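
The general idea behind a run-aware replacement of this kind can be sketched as follows. This is an illustrative outline written for this thread, not the code from the linked comment: `replace_in_run_texts` and `replace_in_paragraph` are hypothetical names, and only the first occurrence is replaced. The replacement goes into the run where the match starts, so it picks up that run's formatting. Any later runs the match spills into are trimmed rather than deleted.

```python
def replace_in_run_texts(run_texts, old, new):
    """Replace the first occurrence of `old`, which may span several runs.

    Takes and returns a list of strings, one per run. The list length
    never changes, so each string can be written back to its own run.
    """
    if not old:
        return list(run_texts)
    full_text = "".join(run_texts)
    start = full_text.find(old)
    if start == -1:
        return list(run_texts)
    end = start + len(old)

    result = []
    offset = 0  # position of the current run within full_text
    for text in run_texts:
        run_start, run_end = offset, offset + len(text)
        offset = run_end
        if run_end <= start or run_start >= end:
            # Run lies entirely outside the match: keep it unchanged.
            result.append(text)
            continue
        # Keep whatever this run holds before and after the matched span.
        before = text[: max(start - run_start, 0)]
        after = text[end - run_start:]
        contains_start = run_start <= start < run_end
        result.append(before + (new if contains_start else "") + after)
    return result


def replace_in_paragraph(paragraph, old, new):
    """Write the adjusted texts back to a paragraph-like object's runs."""
    runs = paragraph.runs
    new_texts = replace_in_run_texts([run.text for run in runs], old, new)
    for run, text in zip(runs, new_texts):
        if run.text != text:  # only touch runs that actually changed
            run.text = text
```

For example, run texts `["Dear course", "City", ", welcome"]` with `old="courseCity"` and `new="San Francisco"` become `["Dear San Francisco", "", ", welcome"]`: the first run keeps its formatting and the third run is untouched.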