python-openxml / python-docx

Create and modify Word documents with Python
MIT License

isolate_run() #980

Open scanny opened 3 years ago

scanny commented 3 years ago

This is some code I developed to answer this SO question.

You give it a character-position range in a paragraph and it does the needful to isolate that range of characters into its own single run having the same character formatting as the run in which the range starts. If you don't change the text of the paragraph between calls, it can be called repeatedly with different ranges to isolate multiple ranges, like the multiple matches produced by re.Pattern.finditer().
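
As a rough illustration of where such ranges might come from, the sketch below collects `(start, end)` pairs from regex matches using only the standard-library `re` module. The sample text and pattern are made up for the example and are not from the original post.

```python
import re

# Hypothetical paragraph text and pattern, chosen only for illustration.
text = "The quick brown fox jumps over the lazy dog"
pattern = re.compile(r"\b\w{5}\b")  # five-letter words

# Each match object yields a (start, end) pair in Python slice notation.
ranges = [(m.start(), m.end()) for m in pattern.finditer(text)]

for start, end in ranges:
    # With python-docx, this is where `isolate_run(paragraph, start, end)`
    # would be called for each match.
    print(start, end, repr(text[start:end]))
```

Isolating a range only moves run boundaries and leaves the paragraph text as it was, so offsets computed up front like this should stay valid across calls.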

I'm not sure what will become of it but it was more work than I originally guessed so I want to keep it around for future reference.

import copy
import itertools

from docx.text.run import Run

def isolate_run(paragraph, start, end):
    """Return docx.text.run.Run object containing only `paragraph.text[start:end]`.

    Runs are split as required to produce a new run at the `start` that ends at `end`.
    Runs are unchanged if the indicated range of text already occupies its own run. The
    resulting run object is returned.

    `start` and `end` are as in Python slice notation. For example, the first three
    characters of the paragraph have (start, end) of (0, 3). `end` is *not* the index of
    the last character. These correspond to `match.start()` and `match.end()` of a regex
    match object and `s[start:end]` in Python slice notation.
    """
    rs = tuple(paragraph._p.r_lst)

    def advance_to_run_containing_start(start, end):
        """Return (r, r_idx, start, end) for the start run, with adjusted offsets.

        The start run is the run the `start` offset occurs in. The returned `start` and
        `end` values are adjusted to be relative to the start of `r_idx`.
        """
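        # --- add 0 at end so `r_ends[-1] == 0`; that makes `r_ends[r_idx - 1]` below
        # --- evaluate to 0 when the start run is the first run (`r_idx == 0`) ---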
        r_ends = tuple(itertools.accumulate(len(r.text) for r in rs)) + (0,)
        r_idx = 0
        while start >= r_ends[r_idx]:
            r_idx += 1
        skipped_rs_offset = r_ends[r_idx - 1]
        return rs[r_idx], r_idx, start - skipped_rs_offset, end - skipped_rs_offset

    def split_off_prefix(r, start, end):
        """Return adjusted `end` after splitting prefix off into separate run.

        Does nothing if `r` is already the start of the isolated run.
        """
        if start > 0:
            prefix_r = copy.deepcopy(r)
            r.addprevious(prefix_r)
            r.text = r.text[start:]
            prefix_r.text = prefix_r.text[:start]
        return end - start

    def split_off_suffix(r, end):
        """Split `r` at `end` such that suffix is in separate following run."""
        suffix_r = copy.deepcopy(r)
        r.addnext(suffix_r)
        r.text = r.text[:end]
        suffix_r.text = suffix_r.text[end:]

    def lengthen_run(r, r_idx, end):
        """Add prefixes of following runs to `r` until `end` is reached."""
        while len(r.text) < end:
            suffix_len_reqd = end - len(r.text)
            r_idx += 1
            next_r = rs[r_idx]
            if len(next_r.text) <= suffix_len_reqd:
                # --- subsume next run ---
                r.text = r.text + next_r.text
                next_r.getparent().remove(next_r)
                continue
            if len(next_r.text) > suffix_len_reqd:
                # --- take prefix from next run ---
                r.text = r.text + next_r.text[:suffix_len_reqd]
                next_r.text = next_r.text[suffix_len_reqd:]

    # --- 1. skip over any runs before the one containing the start of our range ---
    r, r_idx, start, end = advance_to_run_containing_start(start, end)

    # --- 2. split first run where our range starts, placing "prefix" to our range
    # ---    in a new run inserted just before this run. After this, our run will begin
    # ---    at the right point and the left-hand side of our work is done.
    end = split_off_prefix(r, start, end)

    # --- 3. if run is longer than isolation-range we need to split-off a suffix run ---
    if len(r.text) > end:
        split_off_suffix(r, end)

    # --- 4. But if our run is shorter than the desired isolation-range we need to
    # ---    lengthen it by taking text from subsequent runs
    elif len(r.text) < end:
        lengthen_run(r, r_idx, end)

    # --- if neither 3 nor 4 apply it's because the run already ends in the right place
    # --- and there's no further work to be done.

    return Run(r, paragraph)

ymmeng commented 3 years ago

I want to set each character in each paragraph of the document as a separate "run". What parameters should I pass in? After trying for a long time, I always have a bunch of strange results:

demo.docx:

ABCD
EFGA

code:

doc = docx.Document('demo.docx')
for par in doc.paragraphs:
    isolate_run(par, 0, len(par.runs))
    for run in par.runs:
        print(run.text)

print:

A
BCD
E
FGA

ymmeng commented 3 years ago

What I expect it to print:

A
B
C
D
E
F
G
A

scanny commented 3 years ago

If you want each run to be one character long you can use something like:

for start in range(len(paragraph.text)):
    end = start + 1
    isolate_run(paragraph, start, end)

The len(par.runs) that appears in your code is the number of runs in the paragraph, which just doesn't have anything to do with what you're trying to do.
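
A rough way to see what that loop does to the run boundaries, without needing a .docx file, is to model a paragraph as a plain list of run texts. The helper `isolate_span` below is a hypothetical stand-in written for this illustration, not python-docx code: it ignores formatting, drops empty runs, and only shows where the boundaries end up.

```python
def isolate_span(run_texts, start, end):
    """Return (new_run_texts, index) with text[start:end] alone in one run.

    Simplified model only: runs are plain strings and formatting is ignored.
    """
    text = "".join(run_texts)
    if not 0 <= start < end <= len(text):
        raise ValueError("range must be non-empty and lie inside the text")

    new_runs, isolated_idx, pos = [], None, 0
    for run_text in run_texts:
        run_start = pos          # offset of this run within the whole text
        pos += len(run_text)     # offset just past this run
        before = run_text[: max(0, start - run_start)]  # part left of the range
        after = run_text[max(0, end - run_start):]      # part right of the range
        if before:
            new_runs.append(before)
        # emit the isolated run once, at the run in which the range starts
        if run_start <= start < pos:
            isolated_idx = len(new_runs)
            new_runs.append(text[start:end])
        if after:
            new_runs.append(after)
    return new_runs, isolated_idx


# One character per run, mirroring the loop above on a single run "ABCD".
runs = ["ABCD"]
for start in range(len("".join(runs))):
    runs, _ = isolate_span(runs, start, start + 1)
print(runs)  # ['A', 'B', 'C', 'D']
```

The range is always expressed in characters of the paragraph text, never in numbers of runs, which is why `len(par.runs)` does not help here.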

The way to think about start and end is like this:


"""
paragraph.text: "ABCDE"

      A   B   C   D   E 
        |           |
    start           end
"""
>>> start, end = 1, 4
>>> run = isolate_run(paragraph, start, end)
>>> run.text
'BCD'

ymmeng commented 3 years ago

Thank you very much for your patience. It will be of great help to me. Thank you!

tranduchuy682 commented 2 years ago

Thanks, @scanny. That's awesome.

itslily88 commented 1 year ago

I am having trouble understanding the return Run(r, paragraph).

NameError: name 'Run' is not defined

I have a word document that is already created through different checkboxes selected by the user. Those check boxes get text from plain text files to input into the word document. I would like to add styles to certain runs that I use a series of hyphens to signal. For example, the word document may look like this:

------Details about the incident

Here are the events that detail the specific incident in question. below are the listed sub categories

---This is sub category 1

details about this section

---This is sub category 2

details about this section

I would like any line with "------" to be bold and a larger font, and any line with "---" to just be bold.

https://github.com/python-openxml/python-docx/issues/30#issuecomment-879593691 works great for simply replacing the text (removing the ------ but leaving the text), but any formatting on run sets everything to the same formatting.

I would assume isolate_run() would work for me, but I cannot get past the return Run(r, paragraph) to even walk through how to make it do what I need.

Here is how I am calling paragraph_replace_text(doc, '------'):

import re

def formatting(document, oldText):
    oldTextLength = len(oldText)
    for oldPara in document.paragraphs:
        if oldText in oldPara.text:
            paraText = oldPara.text
            for line in paraText.splitlines():
                if oldText in line:
                    newText = line[oldTextLength:]
                    # escape the literal text so regex metacharacters in it match literally
                    pattern = re.compile(re.escape(f'{oldText}{newText}'))
                    paragraph_replace_text(oldPara, pattern, newText)

My thought is to call isolate_run inside paragraph_replace_text after the '------' is removed, passing the start and end in as arguments, but I can't get it to run past the return Run(r, paragraph) to try.

Any help would be appreciated
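
The NameError suggests `Run` was never imported in the script: in python-docx the class is available via `from docx.text.run import Run`, and the helper functions also rely on `copy` from the standard library. For the marker-based styling itself, one possible starting point is sketched below. It is pure Python and only plans the work; the names `MARKERS`, `classify_line` and `plan_formatting` are hypothetical, and it assumes lines inside a paragraph are separated by "\n".

```python
# Markers from the question, longest first so "------" is not mistaken for "---".
MARKERS = (("------", "heading"), ("---", "subheading"))


def classify_line(line):
    """Return (kind, text), where `text` is `line` with any leading marker removed."""
    for marker, kind in MARKERS:
        if line.startswith(marker):
            return kind, line[len(marker):]
    return "body", line


def plan_formatting(paragraph_text):
    """Return (clean_text, ranges); each range is (kind, start, end) in clean_text."""
    clean_lines, ranges, pos = [], [], 0
    for line in paragraph_text.split("\n"):
        kind, text = classify_line(line)
        if kind != "body" and text:
            ranges.append((kind, pos, pos + len(text)))
        clean_lines.append(text)
        pos += len(text) + 1  # +1 for the "\n" that joins the lines
    return "\n".join(clean_lines), ranges


sample = "------Details about the incident\nHere are the events\n---This is sub category 1"
clean, ranges = plan_formatting(sample)
for kind, start, end in ranges:
    # Once the markers are removed from the paragraph, each (start, end) could be
    # passed to isolate_run() and the returned run styled, e.g. run.bold = True.
    print(kind, repr(clean[start:end]))
```

The ranges refer to the text after the markers have been stripped, so they would only line up with the paragraph once the replacement step has run.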