python-openxml / python-docx

Create and modify Word documents with Python
MIT License
4.56k stars 1.12k forks source link

paragraph.text is missing text #1123

Open maya-burnard opened 2 years ago

maya-burnard commented 2 years ago

There's a certain type of comment (not to be confused with the comments stored in comments.xml) where old text is deleted and new text is inserted. screenshot

Extracting this text via

file=docx.Document(filepath)
print(file.paragraphs[0].text)

produces the following output

The civil war literature has substantial gaps. wo primary theoriescivil wars occur when citizens 1) become sufficiently motivated and 2) have the opportunity to rebel.  explain

In this circumstance the correct output should be

The civil war literature has substantial gaps. Two primary theories argue civil wars occur when citizens 1) become sufficiently motivated to commit violence and 2) have the opportunity to rebel. Neither theory explains
icegreentea commented 2 years ago

Assuming that this is from track changes, I think you can try something like:

from docx.oxml.ns import qn

text = []

for elem in doc.paragraphs[0]._element:
    if elem.tag == qn("w:r"):
        text.append(elem.text)
    elif elem.tag == qn("w:ins"):
        for sub_elem in elem:
            if sub_elem.tag == qn("w:r"):
                text.append(sub_elem.text)

text = "".join(text)

Replacing doc.paragraphs[0] appropriately.

The document standard uses <w:del> and <w:ins> as children of <w:p> to do these inline annotations. The current paragraph.text implement just iterates over runs that are directly children of paragraph, and so misses these annotations.

maya-burnard commented 2 years ago

This code works as expected.

An observation: I noticed both the <w:r> under <w:ins> and the <w:r> under <w:p> store their text in <w:t>, while <w:del> stores its text under <w:delText>. So if you're parsing the document.xml directly (with BeautifulSoup or something) you can pull the correct text by searching for <w:t> and joining the results.