Open maya-burnard opened 2 years ago
Assuming that this is from track changes, I think you can try something like:
from docx.oxml.ns import qn
text = []
for elem in doc.paragraphs[0]._element:
if elem.tag == qn("w:r"):
text.append(elem.text)
elif elem.tag == qn("w:ins"):
for sub_elem in elem:
if sub_elem.tag == qn("w:r"):
text.append(sub_elem.text)
text = "".join(text)
Replacing doc.paragraphs[0]
appropriately.
The document standard uses <w:del>
and <w:ins>
as children of <w:p>
to do these inline annotations. The current paragraph.text implement just iterates over runs that are directly children of paragraph, and so misses these annotations.
This code works as expected.
An observation: I noticed both the <w:r>
under <w:ins>
and the <w:r>
under <w:p>
store their text in <w:t>
, while <w:del>
stores its text under <w:delText>
. So if you're parsing the document.xml
directly (with BeautifulSoup or something) you can pull the correct text by searching for <w:t>
and joining the results.
There's a certain type of comment (not to be confused with the comments stored in comments.xml) where old text is deleted and new text is inserted.
Extracting this text via
produces the following output
In this circumstance the correct output should be