python-openxml / python-docx

Create and modify Word documents with Python
MIT License
4.38k stars, 1.08k forks

Paragraphs/Numbering in Table of Contents Document Regions #1408

Closed polanddm closed 1 month ago

polanddm commented 1 month ago

Great work on this library, very useful.

My work has entailed parsing large MS Word documents organized hierarchically (Heading 1, 2, 3, etc.) and extracting text for further processing, tagging, and use in AI scenarios.

I have run across client documents where (somehow) the Word document has been saved in such a manner that named styles have been lost. This is a surmountable problem by itself, but there is one nasty little impact I have discovered. This problem is not present if "named styles" are available.

As a result, I cannot (easily) detect the difference between a "content" paragraph ("1.1 Blah, lots more words...") and the "Table of Contents" entry for this paragraph ("1.1 Blah words.............. 42"). I have an acceptable workaround for this by testing if the paragraph text has a run of period characters of unusual length ("......."). However, I have run into another interesting problem in the process...
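For anyone landing here with the same problem, the leader-dot workaround described above could be sketched roughly like this. The function name and the threshold of five periods are assumptions chosen for illustration, not anything from python-docx. The tab branch is also an assumption: it covers documents where the leader dots are drawn by a tab stop rather than stored as literal periods.

```python
import re

# Hypothetical heuristic: a TOC entry ends with a leader (a long run of
# periods, or a tab when the dots come from a tab-stop leader) followed
# by a page number. The threshold of 5 periods is an arbitrary choice.
_TOC_TAIL = re.compile(r"(?:\.{5,}|\t)\s*\d+\s*$")


def looks_like_toc_entry(text: str) -> bool:
    """Return True if the paragraph text resembles a TOC line."""
    return bool(_TOC_TAIL.search(text))
```

This is only a text heuristic, so an ordinary paragraph that happens to end in a tab followed by a number would be misclassified.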

In the "Table of Contents" entry above ("1.1 Blah words.............. 42") there appears to be an unusual character between the paragraph number ("1.1 ", including the trailing space) and the text that follows ("Blah words...."). In Word with show/hide turned ON, the character is like a paragraph marker but different: it looks like a backward capital "P". As near as I can figure out from Google, it is (sometimes) called a "pilcrow" and, apparently, is used in Microsoft Word to mark an "indent" that isn't a tab.

The problem, of course, is that when iterating through the doc.paragraphs list, in the "Table of Contents" region, the text "1.1 Blah words.............. 42" is interpreted as two (2) paragraphs: "1.1 ", then "Blah words.............. 42".

This pilcrow/indent character cannot be selected and removed via search and replace as it results in undesirable changes elsewhere to the Word document. I am not at all sure if this weird pilcrow/indent character can be detected and removed so that the visual appearance of the text in the ToC is interpreted in a "natural way" via python-docx, but I wanted to raise the issue.

scanny commented 1 month ago

python-docx does not recognize paragraph boundaries by the presence of a particular character. The XML tells python-docx where those boundaries are. If you're seeing a pilcrow character there it's because that is indeed a separate paragraph. If you inspect the XML I expect you'll see this.
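To illustrate the point about boundaries coming from the XML: in WordprocessingML each paragraph is a `w:p` element, so one visual TOC line stored as two `w:p` elements is two paragraphs. The sketch below uses only the standard library. The sample XML is hand-written for illustration (not taken from the poster's document) and `paragraph_texts` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written sample: one visual TOC line stored as two <w:p> elements.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p><w:r><w:t xml:space="preserve">1.1 </w:t></w:r></w:p>
  <w:p><w:r><w:t>Blah words</w:t></w:r><w:r><w:tab/><w:t>42</w:t></w:r></w:p>
</w:body>"""


def paragraph_texts(body_xml):
    """Return one string per <w:p> element, in document order."""
    body = ET.fromstring(body_xml)
    texts = []
    for p in body.iter(f"{{{W}}}p"):
        parts = []
        for el in p.iter():
            if el.tag == f"{{{W}}}t":      # literal text in a run
                parts.append(el.text or "")
            elif el.tag == f"{{{W}}}tab":  # tab character in a run
                parts.append("\t")
        texts.append("".join(parts))
    return texts
```

Calling `paragraph_texts(SAMPLE)` yields two entries, "1.1 " and "Blah words" + tab + "42", matching the split described above.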

I'm not clear on exactly what you're seeing, probably a small screenshot would help.

If I wanted to skip the TOC if present I would look for the field markers in the XML and remove whatever was in between before iterating the paragraphs in the document.
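A rough sketch of that idea, using only the standard library on the raw document XML. It assumes the TOC is stored as a complex field, meaning `w:fldChar` begin/separate/end markers with a `w:instrText` that starts with "TOC". It skips the paragraphs inside the field rather than removing them. The function name is hypothetical, and a TOC that was flattened to static text would have no field markers and would not be caught.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _q(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W}}}{tag}"


def non_toc_paragraph_texts(document_xml):
    """Return the text of paragraphs that lie outside any TOC field."""
    root = ET.fromstring(document_xml)
    open_fields = []  # one flag per open complex field: is it a TOC?
    kept = []
    for p in root.iter(_q("p")):
        in_toc = any(open_fields)  # a TOC field is still open from earlier
        parts = []
        for el in p.iter():
            if el.tag == _q("fldChar"):
                kind = el.get(_q("fldCharType"))
                if kind == "begin":
                    open_fields.append(False)
                elif kind == "end" and open_fields:
                    open_fields.pop()
            elif el.tag == _q("instrText"):
                code = (el.text or "").strip().upper()
                if open_fields and code.startswith("TOC"):
                    open_fields[-1] = True
            elif el.tag == _q("t"):
                parts.append(el.text or "")
            in_toc = in_toc or any(open_fields)
        if not in_toc:
            kept.append("".join(parts))
    return kept
```

The stack is there because TOC entries usually contain nested fields (PAGEREF, HYPERLINK), so the first "end" marker encountered does not close the TOC itself.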

polanddm commented 1 month ago

Thank you for the quick feedback. Yes, I agree, your recommendation to look at XML field markers to "skip" the TOC is probably the best approach.

Regarding the screenshot, see below. It is kind of weird as the "symbol" is not strictly consistent (which was the somewhat maddening part).

image

At any rate, the problems are surmountable, but kind of make my python code a little messy and less generic. The client documents I am processing are very... shall we say... "diverse" in their mature use of Word styling features.

Frankly (and I know how this sounds), I have wondered if just traversing a "doc.Characters" list would be easier sometimes.

Thanks again!

scanny commented 1 month ago

@polanddm Yeah, that is a little weird looking, but I think that's just a rendering artifact where the text (e.g. "Program Introduction ...") is overlapping a little and obscuring the right-hand side of the regular paragraph mark. If you widen that tab setting a little I think that will reveal the rest of it.

The actual content of the TOC is in the document btw. The TOC feature generates that content and inserts it when the field is refreshed. So you should be able to see what's actually in there by inspecting the XML for it. You can find it by unzipping the .docx file (a DOCX is a zip archive) and then inspecting the word/document.xml file within it.
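A minimal sketch of that inspection step with the standard library, so no manual unzipping is needed. The helper name is hypothetical, and it assumes the main document part sits at the conventional path `word/document.xml`, which is where Word writes it.

```python
import zipfile


def read_document_xml(docx_path):
    """Return the main document part of a .docx as a string.

    `docx_path` may be a filesystem path or a binary file-like object.
    Assumes the part is at the conventional 'word/document.xml'.
    """
    with zipfile.ZipFile(docx_path) as zf:
        return zf.read("word/document.xml").decode("utf-8")
```

Searching the returned string for "TOC" or "fldChar" is a quick way to see whether the table of contents is stored as a field.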

Closing for now as not actionable, but feel free to ask more questions in this issue if you need to.