python-openxml / python-docx

Create and modify Word documents with Python
MIT License

Python-docx detects a single paragraph with a single run with a single style -- when it should detect multiple paragraphs each having their own style #1413

Closed: philCryoport closed this issue 2 weeks ago

philCryoport commented 3 weeks ago

Within a table in a .docx file, here's the content as seen in Word: [image]

Exported docx to mht, here's the relevant section:

 <p class=MsoNormal style='margin-top:3.0pt;margin-right:0in;margin-bottom:3.0pt;margin-left:0in'>
     <span style='font-size:10.0pt;mso-bidi-font-family:Arial;color:black;mso-color-alt:windowtext'>Paragraph without a bullet</span>
     <span style='font-size:10.0pt;mso-bidi-font-family:Arial'><o:p></o:p></span>
 </p>

 <p class=MsoListParagraphCxSpFirst style='margin-top:3.0pt;margin-right: 0in;margin-bottom:3.0pt;margin-left:.5in;mso-add-space:auto;text-indent:-.25in;mso-list:l237 level1 lfo33'>
     <![if !supportLists]>
        <span style='font-size:10.0pt;font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family:Symbol'>
            <span style='mso-list:Ignore'>·
                <span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span>
            </span>
        </span>
     <![endif]>
     <span style='font-size:10.0pt;mso-bidi-font-family:Arial;color:black;mso-color-alt:windowtext'>Bulleted entry immediately below it</span>
     <span style='font-size:10.0pt;mso-bidi-font-family:Arial'><o:p></o:p></span>
 </p>

I retrieve the row of the table, grab the second cell (where this content is located) and attempt to retrieve the text.

When I look at the content in Word -- and when I look at the above HTML -- I would expect python-docx to interpret it as either:

(a) multiple paragraphs (one with paragraph style "Normal", the other with paragraph style "List Paragraph"), or
(b) a single paragraph with multiple runs (one with run style "Normal", the other with run style "List Paragraph")

NOPE

Python-docx sees a single paragraph with a single run -- with style "Normal"

Here, let me show you.

Is it multiple paragraphs?

def get_text(row):
    # Print the style and text of every paragraph in the row's second cell.
    for paragraph_counter, paragraph in enumerate(row.cells[1].paragraphs):
        print("paragraph counter: " + str(paragraph_counter))
        print("paragraph style: " + paragraph.style.name)
        print("paragraph text:\n[BEGIN]\n" + paragraph.text + "\n[END]\n")

NOPE: Python-docx only sees a single paragraph. Here's the output:

paragraph counter: 0
paragraph style: Normal
paragraph text:
[BEGIN]
Paragraph without a bullet.
Bulleted entry immediately below it
[END]

Is it a single paragraph with multiple runs?

def get_text(row):
    # Print the style and text of every run in each paragraph of the second cell.
    for paragraph in row.cells[1].paragraphs:
        for run_counter, run in enumerate(paragraph.runs):
            print("run counter: " + str(run_counter))
            print("run style: " + run.style.name)
            print("run text:\n[BEGIN]\n" + run.text + "\n[END]\n")

NOPE: Python-docx only sees a single run within the single paragraph. Here's the output:

run counter: 0
run style: Default Paragraph Font
run text: 
[BEGIN]
Paragraph without a bullet.
Bulleted entry immediately below it
[END]

HELP!? What am I doing wrong? How do I get python-docx to detect this content as two separate sets of text?

First: "Paragraph without a bullet." with style "Normal"
Second: "Bulleted entry immediately below it" with style "List Paragraph"

scanny commented 3 weeks ago

Post the output of:

print(row.cells[1]._tc.xml)

and let's have a look at the underlying XML.

philCryoport commented 3 weeks ago

Hi @scanny

<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:oel="http://schemas.microsoft.com/office/2019/extlst" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml" xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" 
xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
  <w:tcPr>
    <w:tcW w:w="3509" w:type="dxa"/>
    <w:gridSpan w:val="2"/>
    <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF" w:themeFill="background1"/>
    <w:vAlign w:val="center"/>
  </w:tcPr>
  <w:p>
    <w:r>
      <w:t>Paragraph without a bullet.</w:t>
      <w:br/>
      <w:t>Bulleted entry immediately below it</w:t>
   </w:r>
  </w:p>
</w:tc>

IDGI -- Word says the style for the "Bulleted entry immediately below it" is: [image]

It's also there in the MHT file as <p class=MsoListParagraphCxSpFirst

So why doesn't python-docx see it?


DIGGING MORE: I unzipped the .docx file and looked at word/document.xml.

The non-bulleted entry has no declared style.

The bulleted entry declares ListParagraph as the style.

<w:tc>
                    <w:tcPr>
                        <w:tcW w:type="dxa" w:w="3509"/>
                        <w:gridSpan w:val="2"/>
                        <w:shd w:color="auto" w:fill="FFFFFF" w:themeFill="background1" w:val="clear"/>
                        <w:vAlign w:val="center"/>
                    </w:tcPr>
                    <w:p w14:paraId="70F0B702" w14:textId="5C8CEF2A" w:rsidP="00F472AD" w:rsidR="005E24FF" w:rsidRDefault="005E24FF" w:rsidRPr="00755C8B">
                        <w:pPr>                  <!-- NO STYLE DECLARED FOR THE NON-BULLETED TEXT ?!?-->
                            <w:spacing w:after="60"/>
                            <w:rPr>
                                <w:rFonts w:cs="Arial"/>
                                <w:sz w:val="20"/>
                                <w:szCs w:val="20"/>
                            </w:rPr>
                        </w:pPr>
                        <w:r w:rsidRPr="00755C8B">
                            <w:rPr>
                                <w:rFonts w:cs="Arial"/>
                                <w:sz w:val="20"/>
                                <w:szCs w:val="20"/>
                            </w:rPr>
                            <w:t>Paragraph without a bullet.</w:t>
                        </w:r>
                    </w:p>
                    <w:p w14:paraId="5DA9B63F" w14:textId="77777777" w:rsidP="00AB4472" w:rsidR="005E24FF" w:rsidRDefault="005E24FF" w:rsidRPr="00755C8B">
                        <w:pPr>
                            <w:pStyle w:val="ListParagraph"/> <!-- HERE IS THE LIST PARAGRAPH STYLE-->
                            <w:numPr>
                                <w:ilvl w:val="0"/>
                                <w:numId w:val="61"/>
                            </w:numPr>
                            <w:spacing w:after="60"/>
                            <w:rPr>
                                <w:rFonts w:cs="Arial"/>
                                <w:sz w:val="20"/>
                                <w:szCs w:val="20"/>
                            </w:rPr>
                        </w:pPr>
                        <w:r w:rsidRPr="00755C8B">
                            <w:rPr>
                                <w:rFonts w:cs="Arial"/>
                                <w:sz w:val="20"/>
                                <w:szCs w:val="20"/>
                            </w:rPr>
                            <w:t>Bulleted entry immediately below it</w:t>
                        </w:r>
                    </w:p>
            </w:tc>
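The same check can be scripted rather than done by hand. Below is a minimal sketch using only the Python standard library; the function names `paragraph_styles` and `paragraph_styles_in_docx` are made up for illustration, and `SAMPLE` is a trimmed copy of the cell above. Note that the XML stores the style ID ("ListParagraph"), whereas python-docx reports the style name ("List Paragraph").

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as declared on the <w:tc> element above.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_styles(xml_source):
    """Return (style_id, text) for every <w:p> found in the XML.

    style_id is None when the paragraph declares no <w:pStyle>.
    """
    root = ET.fromstring(xml_source)
    result = []
    for p in root.iter(f"{{{W}}}p"):
        p_style = p.find("w:pPr/w:pStyle", NS)
        style_id = p_style.get(f"{{{W}}}val") if p_style is not None else None
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        result.append((style_id, text))
    return result


def paragraph_styles_in_docx(path):
    """Hypothetical helper: run the same scan on word/document.xml inside a .docx."""
    with zipfile.ZipFile(path) as archive:
        return paragraph_styles(archive.read("word/document.xml"))


# Trimmed copy of the table cell shown above.
SAMPLE = """
<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:p>
    <w:pPr><w:spacing w:after="60"/></w:pPr>
    <w:r><w:t>Paragraph without a bullet.</w:t></w:r>
  </w:p>
  <w:p>
    <w:pPr><w:pStyle w:val="ListParagraph"/></w:pPr>
    <w:r><w:t>Bulleted entry immediately below it</w:t></w:r>
  </w:p>
</w:tc>
"""

for style_id, text in paragraph_styles(SAMPLE):
    print(style_id, "|", text)
```

Running `paragraph_styles_in_docx` on the file your script actually opens shows how many paragraphs are on disk, independent of anything the script does after loading.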

scanny commented 2 weeks ago

@philCryoport I'm not sure what to tell you. The cell in memory only has a single paragraph:

<w:tc>
  <w:tcPr>...</w:tcPr>
  <w:p>
    <w:r>
      <w:t>Paragraph without a bullet.</w:t>
      <w:br/>
      <w:t>Bulleted entry immediately below it</w:t>
    </w:r>
  </w:p>
</w:tc>

This corresponds to what python-docx is reporting.

python-docx does not automatically modify the underlying XML, so what you're seeing when you print out element.xml is exactly what is there in the file unless your code modified it in some way.

Transformations to MHT are something else entirely and have nothing to do with python-docx. python-docx reads the .docx file, manipulates it as instructed, and saves it when instructed.

I'd be checking my code to make sure I was loading the file I thought I was and not a prior or later version. Also, if there is more than one version of that table, like within revision marks or as alternate content, that could have something to do with it.

philCryoport commented 2 weeks ago

Hi @scanny -- here's a pared down file. I created it by starting a new document and then only copying over the table in question.

reduced.docx

scanny commented 2 weeks ago

This code:

from docx import Document

document = Document("reduced.docx")
cell = document.tables[0].rows[1].cells[1]
for p in cell.paragraphs:
    print(f"{p.style.name=} : {p.text=}")

produces this output:

p.style.name='Normal' : p.text='Paragraph without a bullet.'
p.style.name='List Paragraph' : p.text='Bulleted entry immediately below it'
p.style.name='Normal' : p.text=''
p.style.name='Normal' : p.text=''

which is the expected behavior and does not match what you posted above.

I think I'm going to have to leave this with you to figure out. I don't see any unexpected python-docx behavior.

philCryoport commented 2 weeks ago

Yup, you're right. Now I gotta see what I'm doing in my existing code that's stripping the formatting out of the cell :(

Thank you @scanny for your help!

scanny commented 2 weeks ago

If you're setting cell.text, that's going to produce a single paragraph, as noted in the documentation here: https://python-docx.readthedocs.io/en/latest/api/table.html#docx.table._Cell.text

philCryoport commented 2 weeks ago

> If you're setting cell.text, that's going to produce a single paragraph, as noted in the documentation here: python-docx.readthedocs.io/en/latest/api/table.html

Ding ding ding ding ding! Give @scanny a cigar!

Yep, I was messing with cell.text to "clean up" a row cell-by-cell (removing leading and trailing carriage returns, for instance).

Again, thank you @scanny for your responsiveness to my questions!

scanny commented 2 weeks ago

Glad you got it working @philCryoport :)