Closed philCryoport closed 2 weeks ago
Post the output of:
print(row.cells[1]._tc.xml)
and let's have a look at the underlying XML.
Hi @scanny
<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:oel="http://schemas.microsoft.com/office/2019/extlst" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml" xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" 
xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
<w:tcPr>
<w:tcW w:w="3509" w:type="dxa"/>
<w:gridSpan w:val="2"/>
<w:shd w:val="clear" w:color="auto" w:fill="FFFFFF" w:themeFill="background1"/>
<w:vAlign w:val="center"/>
</w:tcPr>
<w:p>
<w:r>
<w:t>Paragraph without a bullet.</w:t>
<w:br/>
<w:t>Bulleted entry immediately below it</w:t>
</w:r>
</w:p>
</w:tc>
IDGI -- Word says the style for the "Bulleted entry immediately below it" is:
It's also there in the MHT file as <p class=MsoListParagraphCxSpFirst
So why doesn't python-docx see it?
DIGGING MORE:
I unzipped the .docx file and looked at word/document.xml.
The non-bulleted entry has no declared style.
The bulleted entry declares ListParagraph as the style.
<w:tc>
<w:tcPr>
<w:tcW w:type="dxa" w:w="3509"/>
<w:gridSpan w:val="2"/>
<w:shd w:color="auto" w:fill="FFFFFF" w:themeFill="background1" w:val="clear"/>
<w:vAlign w:val="center"/>
</w:tcPr>
<w:p w14:paraId="70F0B702" w14:textId="5C8CEF2A" w:rsidP="00F472AD" w:rsidR="005E24FF" w:rsidRDefault="005E24FF" w:rsidRPr="00755C8B">
<w:pPr> <!-- NO STYLE DECLARED FOR THE NON-BULLETED TEXT ?!?-->
<w:spacing w:after="60"/>
<w:rPr>
<w:rFonts w:cs="Arial"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="00755C8B">
<w:rPr>
<w:rFonts w:cs="Arial"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
</w:rPr>
<w:t>Paragraph without a bullet.</w:t>
</w:r>
</w:p>
<w:p w14:paraId="5DA9B63F" w14:textId="77777777" w:rsidP="00AB4472" w:rsidR="005E24FF" w:rsidRDefault="005E24FF" w:rsidRPr="00755C8B">
<w:pPr>
<w:pStyle w:val="ListParagraph"/> <!-- HERE IS THE LIST PARAGRAPH STYLE-->
<w:numPr>
<w:ilvl w:val="0"/>
<w:numId w:val="61"/>
</w:numPr>
<w:spacing w:after="60"/>
<w:rPr>
<w:rFonts w:cs="Arial"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="00755C8B">
<w:rPr>
<w:rFonts w:cs="Arial"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
</w:rPr>
<w:t>Bulleted entry immediately below it</w:t>
</w:r>
</w:p>
</w:tc>
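The manual check above (unzip, open word/document.xml, look for `w:pStyle`) can also be scripted. This is only a sketch using the standard library, not anything python-docx provides; `paragraph_styles` is a hypothetical helper name, and it reports the raw style id (e.g. `ListParagraph`) rather than the display name Word shows.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as declared on the <w:tc> element above.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_styles(docx_file):
    """Return (style_id, text) for every w:p found in word/document.xml.

    style_id is None when the paragraph declares no w:pStyle.
    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    results = []
    for p in root.iter(f"{{{W_NS}}}p"):
        p_style = p.find(f"{{{W_NS}}}pPr/{{{W_NS}}}pStyle")
        style_id = p_style.get(f"{{{W_NS}}}val") if p_style is not None else None
        # Concatenate the text of all w:t elements inside this paragraph.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        results.append((style_id, text))
    return results
```

Running this against the file on disk and comparing it with what `cell.paragraphs` reports in memory is one way to tell whether the difference comes from the file or from the code that handled it after loading.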
@philCryoport I'm not sure what to tell you. The cell in memory only has a single paragraph:
<w:tc>
<w:tcPr>...</w:tcPr>
<w:p>
<w:r>
<w:t>Paragraph without a bullet.</w:t>
<w:br/>
<w:t>Bulleted entry immediately below it</w:t>
</w:r>
</w:p>
</w:tc>
This corresponds to what python-docx is reporting.
python-docx does not automatically modify the underlying XML, so what you're seeing when you print out element.xml is exactly what is there in the file unless your code modified it in some way.
Transformations to MHT are something else entirely and have nothing to do with python-docx. python-docx reads the .docx file, manipulates it as instructed, and saves it when instructed.
I'd be checking my code to make sure I was loading the file I thought I was and not a prior or later version. Also, if there is more than one version of that table, like within revision marks or as alternate content, that could have something to do with it.
Hi @scanny -- here's a pared down file. I created it by starting a new document and then only copying over the table in question.
This code:
document = Document("reduced.docx")
cell = document.tables[0].rows[1].cells[1]
for p in cell.paragraphs:
print(f"{p.style.name=} : {p.text=}")
produces this output:
p.style.name='Normal' : p.text='Paragraph without a bullet.'
p.style.name='List Paragraph' : p.text='Bulleted entry immediately below it'
p.style.name='Normal' : p.text=''
p.style.name='Normal' : p.text=''
which is the expected behavior and does not match what you posted above.
I think I'm going to have to leave this with you to figure out. I don't see any unexpected python-docx behavior.
Yup, you're right. Now I gotta see what I'm doing in my existing code that's stripping the formatting out of the cell :(
Thank you @scanny for your help!
If you're setting cell.text, that's going to produce a single paragraph, as noted in the documentation here: https://python-docx.readthedocs.io/en/latest/api/table.html#docx.table._Cell.text
Ding ding ding ding ding! Give @scanny a cigar!
Yep, I was messing with cell.text to "clean up" a row cell-by-cell (removing pre- and post- carriage returns, for instance).
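For that kind of clean-up, one option is to trim whitespace paragraph by paragraph instead of assigning cell.text, so each paragraph keeps its own style. This is a minimal sketch rather than an official recipe: `strip_cell_whitespace` is a hypothetical helper, it relies only on the `paragraphs`, `runs`, and `text` attributes that python-docx exposes, and it trims only the first and last run of each paragraph.

```python
def strip_cell_whitespace(cell):
    """Trim leading/trailing whitespace in a table cell without using cell.text.

    `cell` is expected to be a python-docx _Cell. Each paragraph is handled
    separately, so paragraph styles such as "List Paragraph" are left alone.
    """
    for paragraph in cell.paragraphs:
        runs = paragraph.runs
        if not runs:
            # Empty paragraph: nothing to trim.
            continue
        # Trim only the outer edges of the paragraph; inner runs are untouched.
        runs[0].text = runs[0].text.lstrip()
        runs[-1].text = runs[-1].text.rstrip()
```

It would be called once per cell, for example `for cell in row.cells: strip_cell_whitespace(cell)`. Note that assigning `run.text` replaces that run's content, so a leading or trailing line break inside the run is dropped along with other whitespace; empty paragraphs are left in place.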
Again, thank you @scanny for your responsiveness to my questions!
Glad you got it working @philCryoport :)
Within a table in a docx, here's the content as seen in Word:
Exported docx to mht, here's the relevant section:
I retrieve the row of the table, grab the second cell (where this content is located) and attempt to retrieve the text.
When I look at the content in Word -- and when I look at the above HTML -- I would think python-docx would interpret it as either:
(a) multiple paragraphs (one paragraph style "Normal", the other paragraph style "List Paragraph")
...or...
(b) single paragraph with multiple runs (one run style "Normal", the other run style "List Paragraph")
NOPE
Python-docx sees a single paragraph with a single run -- with style "Normal"
Here, let me show you.
Is it multiple paragraphs?
NOPE: Python-docx only sees a single paragraph. Here's the output:
Is it a single paragraph with multiple runs?
NOPE: Python-docx only sees a single run within the single paragraph. Here's the output:
HELP!? What am I doing wrong? How do I get Python-docx to detect this content as two separate sets of text:
First: "Paragraph without a bullet." with style "Normal"
Second: "Bulleted entry immediately below it" with style "List Paragraph"