`doc.paragraphs` does not seem to include content inside a `<mc:AlternateContent>` tag

alephpi commented 6 months ago

e.g.

        <mc:AlternateContent>
            <mc:Choice Requires="wps">
                <w:drawing>
                    <wp:anchor allowOverlap="true" layoutInCell="true" locked="false"
                        simplePos="false" behindDoc="false" relativeHeight="251659264" distL="0"
                        distR="0" distT="0" distB="0">
                        <wp:simplePos x="0" y="0" />
                        <wp:positionH relativeFrom="column">
                            <wp:posOffset>9512300</wp:posOffset>
                        </wp:positionH>
                        <wp:positionV relativeFrom="paragraph">
                            <wp:posOffset>0</wp:posOffset>
                        </wp:positionV>
                        <wp:extent cx="2603500" cy="1422400" />
                        <wp:wrapTopAndBottom />
                        <wp:docPr id="17" name="文本框 17" />
                        <a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
                            <a:graphicData
                                uri="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
                                <wps:wsp>
                                    <wps:cNvSpPr txBox="true" />
                                    <wps:spPr>
                                        <a:xfrm>
                                            <a:off x="0" y="0" />
                                            <a:ext cx="2603500" cy="1422400" />
                                        </a:xfrm>
                                        <a:prstGeom prst="rect">
                                            <a:avLst />
                                        </a:prstGeom>
                                        <a:noFill />
                                        <a:ln w="6350">
                                            <a:noFill />
                                        </a:ln>
                                    </wps:spPr>
                                    <wps:txbx>
                                        <w:txbxContent>
                                            <w:p>
                                                <w:pPr>
                                                    <w:wordWrap w:val="on" />
                                                    <w:autoSpaceDE w:val="off" />
                                                    <w:autoSpaceDN w:val="off" />
                                                    <w:spacing w:before="0" w:after="0"
                                                        w:line="2240" w:lineRule="atLeast" />
                                                    <w:ind w:left="0" w:right="0" />
                                                    <w:jc w:val="both" />
                                                    <w:textAlignment w:val="auto" />
                                                    <w:rPr>
                                                        <w:sz w:val="136" />
                                                    </w:rPr>
                                                </w:pPr>
                                                <w:r>
                                                    <w:rPr>
                                                        <w:rFonts w:ascii="宋体" w:hAnsi="宋体"
                                                            w:cs="宋体" w:eastAsia="宋体" />
                                                        <w:sz w:val="136" />
                                                        <w:color w:val="000000" />
                                                        <w:b w:val="off" />
                                                        <w:i w:val="off" />
                                                        <w:strike w:val="off" />
                                                    </w:rPr>
                                                    <w:t>许嘉璐</w:t>
                                                </w:r>
                                            </w:p>
                                        </w:txbxContent>
                                    </wps:txbx>
                                    <wps:bodyPr wrap="square" lIns="0" rIns="0" tIns="0" bIns="0"
                                        vert="horz" anchor="t">
                                        <a:spAutoFit />
                                    </wps:bodyPr>
                                </wps:wsp>
                            </a:graphicData>
                        </a:graphic>
                    </wp:anchor>
                </w:drawing>
            </mc:Choice>
            <mc:Fallback>
                <w:pict>
                    <v:shape type="#_x0000_t202" filled="f" stroked="f"
                        style="margin-left:749pt;margin-top:0pt;width:205pt;height:112pt;mso-position-vertical:absolute;mso-position-vertical-relative:text;mso-position-horizontal:absolute;mso-position-horizontal-relative:text;mso-wrap-style:square;position:absolute;v-text-anchor:top;">
                        <w10:wrap type="topAndBottom" />
                        <v:textbox inset="0pt,0pt,0pt,0pt" style="mso-fit-shape-to-text:t">
                            <w:txbxContent>
                                <w:p>
                                    <w:pPr>
                                        <w:wordWrap w:val="on" />
                                        <w:autoSpaceDE w:val="off" />
                                        <w:autoSpaceDN w:val="off" />
                                        <w:spacing w:before="0" w:after="0" w:line="2240"
                                            w:lineRule="atLeast" />
                                        <w:ind w:left="0" w:right="0" />
                                        <w:jc w:val="both" />
                                        <w:textAlignment w:val="auto" />
                                        <w:rPr>
                                            <w:sz w:val="136" />
                                        </w:rPr>
                                    </w:pPr>
                                    <w:r>
                                        <w:rPr>
                                            <w:rFonts w:ascii="宋体" w:hAnsi="宋体" w:cs="宋体"
                                                w:eastAsia="宋体" />
                                            <w:sz w:val="136" />
                                            <w:color w:val="000000" />
                                            <w:b w:val="off" />
                                            <w:i w:val="off" />
                                            <w:strike w:val="off" />
                                        </w:rPr>
                                        <w:t>许嘉璐</w:t>
                                    </w:r>
                                </w:p>
                            </w:txbxContent>
                        </v:textbox>
                    </v:shape>
                </w:pict>
            </mc:Fallback>
        </mc:AlternateContent>

Is there a workaround to access the text inside it? If we are willing to sacrifice the custom formatting (e.g. what is used in `wps`), can we simply remove the custom tags and keep only the content inside `mc:Fallback` so that it works with python-docx? If such removal is the preferred route, how can we make sure the file is still readable by python-docx afterwards? Which tool should we use for the operation: can MS Word do the job, or should we just match the elements with lxml? Or do you have any other suggestions?

I'm new to the docx format, so I would appreciate your help!

Thank you in advance!

alephpi commented 6 months ago

After some research, it seems that python-docx (or rather python-opc) cannot parse tags from the markup compatibility namespace, so those nodes are completely ignored in the XML tree and hence inaccessible.

scanny commented 6 months ago

@alephpi python-docx (lxml, actually) parses all XML tags. It's just that not all of them have a custom element class (like `CT_Paragraph`) or a proxy class (like `Paragraph`).
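To illustrate that point, here is a minimal sketch (not taken from python-docx) that parses a trimmed-down fragment and lists the markup-compatibility elements present in the tree. It uses the standard-library `xml.etree.ElementTree` so that it runs without extra packages; lxml elements provide the same `iter()` method. The fragment and the helper name `mc_tags` are made up for illustration.

```python
import xml.etree.ElementTree as ET

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"

# Hypothetical, heavily trimmed version of the XML in the question.
FRAGMENT = """\
<w:r xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
     xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
  <mc:AlternateContent>
    <mc:Choice Requires="wps"><w:drawing/></mc:Choice>
    <mc:Fallback><w:pict/></mc:Fallback>
  </mc:AlternateContent>
</w:r>
"""


def mc_tags(root):
    """Return the local names of all markup-compatibility elements under root."""
    prefix = "{%s}" % MC_NS
    return [
        el.tag[len(prefix):]
        for el in root.iter()
        # Skip comment/processing-instruction nodes, whose tag is not a string.
        if isinstance(el.tag, str) and el.tag.startswith(prefix)
    ]


print(mc_tags(ET.fromstring(FRAGMENT)))
# ['AlternateContent', 'Choice', 'Fallback']
```

The elements are ordinary nodes in the parsed tree; they just have no dedicated wrapper class.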

Generally the approach for this sort of thing is to get as close as possible using python-docx, like `p = paragraph._p` for a `<w:p>` element, for example, and then use XPath to get the items of interest and `lxml.etree._Element` methods to work on those elements or attributes.
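A rough sketch of that approach, under a few assumptions: the helper name `alternate_content_texts`, the `branch` parameter, and the sample XML are all hypothetical, and the sample is a simplified stand-in for the structure in the question. The helper relies only on `iter()` and `findall()` with an explicit namespace map, which both lxml and the standard-library `ElementTree` provide, so it does not depend on whatever prefixes python-docx's own `xpath()` wrapper knows about. Reading only one branch avoids getting the same text twice, since `mc:Choice` and `mc:Fallback` normally carry the same content.

```python
import xml.etree.ElementTree as ET

NS = {
    "mc": "http://schemas.openxmlformats.org/markup-compatibility/2006",
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
}
W = "{%s}" % NS["w"]
MC = "{%s}" % NS["mc"]


def alternate_content_texts(element, branch="Fallback"):
    """Return the text of each <w:p> inside the chosen branch ("Fallback" or
    "Choice") of every <mc:AlternateContent> at or below `element`."""
    texts = []
    for alt in element.iter(MC + "AlternateContent"):
        for chosen in alt.findall("mc:" + branch, namespaces=NS):
            for p in chosen.iter(W + "p"):
                texts.append("".join(t.text or "" for t in p.iter(W + "t")))
    return texts


# Simplified sample; a real document has many more wrapper elements.
SAMPLE = """\
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
     xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
     xmlns:v="urn:schemas-microsoft-com:vml">
  <w:r>
    <mc:AlternateContent>
      <mc:Choice Requires="wps">
        <w:drawing><w:txbxContent>
          <w:p><w:r><w:t>许嘉璐</w:t></w:r></w:p>
        </w:txbxContent></w:drawing>
      </mc:Choice>
      <mc:Fallback>
        <w:pict><v:textbox><w:txbxContent>
          <w:p><w:r><w:t>许嘉璐</w:t></w:r></w:p>
        </w:txbxContent></v:textbox></w:pict>
      </mc:Fallback>
    </mc:AlternateContent>
  </w:r>
</w:p>
"""

print(alternate_content_texts(ET.fromstring(SAMPLE)))

# With python-docx installed, the same helper should accept its lxml elements:
#   doc = docx.Document("example.docx")        # hypothetical file name
#   alternate_content_texts(doc.element.body)  # whole body
#   alternate_content_texts(paragraph._p)      # a single paragraph
```

Text boxes nested inside other text boxes would be reported more than once by this sketch, so it would need refinement for such documents.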

python-openxml / python-docx, issue #1389