Open alephpi opened 6 months ago
After some research, it seems that `python-docx` (or rather `python-opc`) cannot parse tags from the markup compatibility namespace, so those nodes are completely ignored in the XML tree and are therefore inaccessible.
@alephpi `python-docx` (lxml, actually) parses all XML tags. It's just that not all of them have a custom element class (like `CT_Paragraph`) or a proxy class (like `Paragraph`).
Generally the approach for this sort of thing is to get as close as possible using `python-docx`, like `p = paragraph._p` for a `<w:p>` element for example, and then use XPath to get the items of interest and `lxml.etree._Element` methods to work on those elements or attributes.
e.g.
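A minimal sketch of that approach, under some assumptions: the helper name `fallback_texts` and the sample XML are hypothetical, and the sample is a simplified stand-in for what `paragraph._p` would hold in a real document. It uses a compiled `lxml.etree.XPath` with an explicit namespace map, since the element's own `xpath()` wrapper may not know the `mc` prefix.

```python
from lxml import etree

# Namespace URIs for WordprocessingML, markup compatibility, and VML.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"
V_NS = "urn:schemas-microsoft-com:vml"
NS = {"w": W_NS, "mc": MC_NS}

# Compiled XPath: every <w:t> located anywhere below an <mc:Fallback>.
_FALLBACK_TEXT = etree.XPath(".//mc:Fallback//w:t", namespaces=NS)


def fallback_texts(element):
    """Return the text of each <w:t> found under an <mc:Fallback> in `element`.

    Hypothetical helper; `element` can be any lxml element, for example
    `paragraph._p` from python-docx.
    """
    return [t.text or "" for t in _FALLBACK_TEXT(element)]


# Simplified, hand-written stand-in for `paragraph._p` (an assumption,
# not taken from a real document).
SAMPLE_P = etree.fromstring(
    f'<w:p xmlns:w="{W_NS}" xmlns:mc="{MC_NS}" xmlns:v="{V_NS}">'
    "<w:r><w:t>Outside</w:t></w:r>"
    "<w:r><mc:AlternateContent>"
    '<mc:Choice Requires="wps"><w:drawing/></mc:Choice>'
    "<mc:Fallback><w:pict><v:textbox><w:txbxContent>"
    "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
    "</w:txbxContent></v:textbox></w:pict></mc:Fallback>"
    "</mc:AlternateContent></w:r>"
    "</w:p>"
)

print(fallback_texts(SAMPLE_P))
```

With a real document, the same call would presumably look like `fallback_texts(paragraph._p)` for each paragraph of interest.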
Is there any workaround to access the text inside it? If we accept losing the custom formats (e.g. those used in 'wps'), can we simply remove the custom tag and keep only the content inside `mc:Fallback` so that it works with `python-docx`? If such removal is the preferred route, how can we make sure the file is still readable by `python-docx` after the removal? Which tool should we use for the operation, i.e. can MS Word do the job, or do we just match it with lxml? Or do you have any other suggestions? I'm new to the docx format, so I'd appreciate your help!
Thank you in advance!