python-openxml / python-docx

Create and modify Word documents with Python
MIT License

xpath on single node <w:lastRenderedPageBreak/> #211

Open mastier opened 9 years ago

mastier commented 9 years ago

I want to use XPath to search for an element taken from a document. Is the tag "<w:lastRenderedPageBreak/>" broken? Should it be "<w:lastRenderedPageBreak />" instead (with a space before "/>", which would make it Micro$oft's fault :grin:)? I ask because I cannot look it up with xpath():

>>> from docx import Document
>>> d = Document('songs.docx')
>>> d.element.xpath('//w:lastRenderedPageBreak')
[]

Here is the excerpt from word/document.xml

<w:p w:rsidR="00142947" w:rsidRDefault="00DF39E9">
  <w:pPr>
    <w:pStyle w:val="Nagwek1"/>
    <w:numPr>
      <w:ilvl w:val="0"/>
      <w:numId w:val="1"/>
    </w:numPr>
  </w:pPr>
  <w:bookmarkStart w:id="7" w:name="_Toc414712600"/>
  <w:bookmarkStart w:id="8" w:name="__RefHeading__17378_533069858"/>
  <w:bookmarkStart w:id="9" w:name="_Toc415065229"/>
  <w:bookmarkEnd w:id="7"/>
  <w:bookmarkEnd w:id="8"/>
  <w:r>
    <w:lastRenderedPageBreak/>
    <w:t>5’nizza - Soldat</w:t>
  </w:r>
  <w:bookmarkEnd w:id="9"/>
</w:p>

scanny commented 9 years ago

Hmm, looks like that should work. The <w:lastRenderedPageBreak/> tag won't show up in every document, so you need to keep that in mind. The result you're getting is what you'd expect for a document without any of those tags.

Does it work with the expression '//w:p'?
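
As a quick sanity check on the whitespace question, here is a minimal sketch using only the standard library's xml.etree.ElementTree (an assumption for portability; it is not python-docx's own xpath() method). The helper count_breaks and the sample paragraph are hypothetical. It suggests that the tag is matched with or without a space before "/>", and that an empty result simply means no such element was found in the parsed XML:

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NSMAP = {"w": W_NS}


def count_breaks(xml_text):
    """Count w:lastRenderedPageBreak elements anywhere below the root."""
    root = ET.fromstring(xml_text)
    return len(root.findall(".//w:lastRenderedPageBreak", NSMAP))


# Hypothetical one-run paragraph; {tag} is swapped for each variant.
TEMPLATE = '<w:p xmlns:w="{ns}"><w:r>{tag}<w:t>text</w:t></w:r></w:p>'

without_space = TEMPLATE.format(ns=W_NS, tag="<w:lastRenderedPageBreak/>")
with_space = TEMPLATE.format(ns=W_NS, tag="<w:lastRenderedPageBreak />")
no_break = TEMPLATE.format(ns=W_NS, tag="")

# Both spellings of the empty-element tag are equivalent XML.
print(count_breaks(without_space), count_breaks(with_space), count_breaks(no_break))
```

If the same query returns nothing on a real file, the more likely explanations are that the element is absent from the part being searched or that the query is not reaching the expected tree, rather than the tag syntax.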

mastier commented 9 years ago

Yeah, I know. I want to parse a specific one :-) Actually, my aim is to retrieve <w:t>5’nizza - Soldat</w:t> between the bookmarks.

This of course works:

In [25]: d.element.xpath('//w:p')
Out[25]: 
[<CT_P '<w:p>' at 0x7fac4451faf8>,
<CT_P '<w:p>' at 0x7fac4451fb50>,
<CT_P '<w:p>' at 0x7fac4451fc58>,
<CT_P '<w:p>' at 0x7fac4451fcb0>,
<CT_P '<w:p>' at 0x7fac4451fd08>,
<CT_P '<w:p>' at 0x7fac4451fd60>,
<CT_P '<w:p>' at 0x7fac4451fdb8>,
<CT_P '<w:p>' at 0x7fac4451fe10>,
<CT_P '<w:p>' at 0x7fac4451fe68>,
<CT_P '<w:p>' at 0x7fac4451fec0>,
<CT_P '<w:p>' at 0x7fac4451ff18>,
<CT_P '<w:p>' at 0x7fac4451ff70>,
<CT_P '<w:p>' at 0x7fac4451ffc8>,
<CT_P '<w:p>' at 0x7fac44463050>,
<CT_P '<w:p>' at 0x7fac444630a8>,
<CT_P '<w:p>' at 0x7fac44463100>,
<CT_P '<w:p>' at 0x7fac44463158>,
<CT_P '<w:p>' at 0x7fac444631b0>,
<CT_P '<w:p>' at 0x7fac44463208>,
<CT_P '<w:p>' at 0x7fac44463260>,
<CT_P '<w:p>' at 0x7fac444632b8>,
...
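
For the stated aim of pulling the text that sits between a bookmark's start and end, here is one possible sketch. It uses the standard library's xml.etree.ElementTree rather than python-docx, and it assumes the w:bookmarkStart and w:bookmarkEnd elements are direct children of the same paragraph, as in the excerpt above. The function text_in_bookmark is a hypothetical helper, not part of any library:

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS  # ElementTree spells namespaced names as "{uri}local"

# Paragraph from the excerpt above, trimmed to the bookmarks and the run.
# \u2019 is the right single quotation mark in the song title.
PARAGRAPH_XML = """\
<w:p xmlns:w="%s">
  <w:bookmarkStart w:id="7" w:name="_Toc414712600"/>
  <w:bookmarkStart w:id="8" w:name="__RefHeading__17378_533069858"/>
  <w:bookmarkStart w:id="9" w:name="_Toc415065229"/>
  <w:bookmarkEnd w:id="7"/>
  <w:bookmarkEnd w:id="8"/>
  <w:r>
    <w:lastRenderedPageBreak/>
    <w:t>5\u2019nizza - Soldat</w:t>
  </w:r>
  <w:bookmarkEnd w:id="9"/>
</w:p>
""" % W_NS


def text_in_bookmark(paragraph, name):
    """Join the w:t text of runs between the named bookmark's start and end.

    Assumption: start and end markers are direct children of `paragraph`.
    """
    bookmark_id = None
    inside = False
    parts = []
    for child in paragraph:
        if child.tag == W + "bookmarkStart" and child.get(W + "name") == name:
            bookmark_id = child.get(W + "id")
            inside = True
        elif (inside and child.tag == W + "bookmarkEnd"
                and child.get(W + "id") == bookmark_id):
            break  # reached the matching end marker
        elif inside and child.tag == W + "r":
            parts.extend(t.text or "" for t in child.iter(W + "t"))
    return "".join(parts)


paragraph = ET.fromstring(PARAGRAPH_XML)
title = text_in_bookmark(paragraph, "_Toc415065229")
empty = text_in_bookmark(paragraph, "_Toc414712600")
print(title)
```

Only the bookmark with id 9 encloses the run here; the bookmarks with ids 7 and 8 end before it, so looking them up yields an empty string. Bookmarks that span several paragraphs would need a walk over the whole document body instead.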