Open Radcliffe opened 8 years ago
I solved the problem by dropping python-docx and using lxml and zipfile directly. I hadn't realized that a Microsoft Word document is just a zipped archive of XML files! But it would be nice if python-docx had support for reading form inputs.
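To see the "zipped archive of XML files" point for yourself, something like the following lists the parts inside a .docx. It is only a sketch: list_docx_parts is a hypothetical helper name, and the archive here is a tiny stand-in built in memory so the example runs without a real Word file.

```python
import io
import zipfile

def list_docx_parts(docx_file):
    """Return the names of the parts inside a .docx, which is a zip archive."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()

# Build a tiny stand-in archive in memory (not a valid Word document).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")

buffer.seek(0)
parts = list_docx_parts(buffer)
print(parts)
```

With a real file you would pass its path instead of the buffer; the main body text lives in word/document.xml.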
@Radcliffe for this sort of thing most folks use python-docx to do the heavy lifting of getting you close in the XML hierarchy, then going to lxml calls for any unimplemented bits. How and where do checkboxes show up in the XML? It could be handy for other folks to know who come across this requirement :)
Having the same issue here... My checkboxes are in a table cell. I have the cell object now. Can you get the raw data from the cell object?
I have the following code using the lxml lib:
import zipfile
from lxml import etree

def get_word_xml(docx_filename):
    # ZipFile opens the archive in binary mode itself
    with zipfile.ZipFile(docx_filename) as archive:
        return archive.read('word/document.xml')

def _check_element_is(element, type_char):
    word_schema = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    return element.tag == '{%s}%s' % (word_schema, type_char)

xml_content = get_word_xml('./example.docx')
xml_tree = etree.fromstring(xml_content)

def isChecked(checkbox):
    # A checkbox counts as checked when it has a w:checked child
    for child in checkbox:
        if _check_element_is(child, 'checked'):
            return True
    return False

def checkboxValuesInElement(el):
    # Map the position of each direct w:checkBox child to its state
    retVal = {}
    i = 0
    for child in el:
        if _check_element_is(child, 'checkBox'):
            retVal[i] = isChecked(child)
            i += 1
    return retVal
I tried to fork this project and make a pull request, but I couldn't figure out how to contribute. Are there docs on that?
Otherwise, a checkbox generally looks like this:
<w:checkBox>
    <w:sizeAuto/>
    <w:default w:val="0"/>
    <w:checked/>
</w:checkBox>
and it only has the <w:checked/> child if it is checked. Otherwise it is unchecked. Mine happen to be in a table, but it could be in a body as well.
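A minimal sketch of that rule, using the standard library's xml.etree.ElementTree so it runs without extra packages. The wrapper element and sample XML are hand-written for illustration, legacy_checkbox_is_checked is a hypothetical name, and the handling of an explicit w:val on w:checked is an assumption that goes beyond what is described above.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def legacy_checkbox_is_checked(checkbox):
    """Apply the rule from the thread: checked when a w:checked child exists."""
    checked = checkbox.find("{%s}checked" % W)
    if checked is None:
        return False
    # Assumption: an explicit w:val of "0", "false" or "off" switches it off.
    return checked.get("{%s}val" % W, "1") not in ("0", "false", "off")

# Hand-written sample; the <sample> wrapper is not a real WordprocessingML element.
SAMPLE = (
    '<sample xmlns:w="%s">'
    '<w:checkBox><w:sizeAuto/><w:default w:val="0"/><w:checked/></w:checkBox>'
    '<w:checkBox><w:sizeAuto/><w:default w:val="0"/></w:checkBox>'
    '</sample>' % W
)

root = ET.fromstring(SAMPLE)
states = [legacy_checkbox_is_checked(cb) for cb in root.iter("{%s}checkBox" % W)]
print(states)
```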
Another possible enhancement I thought might be nice would be a way to get the path to an element, so it can be passed to lxml when lxml is better suited to the job. It would be nice to be able to start from these objects though; for example, Table.raw would give the XML for the Table and all its child elements.
Getting a commit on a new feature starts with writing an enhancement proposal, aka. an "analysis page". This is a recent one, although they're not all that long: http://python-docx.readthedocs.io/en/latest/dev/analysis/features/header.html
It also requires acceptance and unit tests in addition to the code.
We'd need to know the ancestors of the w:checkBox element to understand where it might end up in the API.
python-docx can do a lot of the heavy lifting for you on this sort of thing. This is roughly equivalent to your 25 lines or so:
from docx import Document
from docx.oxml.ns import qn

document = Document('example.docx')
doc_elm = document._element
checkBoxes = doc_elm.xpath('.//w:checkBox')
for checkBox in checkBoxes:
    print('checkBox value is %s' % checkBox.get(qn('w:val')))
Ok cool. So the only issue then would be marrying the checkbox to a string value somewhere. Do these nodes have a .children, .parent, or .siblings property?
Each element is a subclass of lxml.etree._Element, so all the members on that class are available to you: http://lxml.de/3.7/api/index.html
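In lxml terms, the parent is getparent(), siblings are getprevious() and getnext(), and the children come from iterating the element. Here is a small sketch on a hand-written paragraph; the nesting of w:checkBox under w:ffData and w:fldChar is an assumption made for the sample, so check it against your own document.xml.

```python
from lxml import etree

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written sample; the ancestor chain is assumed, not taken from the thread.
SAMPLE = (
    '<w:p xmlns:w="%s"><w:r><w:fldChar><w:ffData>'
    '<w:name w:val="Check1"/><w:checkBox><w:checked/></w:checkBox>'
    '</w:ffData></w:fldChar></w:r></w:p>' % W
)

paragraph = etree.fromstring(SAMPLE)
check_box = paragraph.find(".//{%s}checkBox" % W)

parent = check_box.getparent()        # the enclosing element
previous = check_box.getprevious()    # preceding sibling, or None
children = list(check_box)            # iterating an element yields its children
# Ancestors from nearest to farthest, reduced to their local tag names.
ancestors = [etree.QName(a).localname for a in check_box.iterancestors()]
print(ancestors)
```

A sibling such as the w:name element here is one place a label for the checkbox might be found.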
I might start by understanding the ancestors a bit better, like maybe printing out the XML for a paragraph that contains one:
document = Document('example.docx')
for paragraph in document.paragraphs:
    p = paragraph._element
    checkBoxes = p.xpath('.//w:checkBox')
    if checkBoxes:
        print(p.xml)
        break
Hi, Scanny,
Thanks for your great work. I tried the checkbox code, but I noticed that my document.xml does not use w:checkBox. Instead, it uses w14:checkbox. I tried p.xpath('.//w14:checkbox') but it complains about:
  File "/home/wailoktam/.local/lib/python3.5/site-packages/docx/oxml/xmlchemy.py", line 751, in xpath
    xpath_str, namespaces=nsmap
  File "src/lxml/etree.pyx", line 1577, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Undefined namespace prefix
I then tried:
checkBoxes = p.xpath('.//w14:checkbox', namespaces={'w14':'http://schemas.microsoft.com/office/word/2010/wordml'})
But then it complains about having the extra namespaces keyword argument.
What have I done wrong?
Many thanks in advance.
@wailoktam did you ever get a resolution to this? I'm solving the same problem, and I come to exactly the same conclusion you do in terms of how to adapt Scanny's code.
This is what I did to get python-docx to handle the checkbox tag (the 21xx numbers are line numbers from the history command):
cd python3.5/site-packages/docx/oxml
2133 cd .local/lib/python3.5/site-packages/docx/oxml
2134 ls
2135 cat ns.py
Then, in ns.py, add to nsmap: 'w14': ('http://schemas.microsoft.com/office/word/2010/wordml')
Hope it helps.
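An alternative that avoids editing the installed ns.py might be to compile the expression with lxml's etree.XPath, which takes its own namespaces mapping. The sketch below runs on a hand-written sample element; with python-docx, p would presumably be paragraph._element, but that combination is an assumption and is not tested here.

```python
from lxml import etree

NSMAP = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "w14": "http://schemas.microsoft.com/office/word/2010/wordml",
}

# Compile once; the namespace map travels with the compiled expression.
find_w14_checkboxes = etree.XPath(".//w14:checkbox", namespaces=NSMAP)

# Hand-written sample standing in for a paragraph element.
SAMPLE = (
    '<w:p xmlns:w="%(w)s" xmlns:w14="%(w14)s"><w:sdt><w:sdtPr>'
    '<w14:checkbox><w14:checked w14:val="1"/></w14:checkbox>'
    '</w:sdtPr></w:sdt></w:p>' % NSMAP
)

p = etree.fromstring(SAMPLE)
found = find_w14_checkboxes(p)  # calling the compiled XPath returns a list
print(len(found))
```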
@wailoktam Thanks!
Just wanted to leave some comments after I got my program working, for anyone else who ends up here.
1) For me lxml was the best module to make this work. I can imagine that docx works just as well, maybe better, but it seemed like I had to get to know the lxml module anyway, so I just stayed in that.
2) To interact with the "w14" tags, I needed to modify the XML file prior to creating the XML tree using lxml. I used the following commands to read the file, modify the XML, and create the XML tree. The key line of code that allows lxml to interact with the w14 tag is the one that removes the attribute mc:Ignorable="w14 w15 wp14".
import io
import os
import tempfile
from zipfile import ZipFile
from lxml import etree

# Excerpted from a class method, hence the references to self
self.tmp_dir = tempfile.mkdtemp()
with ZipFile(self.template_file_name) as myzip:
    self.filenames = myzip.namelist()
    myzip.extractall(self.tmp_dir)
file_name = os.path.join(self.tmp_dir, 'word', 'document.xml')
with open(file_name, 'r', encoding='utf8') as f:
    xml_content = f.read()
# Drop the mc:Ignorable attribute before parsing (the key line mentioned above)
xml_content = xml_content.replace('mc:Ignorable="w14 w15 wp14"', '').encode('utf-8')
parser = etree.XMLParser(recover=True, encoding='utf-8')
self.tree = etree.parse(io.BytesIO(xml_content), parser)
self.root = self.tree.getroot()
<w:sdt>
    <w:sdtPr>
        <w:rPr>
            <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
            <w:sz w:val="22"/>
            <w:szCs w:val="22"/>
        </w:rPr>
        <w:alias w:val="user defined checkbox name"/>
        <w:tag w:val="user_defined_tag_name"/>
        <w:id w:val="163048189"/>
        <w:lock w:val="sdtLocked"/>
        <w14:checkbox>
            <w14:checked w14:val="1"/>
            <w14:checkedState w14:val="2612" w14:font="MS Gothic"/>
            <w14:uncheckedState w14:val="2610" w14:font="MS Gothic"/>
        </w14:checkbox>
    </w:sdtPr>
    <w:sdtEndPr/>
    <w:sdtContent>
        <w:r>
            <w:rPr>
                <w:rFonts w:ascii="MS Gothic" w:eastAsia="MS Gothic" w:hAnsi="MS Gothic" w:cs="Arial" w:hint="eastAsia"/>
                <w:sz w:val="22"/>
                <w:szCs w:val="22"/>
            </w:rPr>
            <w:t>☒</w:t>
        </w:r>
    </w:sdtContent>
</w:sdt>
import os
import shutil

# Write the modified tree back over the extracted document.xml
with open(os.path.join(self.tmp_dir, 'word/document.xml'), 'w') as f:
    xmlstr = etree.tostring(self.tree, pretty_print=True)
    f.write(xmlstr.decode('utf-8'))
# Create the new zip file and add all the files into the archive
zip_copy_filename = self.output_file_name
with ZipFile(zip_copy_filename, "w") as docx_file:
    for filename in self.filenames:
        docx_file.write(os.path.join(self.tmp_dir, filename), filename)
# Clean up the temp dir
shutil.rmtree(self.tmp_dir)
I hope the above is helpful to someone. I apologize if the code doesn't simply work when copied and pasted. I extracted the necessary code from my own program while trying to eliminate unnecessary detail. Let me know if it doesn't work and I'll see if I can figure out where the error is.
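For anyone who only needs to read the state of w14:checkbox content controls like the one shown above, here is a minimal sketch that pairs each control's w:tag value with its checked flag. It uses the standard library's xml.etree.ElementTree on a hand-written sample; clark_name and read_sdt_checkboxes are hypothetical helper names, and treating "1" or "true" as checked is an assumption.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "w14": "http://schemas.microsoft.com/office/word/2010/wordml",
}

def clark_name(prefixed_name):
    """Hypothetical helper: turn 'w14:val' into '{namespace-uri}val'."""
    prefix, local = prefixed_name.split(":")
    return "{%s}%s" % (NS[prefix], local)

def read_sdt_checkboxes(root):
    """Map each checkbox content control's w:tag value to True or False."""
    states = {}
    for sdt in root.iter(clark_name("w:sdt")):
        checked = sdt.find("w:sdtPr/w14:checkbox/w14:checked", NS)
        tag = sdt.find("w:sdtPr/w:tag", NS)
        if checked is None or tag is None:
            continue  # not a checkbox control, or no tag to key it by
        value = checked.get(clark_name("w14:val"))
        # Assumption: "1" or "true" means checked.
        states[tag.get(clark_name("w:val"))] = value in ("1", "true")
    return states

# Hand-written sample with one checked and one unchecked control.
SAMPLE = (
    '<w:body xmlns:w="%(w)s" xmlns:w14="%(w14)s">'
    '<w:sdt><w:sdtPr><w:tag w:val="agree"/>'
    '<w14:checkbox><w14:checked w14:val="1"/></w14:checkbox></w:sdtPr></w:sdt>'
    '<w:sdt><w:sdtPr><w:tag w:val="subscribe"/>'
    '<w14:checkbox><w14:checked w14:val="0"/></w14:checkbox></w:sdtPr></w:sdt>'
    '</w:body>' % NS
)

states = read_sdt_checkboxes(ET.fromstring(SAMPLE))
print(states)
```

Keying on w:tag is one way to marry a checkbox to a name; w:alias would work the same way.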
My bad: I inserted the checkboxes from the ActiveX controls, whereas a simple checkbox should have been used. I am leaving the following in place in case someone makes a similar mistake.
I read a simple docx in which I put 1 checkbox in the text and 2 checkboxes in table cells. However, I found that the XML file does not match what has been discussed above. So there is no universal solution across different DOCX versions, is there?
<w:control r:id="rId6" w:name="CheckBox2" w:shapeid="_x0000_i1026"/>
</w:object></w:r></w:p></w:tc><w:tc><w:tcPr><w:tcW w:w="1380" w:type="dxa"/></w:tcPr><w:p><w:r><w:object><v:shape id="_x0000_i1027" o:spt="201" alt="" type="#_x0000_t201" style="height:35.4pt;width:58.2pt;" o:ole="t" filled="f" o:preferrelative="t" stroked="f" coordsize="21600,21600"><v:path/><v:fill on="f" focussize="0,0"/><v:stroke on="f"/><v:imagedata r:id="rId9" o:title=""/><o:lock v:ext="edit" aspectratio="f"/><w10:wrap type="none"/><w10:anchorlock/></v:shape>
<w:control r:id="rId8" w:name="CheckBox3" w:shapeid="_x0000_i1027"/>
</w:object></w:r></w:p></w:tc></w:tr></w:tbl><w:p><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p><w:sectPr><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="851" w:footer="992" w:gutter="0"/><w:cols w:space="425" w:num=
[checkbox.docx](https://github.com/python-openxml/python-docx/files/7776019/checkbox.docx)
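Since three different element names have come up in this thread (w:checkBox, w14:checkbox, and w:control), a quick first step might be to count which of them a given document.xml contains before choosing an approach. This is only a sketch on a hand-written sample: count_checkbox_kinds is a hypothetical name, and a w:control element is not necessarily a checkbox, so that count is just a hint.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W14 = "http://schemas.microsoft.com/office/word/2010/wordml"

# Element names seen in this thread, keyed by their prefixed form.
CHECKBOX_KINDS = {
    "w:checkBox": "{%s}checkBox" % W,
    "w14:checkbox": "{%s}checkbox" % W14,
    "w:control": "{%s}control" % W,  # may be any embedded control, not only a checkbox
}

def count_checkbox_kinds(document_xml):
    """Count how many elements of each kind appear in the given XML text."""
    root = ET.fromstring(document_xml)
    return {
        name: sum(1 for _ in root.iter(tag))
        for name, tag in CHECKBOX_KINDS.items()
    }

# Hand-written sample mixing all three kinds.
SAMPLE = (
    '<w:body xmlns:w="%s" xmlns:w14="%s">'
    '<w:checkBox/><w14:checkbox/>'
    '<w:control w:name="CheckBox2"/><w:control w:name="CheckBox3"/>'
    '</w:body>' % (W, W14)
)

counts = count_checkbox_kinds(SAMPLE)
print(counts)
```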
+1
I am trying to programmatically extract form data from a large number of Word documents. I am able to extract the text data, but not the checkboxes. Is it possible to determine if a checkbox has been checked using python-docx? If so, could someone show me some sample code? Otherwise, what tool would you recommend? Thanks!