python-openxml / python-docx

Create and modify Word documents with Python
MIT License

Feature: Read checkboxes in Word forms #224

Open Radcliffe opened 8 years ago

Radcliffe commented 8 years ago

I am trying to programmatically extract form data from a large number of Word documents. I am able to extract the text data, but not the checkboxes. Is it possible to determine if a checkbox has been checked using python-docx? If so, could someone show me some sample code? Otherwise, what tool would you recommend? Thanks!

Radcliffe commented 8 years ago

I solved the problem by dropping python-docx and using lxml and zipfile directly. I hadn't realized that a Microsoft Word document is just a zipped archive of XML files! But it would be nice if python-docx had support for reading form inputs.

scanny commented 8 years ago

@Radcliffe for this sort of thing most folks use python-docx to do the heavy lifting of getting you close in the XML hierarchy, then going to lxml calls for any unimplemented bits. How and where do checkboxes show up in the XML? It could be handy for other folks to know who come across this requirement :)

jdell64 commented 7 years ago

Having the same issue here... My checkboxes are in a table cell. I have the cell object now. Can you get the raw data from the cell object?

jdell64 commented 7 years ago

I have the following code using the lxml lib:

import zipfile
from lxml import etree


def get_word_xml(docx_filename):
    # ZipFile opens the archive in binary mode itself; handing it a
    # text-mode file object fails on Python 3.
    with zipfile.ZipFile(docx_filename) as docx_zip:
        xml_content = docx_zip.read('word/document.xml')
    return xml_content

# def _check_element_is(self, element, type_char):
def _check_element_is(element, type_char):
    word_schema = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    return element.tag == '{%s}%s' % (word_schema, type_char)

xml_content = get_word_xml('./example.docx')
xml_tree = etree.fromstring(xml_content)

def isChecked(checkbox):
    val = False
    for child in checkbox:
        if _check_element_is(child, 'checked'):
            val = True
            return val
    return val

def checkboxValuesInElement(el):
    retVal = {}
    i = 0
    for child in el:
        if _check_element_is(child, 'checkBox'):
            retVal[i] = isChecked(child)
            i += 1
    return retVal

I tried to fork this project and make a pull request, but I couldn't figure out how to contribute. Are there docs on that?

Otherwise, a checkbox generally looks like this:

<w:checkBox>
   <w:sizeAuto />
   <w:default w:val="0"/>
   <w:checked/>
</w:checkBox>

and it only has the <w:checked/> child if it is checked. Otherwise it is unchecked. Mine happen to be in a table, but it could be in a body as well.
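A minimal sketch of the rule described above, written against the standard library's xml.etree.ElementTree so it runs without lxml (lxml elements accept the same find() and iter() calls). It goes slightly beyond what this thread shows, under two assumptions: that w:checked follows the usual WordprocessingML on/off convention, where w:val="0" means off, and that a box without w:checked falls back to w:default. The name legacy_checkbox_states is a hypothetical helper, not part of python-docx.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W_NS}


def _is_on(element):
    # Assumption: a missing w:val means "on"; "0", "false" and "off" mean "off".
    val = element.get('{%s}val' % W_NS)
    return val is None or val.lower() not in ('0', 'false', 'off')


def legacy_checkbox_states(document_xml):
    """Return one bool per legacy w:checkBox, in document order."""
    root = ET.fromstring(document_xml)
    states = []
    for box in root.iter('{%s}checkBox' % W_NS):
        checked = box.find('w:checked', NS)
        if checked is not None:
            states.append(_is_on(checked))
        else:
            # Assumption: without w:checked the default state applies.
            default = box.find('w:default', NS)
            states.append(default is not None and _is_on(default))
    return states
```

Because iter() searches all descendants, it does not matter whether the checkbox sits in the body or inside a table cell.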

Another possible enhancement would be a way to get the path to an element, so it can be passed to lxml when lxml is better suited to the job. It would still be nice to be able to start from these objects, though; for example, Table.raw could return the XML for the table and all its child elements.

scanny commented 7 years ago

Getting a commit on a new feature starts with writing an enhancement proposal, aka. an "analysis page". This is a recent one, although they're not all that long: http://python-docx.readthedocs.io/en/latest/dev/analysis/features/header.html

It also requires acceptance and unit tests in addition to the code.

We'd need to know the ancestors of the w:checkBox element to understand where it might end up in the API.

python-docx can do a lot of the heavy lifting for you on this sort of thing. This is roughly equivalent to your 25 lines or so:

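from docx import Document
from docx.oxml.ns import qn

document = Document('example.docx')
doc_elm = document._element
checkBoxes = doc_elm.xpath('.//w:checkBox')
for checkBox in checkBoxes:
    # w:checked is present only when the box is checked (see the XML above)
    is_checked = checkBox.find(qn('w:checked')) is not None
    print('checkBox value is %s' % is_checked)

jdell64 commented 7 years ago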

Ok cool. So the only issue then would be marrying the checkbox to a string value somewhere. Do these nodes have a .children .parent or .siblings property?

scanny commented 7 years ago

Each element is a subclass of lxml.etree._Element, so all the members on that class are available to you: http://lxml.de/3.7/api/index.html
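One way to marry each checkbox to a string value, sketched here with the standard library rather than python-docx: walk the paragraphs and, for each one that contains a w:checkBox, join its w:t runs as the label. This assumes the label sits in the same paragraph as the checkbox, which depends on how the form was laid out, and labelled_checkboxes is a hypothetical helper name. On python-docx elements, lxml members such as getparent(), getnext() and iterancestors() allow the same walk starting from the checkbox instead.

```python
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'


def labelled_checkboxes(document_xml):
    """Return (paragraph_text, is_checked) for each paragraph holding a checkbox."""
    root = ET.fromstring(document_xml)
    results = []
    for paragraph in root.iter('{%s}p' % W_NS):
        # Only the first checkbox in a paragraph is considered in this sketch.
        box = next(paragraph.iter('{%s}checkBox' % W_NS), None)
        if box is None:
            continue
        # Per the thread: w:checked is present only when the box is checked.
        is_checked = box.find('{%s}checked' % W_NS) is not None
        text = ''.join(t.text or '' for t in paragraph.iter('{%s}t' % W_NS))
        results.append((text.strip(), is_checked))
    return results
```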

I might start by understanding the ancestors a bit better, like maybe printing out the XML for a paragraph that contains one:

from docx import Document

document = Document('example.docx')
for paragraph in document.paragraphs:
    p = paragraph._element
    checkBoxes = p.xpath('.//w:checkBox')
    if checkBoxes:
        print(p.xml)
        break

wailoktam commented 6 years ago

Hi, Scanny,

Thanks for your great work. I tried the checkbox code, but I noticed that my document.xml does not use w:checkBox. Instead, it uses w14:checkbox. I tried p.xpath('.//w14:checkbox'), but it fails with:

  File "/home/wailoktam/.local/lib/python3.5/site-packages/docx/oxml/xmlchemy.py", line 751, in xpath
    xpath_str, namespaces=nsmap
  File "src/lxml/etree.pyx", line 1577, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Undefined namespace prefix

I then tried:

checkBoxes = p.xpath('.//w14:checkbox', namespaces={'w14':'http://schemas.microsoft.com/office/word/2010/wordml'})

But then it complains about the unexpected namespaces keyword argument.

What have I done wrong?

Many thanks in advance.

ianjcraig commented 5 years ago

@wailoktam did you ever get a resolution to this? I'm solving the same problem, and I come to exactly the same conclusion you do in terms of how to adapt Scanny's code.

wailoktam commented 5 years ago

This is what I did so that python-docx can handle the w14:checkbox tag. The numbers (21xx) are line numbers from the history command:

cd python3.5/site-packages/docx/oxml
2133 cd .local/lib/python3.5/site-packages/docx/oxml
2134 ls
2135 cat ns.py

Then add this entry to nsmap in ns.py: 'w14': ('http://schemas.microsoft.com/office/word/2010/wordml')

Hope it helps.
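An alternative that avoids editing the installed package, offered as a sketch rather than as python-docx's intended API: Clark notation ('{namespace-uri}tag') is resolved without any prefix map, so iter() can find w14:checkbox elements even though the w14 prefix is unknown to nsmap. The example below uses the standard library so it runs on its own; lxml elements, and therefore paragraph._element in python-docx, accept the same iter() call. The name find_w14_checkboxes is hypothetical.

```python
import xml.etree.ElementTree as ET

W14_NS = 'http://schemas.microsoft.com/office/word/2010/wordml'


def find_w14_checkboxes(element):
    """Return all w14:checkbox descendants, using Clark notation instead of a prefix."""
    return list(element.iter('{%s}checkbox' % W14_NS))


# Tiny stand-in for a paragraph element that holds one checkbox content control.
sample = ET.fromstring(
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" '
    'xmlns:w14="%s"><w:sdt><w:sdtPr><w14:checkbox>'
    '<w14:checked w14:val="1"/></w14:checkbox></w:sdtPr></w:sdt></w:p>' % W14_NS
)
print(len(find_w14_checkboxes(sample)))
```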

ianjcraig commented 5 years ago

@wailoktam Thanks!

ianjcraig commented 5 years ago

Just wanted to leave some comments, now that I have my program working, for anyone else who ends up here.

1) For me lxml was the best module to make this work. I can imagine that docx works just as well, maybe better, but it seemed like I had to get to know the lxml module anyway, so I just stayed in that.

2) To interact with the "w14" tags, I needed to modify the XML file prior to creating the XML tree using lxml. I used the following commands to read the file, modify the XML, and create the XML tree. The key line of code that allows lxml to interact with the w14 tags is the one that removes the attribute mc:Ignorable="w14 w15 wp14".

import io
import os
import tempfile
from zipfile import ZipFile
from lxml import etree

# Excerpt from a larger class; self is that class's instance.
self.tmp_dir = tempfile.mkdtemp()
with ZipFile(self.template_file_name) as myzip:
    self.filenames = myzip.namelist()
    myzip.extractall(self.tmp_dir)
# os.path.join avoids the backslash escapes of a hand-built Windows path
file_name = os.path.join(self.tmp_dir, 'word', 'document.xml')
with open(file_name, 'r', encoding='utf8') as f:
    xml_content = f.read()
xml_content = xml_content.replace('mc:Ignorable="w14 w15 wp14"', '').encode('utf-8')
parser = etree.XMLParser(recover=True, encoding='utf-8')
self.tree = etree.parse(io.BytesIO(xml_content), parser)
self.root = self.tree.getroot()

  1. The tags identified with the w14 prefix are only part of the XML tags associated with these controls. The check boxes start with <w:sdt>, which is a standard tag for a custom function in the OOXML tag reference. It stands for "structured document tag".
  2. There are 2 methods that can be used to read the check boxes, as their value is stored in 2 locations. For one of them you don't need to worry about reading tags prefixed with "w14". The value is stored in the following 2 locations (using XPath notation; you'll likely want to learn how XPath works):
     a) //w:sdt/w:sdtPr/w14:checkbox/w14:checked[@w14:val="1"]
     b) //w:sdt/w:sdtContent/w:r/w:t
     For further clarity, here is example XML of the sdt element:
    <w:sdt>
    <w:sdtPr>
        <w:rPr>
            <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
            <w:sz w:val="22"/>
            <w:szCs w:val="22"/>
        </w:rPr>
        <w:alias w:val="user defined checkbox name"/>
        <w:tag w:val="user_defined_tag_name"/>
        <w:id w:val="163048189"/>
        <w:lock w:val="sdtLocked"/>
        <w14:checkbox>
            <w14:checked w14:val="1"/>
            <w14:checkedState w14:val="2612" w14:font="MS Gothic"/>
            <w14:uncheckedState w14:val="2610" w14:font="MS Gothic"/>
        </w14:checkbox>
    </w:sdtPr>
    <w:sdtEndPr/>
    <w:sdtContent>
        <w:r>
            <w:rPr>
                <w:rFonts w:ascii="MS Gothic" w:eastAsia="MS Gothic" w:hAnsi="MS Gothic" w:cs="Arial" w:hint="eastAsia"/>
                <w:sz w:val="22"/>
                <w:szCs w:val="22"/>
            </w:rPr>
            <w:t>☒</w:t>
        </w:r>
    </w:sdtContent>
    </w:sdt>
  3. This is likely obvious, but if you want to control the checkbox you need to modify both locations listed above. As one would expect, 1 equals checked ('☒'), 0 equals unchecked ('☐').
  4. If you are modifying documents, I found the simplest method was to unzip everything to a temporary folder, modify the document.xml file, and then rezip everything together. The code above takes care of the unzipping into a temporary folder. If you're just reading the Word document, the code above by @jdell64 to access the document.xml file works great (thanks jdell64, that's what I started with). The code for rezipping and then cleaning up the temporary files is as follows:
        import os
        import shutil

        # Write bytes so the output does not depend on the platform's default text encoding
        with open(os.path.join(self.tmp_dir, 'word', 'document.xml'), 'wb') as f:
            f.write(etree.tostring(self.tree, pretty_print=True, xml_declaration=True,
                                   encoding='UTF-8', standalone=True))

        # Create the new zip file and add all the files into the archive
        zip_copy_filename = self.output_file_name
        with ZipFile(zip_copy_filename, "w") as docx_file:
            for filename in self.filenames:
                docx_file.write(os.path.join(self.tmp_dir, filename), filename)

        # Clean up the temp dir
        shutil.rmtree(self.tmp_dir)
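The unzip, modify, rezip cycle can also be done without a temporary folder. The sketch below is one possible variant, not the method used above: it copies every member of the archive and passes the bytes of a single part through a caller-supplied function. The names rewrite_docx_part and transform are hypothetical, and the transform is expected to return the complete new bytes for that part (for example the output of etree.tostring).

```python
import zipfile


def rewrite_docx_part(src, dst, transform, part='word/document.xml'):
    """Copy a .docx from src to dst, passing one part's bytes through transform().

    src and dst may be file paths or binary file-like objects.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, 'w', zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == part:
                data = transform(data)
            # Reusing the ZipInfo keeps each member's name and timestamp.
            zout.writestr(item, data)
```

A call might look like rewrite_docx_part('in.docx', 'out.docx', my_transform), where my_transform parses the bytes, edits the tree and serializes it again.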

I hope the above is helpful to someone. I apologize if the code doesn't simply work when copied and pasted. I inferred the necessary code from my own, trying to eliminate unnecessary detail. Let me know if it doesn't and I'll see if I can figure out where any error is.
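For the read-only case, the two storage locations listed earlier can be read side by side. This is a minimal sketch using the standard library's xml.etree.ElementTree, which treats mc:Ignorable as an ordinary attribute and so parses the document without that attribute being removed; the same find() paths work on lxml elements when the namespaces mapping is passed. The name sdt_checkbox_states is hypothetical, and reading only w14:val="1" as checked follows the example XML in this thread rather than the full specification.

```python
import xml.etree.ElementTree as ET

NS = {
    'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main',
    'w14': 'http://schemas.microsoft.com/office/word/2010/wordml',
}
W_VAL = '{%s}val' % NS['w']
W14_VAL = '{%s}val' % NS['w14']


def sdt_checkbox_states(document_xml):
    """Return (tag_name, is_checked, glyph) for each checkbox content control."""
    root = ET.fromstring(document_xml)
    results = []
    for sdt in root.iter('{%s}sdt' % NS['w']):
        checked = sdt.find('w:sdtPr/w14:checkbox/w14:checked', NS)
        if checked is None:
            continue  # not a checkbox content control, or no stored state
        tag = sdt.find('w:sdtPr/w:tag', NS)
        name = tag.get(W_VAL) if tag is not None else None
        # Location (a): the w14:val attribute; "1" means checked per the thread.
        is_checked = checked.get(W14_VAL) == '1'
        # Location (b): the glyph shown in the document text.
        glyph = sdt.findtext('w:sdtContent/w:r/w:t', default='', namespaces=NS)
        results.append((name, is_checked, glyph))
    return results
```

Comparing the flag with the glyph is a cheap consistency check, since a document edited by hand can end up with the two locations disagreeing.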

retsyo commented 2 years ago

My bad: I inserted the checkboxes from the ActiveX controls, but a simple checkbox should have been used.

I am leaving the following in place in case someone makes a similar mistake.

I read a simple docx in which I put 1 checkbox in the text and 2 checkboxes in table cells. However, I found that the XML does not match what has been discussed above. So there is no universal solution for different versions of DOCX, is there?

<w:control r:id="rId6" w:name="CheckBox2" w:shapeid="_x0000_i1026"/>
</w:object></w:r></w:p></w:tc><w:tc><w:tcPr><w:tcW w:w="1380" w:type="dxa"/></w:tcPr><w:p><w:r><w:object><v:shape id="_x0000_i1027" o:spt="201" alt="" type="#_x0000_t201" style="height:35.4pt;width:58.2pt;" o:ole="t" filled="f" o:preferrelative="t" stroked="f" coordsize="21600,21600"><v:path/><v:fill on="f" focussize="0,0"/><v:stroke on="f"/><v:imagedata r:id="rId9" o:title=""/><o:lock v:ext="edit" aspectratio="f"/><w10:wrap type="none"/><w10:anchorlock/></v:shape>

<w:control r:id="rId8" w:name="CheckBox3" w:shapeid="_x0000_i1027"/>
</w:object></w:r></w:p></w:tc></w:tr></w:tbl><w:p><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p><w:sectPr><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="851" w:footer="992" w:gutter="0"/><w:cols w:space="425" w:num=

[checkbox.docx](https://github.com/python-openxml/python-docx/files/7776019/checkbox.docx)

a-ledu commented 2 years ago

+1