python-openxml / python-docx

Create and modify Word documents with Python
MIT License

feature: add field #31

Open cpetz opened 10 years ago

cpetz commented 10 years ago

Feature Suggestion/Request

Support for inserting/adding field codes in a Word document. They are a handy feature for report-generation applications (originally intended for automated mail merges, I believe).

In Office, they make it easy to add dynamic features to a document without getting your fingers all slimy with macros/VBA (although, if designed properly, they are accessible from VBA using custom DocProperties and clever references). You work with them using text "markups" that form a restricted scripting framework (hit Ctrl+F9 to get started within Word).

They can easily work with DocProperties, named elements (tables, lists, headings), and external text documents that will drive the dynamic content. You can still use styles to drive a document, but with field codes you can adjust/apply styles conditionally. But field codes are admittedly awkward to work with (odd syntax, updating, poor UI tools). That's where python-docx needs to come in.

I think with a little love, working with field codes could actually be neat, organized, readable, and very functional. They would slip in just like your other elements...

document.add_fcode("ASK", "chap1_caption", "Type in caption for Chapter 1")

Nesting field codes arbitrarily would be an important requirement.

Instead of making custom routines in python-docx that help a user hack together structured portions of a document or specific output patterns, let us use field codes to define those structures explicitly. And then when we export back into an Office-driven workflow, all of the glue is still intact and fully functional.

The main weakness of field codes is that they are typically hidden/disabled by default in Word, but in that regard, it's not much different than shipping a document with embedded macros: it is understood that you know how to interact with the extended features.

cpetz commented 10 years ago

I've noticed you've not given up on the idea of field codes, and I was wondering if there are any particular reasons they have been getting pushed back, and if there is anything I can do to help.

It seems like the XML for some of the simpler field codes is... quite simple. I imagine some of the more advanced codes get incrementally more difficult, especially when used recursively.

< w : fldSimple w:instr=" DOCPROPERTY "customer_name" \* MERGEFORMAT " > (The code wouldn't show correctly in the comment, so I had to pollute it with a few whitespaces.)

And to be clear, my feature request was NOT FOR YOU to evaluate these codes, only to help me inject them. (Look how cumbersome they are to type; honestly, it's easier in XML, because you have to wrap them in a special hidden character \ when you're in Word.)

Help me strategically place these in my Word docs and you will have made me smile very large. I have no intention of you attempting to go through and evaluate them for me when I save it or something. If there is any grunt work regarding field codes that is slowing you down, tell me, and I will see if I can help absorb some of it.
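
To make the "inject, don't evaluate" idea concrete, here is a minimal sketch of building such a `w:fldSimple` element. It uses the standard library's `xml.etree.ElementTree` as a stand-in for python-docx/lxml, and the helper name `simple_field` is hypothetical; the padding spaces around the instruction simply mirror the examples quoted in this thread.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, conventionally bound to the "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def simple_field(instr):
    """Return a <w:fldSimple> element carrying the given field instruction.

    Hypothetical helper: it only builds the element, it evaluates nothing.
    """
    fld = ET.Element("{%s}fldSimple" % W_NS)
    # Pad with one space on each side, as in the examples in this thread.
    fld.set("{%s}instr" % W_NS, " %s " % instr.strip())
    return fld


# Raw string so the backslash of the \* switch is kept literally.
fld = simple_field(r'DOCPROPERTY "customer_name" \* MERGEFORMAT')
xml = ET.tostring(fld, encoding="unicode")
print(xml)
```

The serializer takes care of escaping the inner double quotes in the attribute value, which is the part that is tedious to type by hand.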

scanny commented 10 years ago

@cpetz The first step is to understand the feature well enough to be able to design the series of steps to implement it fully. That means documenting an analysis of the features and the XML representation well enough to ground those design choices. You can see examples of analyses for other features here: http://python-docx.readthedocs.org/en/latest/dev/analysis/index.html. The reStructuredText source is in the source code under the /docs/dev/analysis/features directory.

There are several places to look for the inputs to that analysis: the behavior of the Word app itself, the Open XML spec, the XML schema (which is particularly important), and the MS API are toward the top of that list.

This particular feature is tricky in the general case because it can involve a fairly arbitrary range of text, either block elements, inline elements, or both, as the body of the field. The MS API seems to use a Range object to specify this. There is no counterpart to the Range object yet in python-docx, so it's not clear yet how fields would be implemented in the general case.

If you want to move it along, the analysis is the place to start. Once that gets far enough along to provide a solid foundation for an API proposal (sometimes called a Candidate Protocol in the analysis pages), then things can move toward implementation. Seems like the analysis pages usually take me about an hour. This one might take a little longer because of the complexities. I've kind of developed a sense for where to look for things, so it might take someone beginning a bit longer.

cpetz commented 10 years ago

Can't promise anything, but I will do some digging and see if I come up with anything worthwhile. Thanks for the direction.

scanny commented 10 years ago

One idea that occurred to me after reflecting on this a bit is that "complex" fields might be made a lot easier to insert if the "boundary" begin and end elements were inserted at the same time as the bits that they bounded. I'm not sure how that would work, and it might not work in all cases, but it might be a way to add capability beyond what could be done with a <w:fldSimple> element. Just a thought as you're poking around. This is one of the things the XML Specimens part of the analysis page is good for judging :)
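
One way to picture "inserting the boundaries together with what they bound" is a helper that produces the whole run sequence of a complex field in one call. This is only a sketch under assumptions: it builds the elements with the standard library's `xml.etree.ElementTree` rather than python-docx, and `complex_field_runs` is a made-up name, not an existing API.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def w(tag):
    """Clark-notation name for a tag in the WordprocessingML namespace."""
    return "{%s}%s" % (W_NS, tag)


def complex_field_runs(instr, result_text=""):
    """Build the begin / instruction / separate / result / end runs together."""
    runs = []

    def run_with(child):
        # Each piece gets its own <w:r> so the caller can splice them in order.
        r = ET.Element(w("r"))
        r.append(child)
        runs.append(r)

    def fld_char(kind):
        el = ET.Element(w("fldChar"))
        el.set(w("fldCharType"), kind)
        return el

    run_with(fld_char("begin"))
    instr_el = ET.Element(w("instrText"))
    instr_el.set("{%s}space" % XML_NS, "preserve")  # keep the padding spaces
    instr_el.text = " %s " % instr.strip()
    run_with(instr_el)
    run_with(fld_char("separate"))
    text_el = ET.Element(w("t"))
    text_el.text = result_text  # cached result shown until fields are updated
    run_with(text_el)
    run_with(fld_char("end"))
    return runs


runs = complex_field_runs(r"SEQ Figure \* ARABIC", result_text="1")
kinds = [r[0].get(w("fldCharType")) for r in runs if r[0].tag == w("fldChar")]
print(kinds)
```

Because begin and end are created in the same call, a caller cannot end up with an unbalanced field; whether this covers fields whose body spans several paragraphs is left open, as in the comment above.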

cpetz commented 10 years ago

@scanny One question, and a few followup thoughts (I have not gotten into Open XML, merely digging through MSDN for ~official docs):

Are we looking for COMPLETE support for ALL field codes, or is it acceptable to only work with a subset? (Many field types have been replaced with more modern tools, are probably redundant several times over, and could be excessively complex/undocumented.)


It seems that many simple Field types can be represented with basic XML tags:

(from Word2007 XML) <w:fldXXX attr1="value1" attr2="document relative text"></w:fldXXX>

( According to 2003 MSDN ).

I don't believe many Fields need to utilize a Range, and even from VBA the Range for these simple cases seems to be the current insertion point. These would be ideal starting candidates for analysis and implementation.

Following the previous, glancing at the MSDN link,

There is something to be learned from the VBA handling of Fields, but honestly, I don't want to read VBA. If I get time, I might see if Visual Studio will give me anything useful if I go around inspecting certain Field-related objects.

cpetz commented 10 years ago

Also have a look at the field special character. I believe with this element, you would run it inline like other text, and may be able to stay out of XML schema (if I am understanding its usage correctly), as if I were manually typing everything through Word. This gives easy access to the special bracket character that Fields are defined within.

I will test this further to be sure, but it could mean that python-docx provides little more than a string templating interface that helps handle some of the more awkward/repetitive syntax, leaving all other functional responsibility to the user (poorly written field code tends to fail silently in many simple cases).
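
A string templating interface of that kind could be as small as the sketch below. The function names `quote_arg` and `field_instr` are hypothetical, and the quoting rule (wrap arguments that contain whitespace in double quotes) is a simplifying assumption; embedded quotes and nested fields are not handled.

```python
def quote_arg(value):
    """Wrap an argument in double quotes if it contains whitespace."""
    text = str(value)
    return '"%s"' % text if any(ch.isspace() for ch in text) else text


def field_instr(name, *args, switches=()):
    """Assemble a field instruction string from a name, arguments and switches."""
    parts = [name.upper()]
    parts.extend(quote_arg(a) for a in args)
    parts.extend(switches)  # switches are passed through untouched
    return " ".join(parts)


ask = field_instr("ASK", "chap1_caption", "Type in caption for Chapter 1")
seq = field_instr("SEQ", "Figure", switches=(r"\* ARABIC",))
print(ask)
print(seq)
```

All validation is still left to the user, in line with the comment above: the helper will happily build an instruction Word cannot evaluate.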

scanny commented 10 years ago

It would be okay to start with a subset. The implementation would need to be complete enough to be useful to some folks and provide a foundation for future implementation. It wouldn't necessarily have to implement the general case, but its API should be consistent with that for the general case. In other words, the API shouldn't need to change to accommodate the general case. Needing to extend the API would be okay though.

Be careful around Word 2003 and by extension MSDN 2003. That version of XML for Word is completely different from Word 2007 and later. The links you provided are for that old WordML XML dialect.

The XML Schema is really the critical link for the analysis. You can find it in the code base here: https://github.com/python-openxml/python-docx/blob/master/ref/xsd/wml.xsd#L1231

I usually load it up in my editor so I can use search to move back and forth. The other critical tool is opc-diag to examine the XML Word produces for simple examples and minor changes. Those end up in the XML Specimens part of the analysis page and basically demonstrate what you're trying to achieve with a feature.
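
When opc-diag is not at hand, a rough look at the fields in a package is possible with the standard library alone. The sketch below is an assumption-laden stand-in, not a replacement for opc-diag: `list_field_instructions` is a made-up helper, and the in-memory package contains only `word/document.xml`, so it is not a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_field_instructions(docx_file):
    """Return instruction strings of simple and complex fields in a .docx."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    found = []
    for el in root.iter():
        if el.tag == "{%s}fldSimple" % W_NS:
            found.append(el.get("{%s}instr" % W_NS))
        elif el.tag == "{%s}instrText" % W_NS:
            # A complex field may split its instruction over several of these.
            found.append(el.text)
    return found


# Stand-in package so the sketch runs without a document produced by Word.
document_xml = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    '<w:fldSimple w:instr=" SEQ Figure \\* ARABIC "/>'
    '</w:p></w:body></w:document>' % W_NS
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", document_xml)
buffer.seek(0)

instructions = list_field_instructions(buffer)
print(instructions)
```

With a real file, passing its path instead of the buffer should work the same way, since `zipfile.ZipFile` accepts either.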

DBGVA commented 10 years ago

If I am correct, adding field codes would also be very useful for creating automatic figure numbering. In this case Word uses the following in the XML file: w:fldSimple w:instr=" SEQ Figure \* ARABIC ". This corresponds to the w:fldSimple that scanny mentioned on the 9th of August.

scanny commented 10 years ago

@DBGVA, in response to your offline email:

As I am moving along, I am trying to automatically insert some figures that would get a figure number.

Looking at the code that Word generates when you insert a figure caption, I found the following

How can I set this up in python?

The general approach for this sort of thing is to get a reference to an 'lxml' element as close as you can to where you need to insert this element, then use native lxml calls to do the needful. http://lxml.de/api/index.html (navigate to lxml.etree _Element object API)

In this case I suppose it would be after a paragraph that contained an image. So the code would look something like this:

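from docx import Document
from docx.oxml import parse_xml
from docx.oxml.ns import nsdecls

document = Document()
paragraph = document.add_paragraph()
run = paragraph.add_run()
run.add_picture(...)  # supply your image path here
# caption paragraph placed after the paragraph that holds the image
caption = document.add_paragraph('Figure ')
p = caption._p  # this is the actual lxml element for a paragraph
# raw string so the backslash of the \* switch is kept literally
fld_xml = r'<w:fldSimple %s w:instr=" SEQ Figure \* ARABIC "/>' % nsdecls('w')
fldSimple = parse_xml(fld_xml)
p.append(fldSimple)  # w:fldSimple is only valid inside a w:p, so append it there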

I don't know the specific XML to be added in this case, I'm just using the example you provided, closing the element to make it stand-alone. The best bet is to use opc-diag to inspect a small-as-possible .docx example file you make using Word that uses the feature you're looking for.

Let me know if you need more to go on :)

DBGVA commented 10 years ago

Dear Steve,

My first attempt was unsuccessful. I then installed opc-diag using pip and, after getting everything straight, I tried to run this module, but on my Mac I keep getting

bash: opc: command not found

Checking my path in .profile and with echo $PATH, everything looks fine:

/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/opt/X11/bin

Here is the result of the pip install

DB:~ danielbertrand$ pip install opc-diag --user
Requirement already satisfied (use --upgrade to upgrade): opc-diag in ./Library/Python/2.7/lib/python/site-packages
Requirement already satisfied (use --upgrade to upgrade): lxml>=3.0 in ./Library/Python/2.7/lib/python/site-packages/lxml-3.4.0-py2.7-macosx-10.8-intel.egg (from opc-diag)
Cleaning up...
DB:~ danielbertrand$

Any suggestions?

Prof Daniel Bertrand, Medical Faculty, Geneva, Switzerland

daniel.bertrand@unige.ch


DBGVA commented 10 years ago

Dear Steve,

Working somewhat differently, I extracted the XML from the Word document using unzip, and here is the difference between a blank document and a document having only one figure caption.

Now I need to understand how to implement this in the XML.

Thanks in advance for reading this and for your help

Best, Daniel


scanny commented 10 years ago

I would uninstall opc-diag and reinstall:

$ pip uninstall opc-diag
...
$ pip install opc-diag

The actual executable is 'opc', 395 bytes long, and should be installed into /usr/local/bin by pip. Your PATH looks right for finding that if it was there. Possibly it's there but not marked executable.

From the install messages it looks like pip found it already installed. Possibly the executable 'hook' file (opc) got deleted somehow earlier.

If it's still not working after reinstalling, this is what that file looks like inside:

#!/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
# EASY-INSTALL-ENTRY-SCRIPT: 'opc-diag==1.0.0','console_scripts','opc'
__requires__ = 'opc-diag==1.0.0'
import sys
from pkg_resources import load_entry_point

if __name__ == '__main__':
    sys.exit(
        load_entry_point('opc-diag==1.0.0', 'console_scripts', 'opc')()
    )

A certain amount of this is generated depending on your Python environment, in particular the shebang part on the first line. You might need to modify that to suit if it comes to this.
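
The "not on PATH" versus "present but not executable" distinction can be checked from Python itself. This is a minimal sketch assuming Python 3 (`shutil.which` does not exist in the Python 2.7 used above); the helper names are made up, and `/usr/local/bin` is just the install location mentioned in this comment.

```python
import os
import shutil


def locate_script(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


def describe(name, directory="/usr/local/bin"):
    """Say whether a console script is on PATH, present but not executable, or absent."""
    path = locate_script(name)
    if path:
        return "found: %s" % path
    candidate = os.path.join(directory, name)
    if os.path.exists(candidate) and not os.access(candidate, os.X_OK):
        return "present but not executable: %s" % candidate
    return "not on PATH"


print(describe("opc"))
```

If the script turns out to live under a per-user directory (as a `--user` install would suggest), adding that directory to PATH is another option besides reinstalling.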

Let us know how you go :)

scanny commented 10 years ago

I'm not seeing the XML in your message. If you added it as an attachment, those don't come through on GitHub here.

DBGVA commented 10 years ago

Dear Steve,

Thanks for your tips. It took me a bit of time, but using sudo pip install opc-diag I was finally able to get it properly installed.

Below is the difference I found between a blank file and a file where I have only inserted one figure caption.

Reading this XML, I think that the figure caption is inserted with somewhat more complex data, such as:

Let me know if I am correct.

Best and thanks for your help

Daniel

--- Blank/docProps/app.xml
+++ Figure1/docProps/app.xml
@@ -6,16 +6,16 @@

--- Blank/docProps/core.xml
+++ Figure1/docProps/core.xml
@@ -12,7 +12,7 @@

--- Blank/word/document.xml
+++ Figure1/word/document.xml
@@ -20,7 +20,21 @@
[XML markup lost; the added lines were a w:p containing a w:pPr, a run with the text "Figure ", and a w:fldSimple wrapping a run with the text "1"]

--- Blank/word/fontTable.xml
+++ Figure1/word/fontTable.xml
@@ -11,7 +11,7 @@

--- Blank/word/settings.xml
+++ Figure1/word/settings.xml
@@ -13,6 +13,7 @@
@@ -38,6 +39,7 @@

--- Blank/word/styles.xml
+++ Figure1/word/styles.xml
@@ -299,4 +299,23 @@
[XML markup lost; the added lines were one w:style with w:pPr and w:rPr children]

--- Blank/word/stylesWithEffects.xml
+++ Figure1/word/stylesWithEffects.xml
@@ -312,4 +312,23 @@
[XML markup lost; the added lines were one w:style with w:pPr and w:rPr children]

quietlyconfident commented 10 years ago

This is potentially related to #99.

scanny commented 10 years ago

@DBGVA: yes, that seems about right. If you're willing to run the 'refresh fields' command in Word after generating, you can probably leave out the innards and just place the <w:fldSimple .../> element in there. I expect Word will take care of the needful as far as placing the number text within it.

If you do need to generate it complete, you'll need to keep track of the figure numbers yourself to insert them.
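
Keeping track of the numbers yourself could look like the sketch below: a small counter that hands out `w:fldSimple` elements whose cached result text is already filled in. It is an illustration built with the standard library's `xml.etree.ElementTree`, not python-docx API, and the class name `FigureCounter` is hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


class FigureCounter:
    """Hand out SEQ fields whose cached result text is already numbered."""

    def __init__(self):
        self.count = 0

    def next_field(self):
        self.count += 1
        fld = ET.Element("{%s}fldSimple" % W_NS)
        fld.set("{%s}instr" % W_NS, r" SEQ Figure \* ARABIC ")
        # The run inside the field is the text shown before any refresh.
        run = ET.SubElement(fld, "{%s}r" % W_NS)
        text = ET.SubElement(run, "{%s}t" % W_NS)
        text.text = str(self.count)
        return fld


counter = FigureCounter()
first = counter.next_field()
second = counter.next_field()
numbers = [f.find("{%s}r/{%s}t" % (W_NS, W_NS)).text for f in (first, second)]
print(numbers)
```

If a reader later refreshes the fields in Word, Word would presumably overwrite the cached text with its own numbering, so the two approaches should not conflict.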

DBGVA commented 10 years ago

Dear Steve,

Thanks, I'll try to move ahead and keep you informed.

Best, Daniel


cpetz commented 9 years ago

I know I have been off the topic for a while, but I wanted to make a comment.

My original yearning for field codes in python-docx was not to improve content generation straight out of Python. Rather, it was to script-fu our way through this reasonably obnoxious 'field code meta-language' feature.

Field codes are non-VBA magic for when you have to take Python out of the workflow. A great example is a company's shared template library. You can plant some hidden magic in there that any of your colleagues can activate (by changing one property and then refreshing fields), all without needing the latest security updates or having to remember what VBA stands for.

I think the whole field code refreshing bit is unavoidable, arguably desirable at times, and handling it in python-docx would add some serious complication to an already complicated feature. With python-docx, we use Python to generate Word files dynamically. But with refreshable field codes, those same files can still be highly dynamic long after you have shared them with everyone in the Microsoft Office world (none of whom knew that Python and ninja could even be used in the same sentence).

raphaelvalentin commented 6 years ago

Dear Author, dear all, this is a little late, but philosophy aside, for anyone who may be interested, here is my snippet:

from docx.oxml import OxmlElement
from docx.oxml.ns import qn

def _add_number_range(run, name):
    """Add a number range (SEQ) field to a run."""
    fldChar = OxmlElement('w:fldChar')  # creates a new element
    fldChar.set(qn('w:fldCharType'), 'begin')  # sets attribute on element
    instrText = OxmlElement('w:instrText')
    instrText.set(qn('xml:space'), 'preserve')  # sets attribute on element
    instrText.text = r'SEQ %s \* ARABIC' % name  # raw string keeps the backslash

    fldChar2 = OxmlElement('w:fldChar')
    fldChar2.set(qn('w:fldCharType'), 'separate')
    # placeholder result text; it goes in the run, not inside w:fldChar
    fldText = OxmlElement('w:t')
    fldText.text = "Right-click to update field."

    fldChar4 = OxmlElement('w:fldChar')
    fldChar4.set(qn('w:fldCharType'), 'end')

    r_element = run._r
    r_element.append(fldChar)
    r_element.append(instrText)
    r_element.append(fldChar2)
    r_element.append(fldText)
    r_element.append(fldChar4)

If you replace the SEQ instruction on the instrText.text line

with instrText.text = r'TOC \o "1-%d" \h \z \u' % maxlevel # change 1-3 depending on heading levels you need

you can add a TOC.

Hopefully this helps someone!

Raphael

kasyanovse commented 2 years ago

+1

radeeven commented 2 years ago

+1

Abd-Allah-144 commented 2 years ago

by instrText.text = r'TOC \o "1-%d" \h \z \u' % maxlevel # change 1-3 depending on heading levels you need

You can add a TOC.

Hopefully, it could help some guys !

Raphael---

It helped me, thank you a lot!

I changed the instrText.text line to instrText.text = 'PAGEREF Bookmark_1'

so I can get the page number of a saved bookmark, and now I can make my customized TOC easily.

(I was not able to customize a TOC, because LibreOffice didn't save my settings.)
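
For readers wondering what a PAGEREF-based entry needs, the sketch below builds the two pieces involved: a bookmark around the target and a field that refers to it by name. It uses the standard library's `xml.etree.ElementTree` instead of python-docx, and `bookmark_pair` and `pageref_field` are hypothetical helper names; the bookmark name is the one from the comment above.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def w(tag):
    """Clark-notation name for a tag in the WordprocessingML namespace."""
    return "{%s}%s" % (W_NS, tag)


def bookmark_pair(name, bookmark_id):
    """Return (bookmarkStart, bookmarkEnd) elements sharing one id."""
    start = ET.Element(w("bookmarkStart"))
    start.set(w("id"), str(bookmark_id))
    start.set(w("name"), name)
    end = ET.Element(w("bookmarkEnd"))
    end.set(w("id"), str(bookmark_id))
    return start, end


def pageref_field(name):
    """Return a simple field that refers to the bookmark's page number."""
    fld = ET.Element(w("fldSimple"))
    fld.set(w("instr"), " PAGEREF %s " % name)
    return fld


start, end = bookmark_pair("Bookmark_1", 0)
toc_entry = pageref_field("Bookmark_1")
print(ET.tostring(toc_entry, encoding="unicode"))
```

The start and end elements would be placed around the heading text inside its paragraph; the page number itself only appears once the fields are updated in a word processor.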

1krishnasharma commented 1 year ago

Hi all, I want to add figure captions numbered with the heading number and subheading number. I want them as a toggle field.

I found https://stackoverflow.com/a/54534731/16951049 but it is not working correctly: it gives output like 0.1 and increases only the number after the dot, whereas I need 1.1, 1.2, 2.1, 2.3, etc.

1 Heading 
   1.1 subheading
     Figure 1.1 caption of the figure
   1.2 subheading
     Figure 1.2 caption of the figure
2 heading 
   2.1 subheading 
     Figure 2.1 caption of the figure
 ..... so on 

I have used this code :-

from docx.oxml import OxmlElement
from docx.oxml.ns import qn

# these take self, so they presumably belong to a class that was not posted

def Figure(self, paragraph):
    run = paragraph.add_run()
    r = run._r
    fldChar = OxmlElement('w:fldChar')
    fldChar.set(qn('w:fldCharType'), 'begin')
    r.append(fldChar)
    instrText = OxmlElement('w:instrText')
    instrText.text = r' SEQ Figure \* ARABIC'
    r.append(instrText)
    fldChar = OxmlElement('w:fldChar')
    fldChar.set(qn('w:fldCharType'), 'end')
    r.append(fldChar)

def Table(self, paragraph, level=0):
    run = paragraph.add_run()
    r = run._r
    fldChar = OxmlElement('w:fldChar')
    fldChar.set(qn('w:fldCharType'), 'begin')
    r.append(fldChar)
    instrText = OxmlElement('w:instrText')
    instrText.text = r' SEQ Table \* ARABIC'
    r.append(instrText)
    fldChar = OxmlElement('w:fldChar')
    fldChar.set(qn('w:fldCharType'), 'end')
    r.append(fldChar)

def section(self, paragraph):
    run = paragraph.add_run()
    r = run._r
    fldChar = OxmlElement('w:fldChar')
    fldChar.set(qn('w:fldCharType'), 'begin')
    r.append(fldChar)
    instrText = OxmlElement('w:instrText')
    # raw string: '\s' in a normal string is an invalid escape sequence
    instrText.text = r' STYLEREF 1 \s '
    r.append(instrText)
    fldChar = OxmlElement('w:fldChar')
    fldChar.set(qn('w:fldCharType'), 'end')
    r.append(fldChar)

@scanny @DBGVA @Abd-Allah-144
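
For reference, a caption of the form "Figure 1.1" is usually made of two fields separated by a literal dot. The sketch below only assembles the pieces and is untested against Word; `chapter_caption_parts` is a hypothetical helper. Two assumptions are worth checking: that `STYLEREF 1 \s` yields 0 when the Heading 1 paragraphs carry no automatic numbering (which would explain the 0.1 output), and that the `\s 1` switch on SEQ restarts the sequence at each Heading 1.

```python
def chapter_caption_parts(label="Figure", heading_level=1):
    """Return (kind, value) pairs: literal text and field instructions, in order."""
    return [
        ("text", "%s " % label),
        # number of the nearest preceding heading of the given level
        ("field", r"STYLEREF %d \s" % heading_level),
        ("text", "."),
        # sequence that restarts at each heading of the given level
        ("field", r"SEQ %s \* ARABIC \s %d" % (label, heading_level)),
    ]


parts = chapter_caption_parts()
for kind, value in parts:
    print(kind, repr(value))
```

Each "field" entry would be wrapped in its own begin/instrText/end sequence, as in the methods above, and each "text" entry added as an ordinary run.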