python-openxml / python-docx

Create and modify Word documents with Python
MIT License
4.61k stars 1.13k forks source link

Feature: Add Bookmark #425

Open Benjamin-T opened 7 years ago

Benjamin-T commented 7 years ago

As multiple issues have stated before, it would be great if we the captions supported in python-docx. Previous issues already should temporary fixes to similar problems like the IndexedEntry (#137) and the figure caption in (#359). Also bookmarks have been investigated: (#109) As I understood from: feature analysis: section columns and Feature: Paragraph.add_hyperlink() (wip) Is that there are a few mandatory steps to take before a pull request is accepted:

Finally: Focus is key, one feature should only by a single functionality, to quote @scanny:

  • No code gets merged without its corresponding unit test.
  • No unit test gets written without a failing acceptance test.
  • Each commit is one atomic piece of functionality, generally a single property or method.

Now I've collected the available issues and pull request my question is, how to go from here. I would assume that we start at the beginning: Feature analysis

Adding the caption to a paragraph isn't directly the issue, the main goal is to create a feature which can be used together with a crossreference. Therefore the number of references or bookmarks that are created should be stored in an object. This object should be available to the caption method, in order to be support a sort of 'guessed figure/table number'. What I mean by this is best explained with an example:

The proposed solution in #359 works. Only he user first has to create a print view to force a fieldcode refresh to make the numbers visible. Therefore, it would be nice if the caption method could already place a "guessed" number. (Comparable as how Word behaves)

scanny commented 7 years ago

Hi @Benjamin-T, yes, creating an analysis document (aka. enhancement proposal) is the right first step, and that is independently committable whether you go on to implement or not.

To do that, you should fork this repo and create a new branch based on master called feature/captions (or whatever is descriptive and short, but starts with feature/...). Add the analysis document and create a pull request whenever you've gotten it far enough along. I'm happy to provide feedback.

Benjamin-T commented 7 years ago

Thanks for the reply @scanny Upon inspection of the code, I think there is a more fundamental feature required in order to make the captions and crossreferences work. They all seem to use a bookmark.

Therefore I would propose first to develop the bookmark code, and subsequently implement the captions and references.

Question: How do I add such a xml element? Should it be placed in the

class CT_R(BaseOxmlElement):

object?

Interesting link: openxml standard page 1032 - Bookmarks Quote of above document about the bookmarks:

Bookmarks

Within a WordprocessingML document, bookmarks refer to arbitrary regions of content which are bounded and have a unique name associated with them. Because bookmarks are a legacy word processing function which predates the concepts of XML and wellformedness, they can start and end at any location within a document's contents and therefore must use the "cross-structure" annotation format described in §2.13.2. [Example: Consider the following WordprocessingML markup for two paragraphs, each reading Example Text, where a bookmark has been added spanning the second word in paragraph one and the first word in paragraph two. The bookmarkStart and bookmarkEnd elements (§2.13.6.2; §2.13.6.1) specify the location where the bookmark starts and ends, but cannot contain it using a single tag because it spans part of two paragraphs. However, the two tags are part of one group because the id attribute value specifies 0 for both. end example] The example:

XML:

<w:p>
   <w:r>
     <w:t>Example</w:t>
   </w:r>
   <w:bookmarkStart w:id="0" w:name="sampleBookmark" />
   <w:r>
     <w:t xml:space="preserve"> text.</w:t>
   </w:r>
</w:p>
<w:p>
  <w:r>
    <w:t>Example</w:t>
  </w:r>
  <w:bookmarkEnd w:id="0" />
  <w:r>
    <w:t xml:space="preserve"> text.</w:t>
  </w:r>
</w:p>

bookmarkStart

This element specifies the start of a bookmark within a WordprocessingML document. This start marker is matched with the appropriately paired end marker by matching the value of the id attribute from the associated bookmarkEnd element. If no bookmarkEnd element exists subsequent to this element in document order with a matching id attribute value, then this element is ignored and no bookmark is present in the document with this name. If a bookmark begins and ends within a single table, it is possible for that bookmark to cover discontiguous parts of that table which are logically related (e.g. a single column in a table). This type of placement for a bookmark is accomplished (and described in detail) on the colFirst and colLast attributes on this element. [Example: Consider a document with a bookmark which spans half of paragraph one, and part of paragraph two. The bookmarkStart element specifies the start of the region for the testing123 bookmark. This element is then linked to the bookmarkEnd element which also has an id attribute value of 0. end example]

bookmarkEnd

This element specifies the end of a bookmark within a WordprocessingML document. This end marker is matched with the appropriately paired start marker by matching the value of the id attribute from the associated bookmarkStart element. If no bookmarkStart element exists prior to this element in document order with a matching id attribute value, then this element is ignored and no bookmark is present in the document with this name. [Example: Consider a document with a bookmark which spans half of paragraph one, and part of paragraph two. The bookmarkEnd element specifies the end of the region for the bookmark whose bookmarkStart element has an id attribute value of 0. In this case, this refers to the testing123 bookmark. end example]

The following WordprocessingML illustrates an example of content which fufills this constraint:


<w:p>
   <w:r>
     <w:t xml:space="preserve">This is sentence one.</w:t>
   </w:r>
   <w:bookmarkStart w:id="0" w:name="testing123"/>
   <w:r>
     <w:t>This is sentence two.</w:t>
   </w:r>
</w:p>
<w:p>
   <w:r>
     <w:t xml:space="preserve">This </w:t>
   </w:r>
   <w:bookmarkEnd w:id="0"/>
   <w:r>
     <w:t>is sentence three.</w:t>
   </w:r>
</w:p>
Benjamin-T commented 7 years ago

Initial proposal

I would place the method 'add_bookmark()' in the file: docx.text.run.py and the xml object in the docx.oxml.run.py

docx.oxml.run

the bookmarkStart element:

from docx.oxml import OxmlElement
def add_bookmark_start(self, name='test_bookmark'):
    bmrk = OxmlElement('w:bookmarkStart')
    bmrk.set(qn('w:id'), '1')
    bmrk.set(qn('w:name'), name)
    return bmrk

The bookmarkEnd element:
def add_bookmark_end(self):
    bmrk = OxmlElement('w:bookmarkEnd')
    bmrk.set(qn('w:id'), '1')
    return bmrk

Challenges:

docx.text.run

In the docx.text.run.py file the following method should be added:

def add_bookmark(self, name='test_bookmark'):
    self._r.append(self._r.add_bookmark_start(name=name))
    self._r.append(self._r.add_bookmark_end())

Simple test code:

import docx
test = docx.Document()
par = test.add_paragraph()
test_run = par.add_run()
test_run.add_text('Test')
test_run.add_bookmark()
test.save('Test_bookmark.docx')
Benjamin-T commented 7 years ago

Feature

I've never worked with behave, but this is my first shot at a feature text:

The feature text:

 Feature: Add bookmark to a run
  In order to add a bookmark to my run
  As a developer using python-docx
  I need a way to add a bookmark to a run
   Scenario: Add a bookmark
    Given a run
     When I add a bookmark
     Then the bookmark appears in that location in the run.

After some googling, I understand running tests is done like this:

Functional tests:

  1. Go to directory docx\features
  2. run behave from commandline

Unittests:

  1. Go to directory docx\tests
  2. run pytest from commandline. (I do get some errors when running pytests, I'm running windows 10, anybody else getting errors?)

To Do:

scanny commented 7 years ago

Regarding how to add an element, in python-docx, each new element type is represented by its own custom element class.

scanny commented 7 years ago

Regarding the API/protocol for this, we'll need to answer a couple XML Schema-related questions, but the call to add a bookmark will definitely not be on a Run object (a bookmark element is a peer to a run, not its child). It could be on a Paragraph object, but may need to go higher, depending on what the possible "parent" elements of a bookmark start and end are.

I'd be inclined (but still in brainstorming mode) to something like:

bookmark = paragraph.start_bookmark('my bookmark name')
# ... add more runs, maybe more paragraphs
same_or_later_paragraph.end_bookmark(bookmark)

A couple notes:

scanny commented 7 years ago

Regarding acceptance tests, these can only exercise the API, and therefore can only test what the API allows us to detect.

A consequence of this is we need the "read" API before (usually just before) we need the "write" API.

So you probably want to be thinking of how you find bookmarks in a document and what one would want to do with them once found.

This is about the time I start taking a close look at the Microsoft Visual Basic API for Word to see how they're handled there. A Google search on 'vba word bookmark' is also a worthwhile exercise, to get a sense for the VBA calling protocol for bookmarks. This will often uncover unexpected complexities that would be very expensive to discover late in the process.

https://msdn.microsoft.com/en-us/VBA/Word-VBA/articles/object-model-word-vba-reference

Benjamin-T commented 7 years ago

First of all, thanks for the detailed response @scanny. I did not give the general implementation much thought, but I think it makes this project a bit more challenging. I've briefly looked into the XML reference, and you're right, there are a lot of different parent objects.

WordprocessingML - documentation

Apparently, the bookmark XML predates the concepts of the XML: Within a WordprocessingML document, bookmarks refer to arbitrary regions of content which are bounded and have a unique name associated with them. Because bookmarks are a legacy word processing function which predates the concepts of XML and wellformedness, they can start and end at any location within a document's contents and therefore must use the "cross-structure" annotation format described in §2.13.2. This makes this certainly an even more interesting challenge, the only thing that is important is that a bookmarkStart element is joined with a bookmarkEnd element with an equal 'id'-attribute value.

They present the following example, of a bookmark spanning a paragraph:

<w:p>
  <w:r>
    <w:t>Example</w:t>
  </w:r>
  <w:bookmarkStart w:id="0" w:name="sampleBookmark" />
  <w:r>
    <w:t xml:space="preserve"> text.</w:t>
  </w:r>
</w:p>
<w:p>
  <w:r>
    <w:t>Example</w:t>
  </w:r>
  <w:bookmarkEnd w:id="0" />
  <w:r>
    <w:t xml:space="preserve"> text.</w:t>
  </w:r>
</w:p>

It seems that therefore the element should be available at 'any' point. The document lists the possible parent objects as well. Its a large list at first, but a lot of it isn't implemented (yet) in the docx package. I've listed the possible parents, and grouped them a bit, sorted them based on the level and likelihood of implementation. - Ironically, the run object is practically the only object in the XML structure that isn't a possible parent of the bookmark object.

Element Name Element Included?
body body yes
p paragraph yes
tbl Table object yes
tc Table Cell yes
tr Table Row yes
endnote Endnote Content no
footnote footnote content no
ftr footer content no
hdr header content no
hyperlink hyperlink no
fldSimple Simple field, determined by fieldcode no
comment comment no
del deleted run content no
ins inserted run content (like tracked changes) no
moveFrom Move Source Run Content (i.e. tracked changes) no
moveTo Move Destination Run Content (i.e. tracked changes) no
customXml cell level custom xml no
customXml row level custom xml no
customXml inline-level custom xml no
customXml block level custom xml no
deg math sign, degree no
den denominator (math) no
e specified base arg. of math func no
lim math, function name limit no
num math, numerator no
oMath math, Office Math no
fName math, function apply obj. like sin, cos no
sub Subscript (Pre-Sub-Superscript) no?
sup Superscript (Superscript function) no?
docPartBody Contents of Glossary Document Entry no
txbxContent Rich Text Box Content Container no
smartTag specifies presence of a tag around inline structureslno
sdtContent Block-Level Structured Document Tag Content no?
sdtContent Cell-Level Structured Document Tag Content no?
sdtContent Row-Level Structured Document Tag Content no?
sdtContent Inline-Level Structured Document Tag Content no?
rt Phonetic Guide Text no
rubyBase Phonetic Guide Base Text no

Requirements:

For now just thought of two, probably more.

  1. Must have an unique name
  2. Must have a matching id-attribute.

Custom Element Class

For the bookmark to work, we would need two custom element classes: CT_BookmarkStart and CT_BookmarkEnd.

Benjamin-T commented 7 years ago

I've renamed the feature to add_bookmark only. As it looks like it's a better idea to first fully focus on the correct implementation of the bookmark object, and afterwards see how things like captions and cross references interact with the bookmarks.

More detailed VBA function analysis

Like you recommended, I've investigated the VBA bookmark documentation. This provided some more background and insight about what kind of behavior and functionality is expected from the bookmark object.

Bookmarks

A collection of bookmark object that represent the bookmarks in the specified selection, range, or document. Also, the object returns the bookmark corresponding to the index, which is either a key or a number. The numbered index can be used, and the named indices can apparently be sorted alphabetically.
There is also a set of predefined bookmarks - not sure whether these can be accessed by the package.

Bookmark methods:

There are three methods described:

  1. Copy - copy method of the bookmark, requires a new name as input, which will subsequently be the name of the copied bookmark.
  2. Delete - Deletes the bookmark from the bookmarks object.
  3. Select - Selects the part spanned by the bookmark. (is this even possible to implement in this package?) The select method returns a 'Selection' object, which in turn can be used to apply certain formatter.

Bookmark properties:

There is a total of ten properties:

  1. Application - Returns the word application
  2. Column - Boolean, True if specified bookmark is a table column.
  3. Creator - Integer indicating the application which was used to create the object.
  4. Empty - True if specified bookmark is empty
  5. End - Returns or sets the ending character position of a selection, range, or bookmark. Read/write link
  6. Name - Returns the name of the bookmark
  7. Parent - Returns the parent object of the specified bookmark object.
  8. Range - Returns a Range object that represents the portion of a document that's contained in the specified object.
  9. Start - Returns or sets the starting character position of a bookmark. Read/write
  10. StoryType - Returns the story type for the specified range, selection, or bookmark. Read-only

Properties Included or Omitted

I think features 1, 3, 10 can be omitted, and I'm unsure about 8. This leaves features, 2, 4, 5, 6, 7, 8, 9. Which is already a nice list.

Correct me if I'm wrong, but can I conclude from this that there should be some Container object for bookmarks?

Benjamin-T commented 7 years ago

Bookmarks - Object

Continuing on the bookmarks 'container' object, it has the following properties and methods:

Properties

  1. Application - The word application
  2. Count - The total number of items in the bookmarks collections (read only)
  3. Creator - Application integer
  4. Sorting - Sort by location, or sort alphabetically (It should only be a mapping; " This property doesn't affect the order of Bookmark objects in the Bookmarks collection." click
  5. Parent - Returns the object corresponding to the specifeid bookmarks collection (unclear what is exactly meant by this, does it return the parent object of a single bookmark of from all the bookmarks?)
  6. ShowHidden - True if hidden bookmarks are included in the bookmarks collection, Read/Write boolean

    Methods:

  7. Add - Returns a Bookmark object that represents a bookmark added to a range .Add( Name, Range) Name: The name of the bookmark. The name cannot be more than 40 characters or include more than one word. Range: The range of text marked by the bookmark. A bookmark can be set to a collapsed range (the insertion point)
  8. Exists - Determines whether the specified bookmark exists. Returns True if the bookmark exists.
  9. Item - Returns an individual bookmark object in a collection (takes an index as input)
scanny commented 7 years ago

Okay, regarding the parentage aspect, I'm thinking the key parent elements to start with are w:body and w:p. Maybe the table ones come in later, I'll have to noodle that a bit more, but table cells contain paragraphs, so you'd already be able to put one inside a cell, and I can't think of any super-common uses for a bookmarked cell, row, or table at the moment.

One possibility is to implement the bookmark functionality as a mixin, so it can be added to multiple "locations" without duplicating the code. That would also provide a uniform interface, regardless of the parent object. Maybe we call that BookmarkParentMixin or maybe something a little better. That approach would also make it a lot easier to add different "parents" later on.

scanny commented 7 years ago

Regarding the MS-API for bookmarks:

scanny commented 7 years ago

Regarding the MS-API for the Bookmarks collection:

As for the methods:

So in short, I think we're good with a Bookmarks object that has the behaviors I outlined in the last bullet of the previous comment.

Benjamin-T commented 7 years ago

Allright, summarizing: If I understand correctly we're going to develop a Bookmarks object, which will contain a bookmark object.

the Bookmark object will initially only contain the following properties / attributes:

Bookmark - Object

Attributes / properties

  1. Name - Returns the name of the bookmark
  2. Start - Returns or sets the starting character position of a bookmark. Read/write
  3. End - Returns or sets the ending character position of a selection, range, or bookmark. Read/write link

    These could be added in the 'future'.

  4. Empty - True if specified bookmark is empty
  5. Parent - Returns the parent object of the specified bookmark object.
  6. Column - Boolean, True if specified bookmark is a table column.

The empty property could come in handy when one wants to refer to the text in the bookmark. Both the parent and column properties can be included when the bookmark is extended to include the table, since this introduces a possible differen parent? (in its current form, it will always be the document, but in the case of the table, it will be the table object, right? )

Methods / setters

  1. start_bookmark
  2. end_bookmark
  3. bookmark text - not sure where to put this, but the bookmark can contain a specific text.

Bookmarks - Object

The bookmarks object will be placed within the document object.

Attributes / properties:

  1. Count property len()
  2. Exists might be required as duplicated bookmarks can result in unwanted behaviors.

methods

  1. add_bookmark - returns a bookmark object which has the startstart_bookmark and end_bookmark methods.
  2. three get-methods:
    • get_by_index
    • get_by_name
    • get_by_id() -- not sure whether word edits this, and what kind of added value this method has.

I'm currently trying to develop some kind of basic version of the above.

Benjamin-T commented 7 years ago

Without thinking about the python-docx package too much, I had something like this in mind. I'm familiar with a lot of the super() and metaclass concepts, therefore implementing this will be a challenge ;)

from docx.oxml.ns import qn
from docx.oxml import OxmlElement

class Bookmark(object):
    def __init__(self):
        self.bmrk_start = OxmlElement('w:bookmarkStart')
        self.bmrk_end = OxmlElement('w:bookmarkEnd')
        self.name = None
        self.id = None

    def set_id(self, id):
        self.id = id
        self.bmrk_start.set(qn('w:id'), str(id))
        self.bmrk_end.set(qn('w:id'), str(id))
        return self

    def set_name(self, name):
        self.name = name
        return self

    def start_bookmark(self):
        return self.bmrk_start

    def end_bookmark(self):
        return self.bmrk_en

class Bookmarks(object):
    def __init__(self):
        self.bookmarks = []
        self.mapping = []

    def __iter__(self):
        for bookmark in self.bookmarks:
            yield bookmark

    @property
    def count(self):
        return len(self.bookmarks)

    def add_bookmark(self, name):
        idx = self.count + 1
        bm = Bookmark().set_id(idx)
        bm.set_name(name)
        self.bookmarks.append(bm)
        self.mapping.append(name)
        return bm

    def get_bookmark(self, name):
        if name in self.mapping:
            return self.mapping.index(name)

    def item(self, idx):
        return self.bookmarks[idx]
scanny commented 7 years ago

Okay, so lets start with your proposed interface for Bookmark.

scanny commented 7 years ago

Regarding implementation details, let's start at the "outside" with Document.bookmarks:

class Document(...):

    # ---other properties and methods ...

    @lazyproperty
    def bookmarks(self):
        return Bookmarks(self._element)

So on to the Bookmarks class:

from collections import Sequence

class Bookmarks(Sequence):

    def __init__(self, document_elm):
        super(Bookmarks, self).__init__()
        self._document = self._element = self.document_elm

    def __iter__(self):
        for bookmarkStart in self._bookmarkStarts:
            yield Bookmark(bookmarkStart)

    def __getitem__(self, idx):
        bookmarkStart = self._bookmarkStarts[idx]
        return Bookmark(bookmarkStart)

    def __len__(self):
        return len(self._bookmarkStarts)

    def get(self, name, default=None):
        for bookmarkStart in self._bookmarkStarts:
            if bookmarkStart.name == name:
                return Bookmark(bookmarkStart)
        return default

    @property
    def _bookmarkStarts(self):
        return self._document.xpath('.//w:bookmarkStart')

I'm out of time right now, let's leave it there and let me know how you digest this chunk, then we can get into the Bookmark class and what the oxml layer beneath it looks like. I think you'll find that pretty elegant and easy to understand once you see it more closely.

Benjamin-T commented 6 years ago

First of all, thanks for all the ideas and input. I really feel like I'm learning a lot here. And I really appreciate you taking the time to explain me some of the concepts.

I know its a bit too soon, but I really wanted to give it a go, so I've tried to create the CT_BookmarkStart and CT_bookmarkEnd elements. I was not sure where to put them, so I created the bookmarks.py file.

If you could please take a look and see if I'm on the right track? I really struggled with understanding how these CT_.... objects were supposed to be created. So I ended up with the ones I've commited. The code works, but that's not the only goal here ;)

github-link

scanny commented 6 years ago

If you want to submit code for review, a pull request (PR) is the best way to do it. You might need to to a little research first. The key thing is to be able to make and submit changes without creating a new pull request each time. Basically, the way you do this is creating a new branch, based on master (on your fork) and making that branch the base of the PR. When you make changes, you rebase that branch to suit, and the PR is automatically updated (when you push that branch).

If you're not familiar with rebasing with Git you'll need to study up; it's an essential skill for this kind of interaction.

Then I can make comments directly on the code changes you submitted; you can then make changes, rebase the branch and push, and submit the revised code.

Regarding your first stab:

I'd recommend you focus on getting the pull request bit working and show me where you are on the analysis document, in particular the XML Schema excerpts, and then we can come back to these two definitions.

Btw, there's nothing wrong (in my mind) with jumping into the implementation, just to get straight in your mind how it's all going to work. I do that all the time (I call it 'spiking'). The only problem is thinking you're done once you have it working. That's where most folks stop and that's why they don't get a merge. From there you are in a good position to develop the acceptance and unit tests, flesh out the documentation required, and form it into a series of coherent commits that can be confidently committed without a lot of rework. All those things are required to actually add the feature without degrading the existing code base.

Benjamin-T commented 6 years ago

Allright! thanks for the feedback, I guess i have some studying to do.

I got rid of the set methods, however, the CT_BookmarkEnd and CT_BookmarkStart names do not rename to 'w:bookmarkEnd' and 'w:bookmarkStart'. But I dont understand why and how I can change this?

class CT_BookmarkStart(BaseOxmlElement):
    """
    ``<w:bookmarkStart>`` element
    """
    bookmark = OneOrMore('w:bookmarkStart', ST_String)
    name = RequiredAttribute('w:name', ST_String)
    bmrk_id = RequiredAttribute('w:id', ST_RelationshipId)
    def add_bookmark_start(self, id, name='test_bookmark'):
        bmrk = CT_BookmarkStart()
        bmrk.bookmark = 'w:bookmarkStart'
        bmrk.bmrk_id = str(id)
        bmrk.name = name
        self.append(bmrk)
        return bmrk

This leads to the following xml:

    <w:p>
      <w:r>
        <w:t>Hello</w:t>
      </w:r>
      <CT_BookmarkStart w:id="2" w:name="text_to_be_bookmarked" />
      <w:r>
        <w:t>Hello</w:t>
      </w:r>
      <CT_BookmarkEnd w:id="2" />
    </w:p>

Clearly Im overlooking something...

Benjamin-T commented 6 years ago

After hours of puzzeling and camparing code, I got it working! the 'catch' was 'registering' the bookmarkStart elements as xml elements using the self._add_bookmarkstart() method in the paragraph object.

I'also created a pull request like you asked; I purposly committed the 'removed obsolete code' commit to check if the pull request is automatically updated with the new code. Let me know if I need to change some settings.

the \oxml\paragraph.py file:

class CT_P(BaseOxmlElement):
    """
    ``<w:p>`` element, containing the properties and text for a paragraph.
    """
    pPr = ZeroOrOne('w:pPr')
    r = ZeroOrMore('w:r')
    bookmarkStart = ZeroOrMore('w:bookmarkStart')
    bookmarkEnd = ZeroOrMore('w:bookmarkEnd')

the text\paragraph.py file:

from __future__ import (
    absolute_import, division, print_function, unicode_literals
)

from ..enum.style import WD_STYLE_TYPE
from .parfmt import ParagraphFormat
from .run import Run
from ..shared import Parented

from docx.shared import ElementProxy
from docx.oxml.bookmark import CT_BookmarkStart, CT_BookmarkEnd 

class Paragraph(Parented):
    """
    Proxy object wrapping ``<w:p>`` element.
    """
    def __init__(self, p, parent):
        super(Paragraph, self).__init__(parent)
        self._p = self._element = p

    def add_bookmark_start(self, id, name):
        bmrk = self._p._add_bookmarkStart()
        bmrk.name = name
        bmrk.bmrk_id = str(id)
        return self._p.append(bmrk) 

    def add_bookmark_end(self, id):
        bmrk = self._p._add_bookmarkEnd()
        bmrk.bmrk_id =str(id)
        return self._p.append(bmrk)

and the elements:

class CT_BookmarkStart(BaseOxmlElement):
    """
    ``<w:bookmarkStart>`` element
    """
    name = RequiredAttribute('w:name', ST_String)
    bmrk_id = RequiredAttribute('w:id', ST_RelationshipId)

class CT_BookmarkEnd(BaseOxmlElement):
    """
    The ``<w:bookmarkEnd>`` element
    """
    bmrk_id = RequiredAttribute('w:id', ST_RelationshipId)
Benjamin-T commented 6 years ago

Since I finally got a half decent version working, I'll start focussing more on the development of the analysis documentation. Also I just found out that 'location' of the feature is within the CT_Body and not in the CT_P I've currently placed it.

The reason I placed it in the paragraph type is that the bookmarks are placed in a paragraph. Therefore i assumed that that would be the logical location.

This is interesting; the element type of the bookmarkEnd element is CT_MarkupRange

        <xsd:element name="bookmarkStart"               type="CT_Bookmark"/>
        <xsd:element name="bookmarkEnd"                 type="CT_MarkupRange"/>
  <xsd:complexType name="CT_Body">  <!-- denormalized -->
    <xsd:sequence>
      <xsd:choice minOccurs="0" maxOccurs="unbounded">
        <xsd:element name="p"                           type="CT_P"/>
        <xsd:element name="tbl"                         type="CT_Tbl"/>
        <xsd:element name="sdt"                         type="CT_SdtBlock"/>
        <xsd:element name="customXml"                   type="CT_CustomXmlBlock"/>
        <xsd:element name="altChunk"                    type="CT_AltChunk"/>
        <xsd:element name="proofErr"                    type="CT_ProofErr"/>
        <xsd:element name="permStart"                   type="CT_PermStart"/>
        <xsd:element name="permEnd"                     type="CT_Perm"/>
        <xsd:element name="bookmarkStart"               type="CT_Bookmark"/>
        <xsd:element name="bookmarkEnd"                 type="CT_MarkupRange"/>
        <xsd:element name="moveFromRangeStart"          type="CT_MoveBookmark"/>
        <xsd:element name="moveFromRangeEnd"            type="CT_MarkupRange"/>
        <xsd:element name="moveToRangeStart"            type="CT_MoveBookmark"/>
        <xsd:element name="moveToRangeEnd"              type="CT_MarkupRange"/>
        <xsd:element name="commentRangeStart"           type="CT_MarkupRange"/>
        <xsd:element name="commentRangeEnd"             type="CT_MarkupRange"/>
        <xsd:element name="customXmlInsRangeStart"      type="CT_TrackChange"/>
        <xsd:element name="customXmlInsRangeEnd"        type="CT_Markup"/>
        <xsd:element name="customXmlDelRangeStart"      type="CT_TrackChange"/>
        <xsd:element name="customXmlDelRangeEnd"        type="CT_Markup"/>
        <xsd:element name="customXmlMoveFromRangeStart" type="CT_TrackChange"/>
        <xsd:element name="customXmlMoveFromRangeEnd"   type="CT_Markup"/>
        <xsd:element name="customXmlMoveToRangeStart"   type="CT_TrackChange"/>
        <xsd:element name="customXmlMoveToRangeEnd"     type="CT_Markup"/>
        <xsd:element name="ins"                         type="CT_RunTrackChange"/>
        <xsd:element name="del"                         type="CT_RunTrackChange"/>
        <xsd:element name="moveFrom"                    type="CT_RunTrackChange"/>
        <xsd:element name="moveTo"                      type="CT_RunTrackChange"/>
        <xsd:element ref="m:oMathPara"/>
        <xsd:element ref="m:oMath"/>
      </xsd:choice>
      <xsd:element name="sectPr" minOccurs="0" maxOccurs="1" type="CT_SectPr"/>
    </xsd:sequence>
  </xsd:complexType>
Benjamin-T commented 6 years ago

from the wml.xsd excerpts:

  <xsd:complexType name="CT_BookmarkRange">
    <xsd:complexContent>
      <xsd:extension base="CT_MarkupRange">
        <xsd:attribute name="colFirst" type="ST_DecimalNumber" use="optional">
          <xsd:annotation>
            <xsd:documentation>First Table Column Covered By Bookmark</xsd:documentation>
          </xsd:annotation>
        </xsd:attribute>
        <xsd:attribute name="colLast" type="ST_DecimalNumber" use="optional">
          <xsd:annotation>
            <xsd:documentation>Last Table Column Covered By Bookmark</xsd:documentation>
          </xsd:annotation>
        </xsd:attribute>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="CT_Bookmark">
    <xsd:complexContent>
      <xsd:extension base="CT_BookmarkRange">
        <xsd:attribute name="name" type="ST_String" use="required">
          <xsd:annotation>
            <xsd:documentation>Bookmark Name</xsd:documentation>
          </xsd:annotation>
        </xsd:attribute>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:group name="EG_RangeMarkupElements">
    <xsd:choice>
      <xsd:element name="bookmarkStart" type="CT_Bookmark">
        <xsd:annotation>
          <xsd:documentation>Bookmark Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="bookmarkEnd" type="CT_MarkupRange">
        <xsd:annotation>
          <xsd:documentation>Bookmark End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="moveFromRangeStart" type="CT_MoveBookmark">
        <xsd:annotation>
          <xsd:documentation>Move Source Location Container - Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="moveFromRangeEnd" type="CT_MarkupRange">
        <xsd:annotation>
          <xsd:documentation>Move Source Location Container - End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="moveToRangeStart" type="CT_MoveBookmark">
        <xsd:annotation>
          <xsd:documentation>Move Destination Location Container - Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="moveToRangeEnd" type="CT_MarkupRange">
        <xsd:annotation>
          <xsd:documentation>Move Destination Location Container - End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="commentRangeStart" type="CT_MarkupRange">
        <xsd:annotation>
          <xsd:documentation>Comment Anchor Range Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="commentRangeEnd" type="CT_MarkupRange">
        <xsd:annotation>
          <xsd:documentation>Comment Anchor Range End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlInsRangeStart" type="CT_TrackChange">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Insertion Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlInsRangeEnd" type="CT_Markup">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Insertion End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlDelRangeStart" type="CT_TrackChange">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Deletion Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlDelRangeEnd" type="CT_Markup">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Deletion End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlMoveFromRangeStart" type="CT_TrackChange">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Move Source Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlMoveFromRangeEnd" type="CT_Markup">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Move Source End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlMoveToRangeStart" type="CT_TrackChange">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Move Destination Location Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlMoveToRangeEnd" type="CT_Markup">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Move Destination Location End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
    </xsd:choice>
  </xsd:group>
scanny commented 6 years ago

Hmm, this is interesting. I haven't come across annotation and documentation strings in the schema before. Where did you pull these from?

The schema files I use are in the code tree here (and in the directory just above): https://github.com/python-openxml/python-docx/blob/master/ref/xsd/wml.xsd

scanny commented 6 years ago

Also I just found out that 'location' of the feature is within the CT_Body and not in the CT_P I've currently placed it.

The reason I placed it in the paragraph type is that the bookmarks are placed in a paragraph. Therefore i assumed that that would be the logical location.

Not sure what you mean by the "feature". This term is usually used here to indicate the API methods and properties added to implement some useful behavior.

But bookmarks appear to be part of both the body and paragraph objects and some others as well. So I'm not sure there's a clear "locale" for these, unlike most other objects in the code base.

A bookmark can be added to the body (document from the API perspective) just as well as inside a paragraph, and it's possible I expect that one can start in one place and end in the other.

So something to noodle on :)

Benjamin-T commented 6 years ago

Hmm, this is interesting. I haven't come across annotation and documentation strings in the schema before. Where did you pull these from? The schema files I use are in the code tree here (and in the directory just above): https://github.com/python-openxml/python-docx/blob/master/ref/xsd/wml.xsd

this is a really interesting document:

xml-reference

scanny commented 6 years ago

Ah, of course, the ECMA spec, I had forgotten about that. It's been a while, so I'm not sure of the current status, but when I started the project I was using the ECMA spec until I started puzzling over why certain elements weren't in there or were named differently.

The short story (at the time at least) is that Microsoft uses the ISO spec (ISO/IEC 29500).

Anyway, the ECMA spec might be useful for reference, but I believe the ISO spec is still the definitive one. I'll be looking for XML Schema excerpts from the ISO spec/schemas.

Benjamin-T commented 6 years ago

@Scanny, I've studied up on the rebasing, and I think I got it figured out. It's a really neat trick! This makes collaboration a lot easier. Normally, I was always 'worried' about the number of commits for a certain fix, but with this rebasing you can combine / and reorder and hence restructure your work afterworks.

Having said that, I've taken a stab at the feature documentation. See pull request

With regard to the objects, we still had to figure out how to link the two parts of a Bookmark (bookmarkStart and bookmarkEnd) together, if there is like a paragraph in between.

My suggestion would be to make two ways to add a bookmark;

# This adds a paragraph containing the start and stop of the bookmark
document.add_bookmark(name='test', inplace=True)

# This places a bookmark start:
bookmark_obj_1 = document.add_bookmark(name='test_1')
bookmark_obj_1.start_bookmark()

# Start some new run, with text which maybe contains another bookmark:
par = document.add_paragraph()
bookmark_obj_2 = par.add_bookmark(name'test_2')
bookmark.start_bookmark()
# In this case, the bookmark contains the text, which could be referred to
par.add_run('sometext')
bookmark_obj_2.end_bookmark()

# Add the end of the first bookmark:
bookmark_obj_1.end_bookmark()

This does mean that one has to correctly place the bookmark object to be able to succesfully add the end the bookmark. I'm not entirely sure about the exact implementation. I think it has some challenges;

I'll try to get an example up and running.

Benjamin-T commented 6 years ago

So, I've had my first struggles with git, but somehow managed to figure it out. I think it had something to do with the fact that I use both Sourcetree and the git shell commands. I think i forgot to fetch from all remotes and that is where the fun started..

What I've tried to do, is implement the Bookmarks and Bookmark object. And make them available in the document class. Currently they do generate a bookmark, but this bookmark is placed at the end of the Body.

using this code:

doc = Document()
asdf = doc.bookmark()
new_bookmark = asdf.add_bookmark()
new_bookmark.add_bookmark_start(name='test_1', id=1)
par = doc.add_paragraph()
par.add_run('test')
new_bookmark.add_bookmark_end(id=1)
new_bookmark_2 = asdf.add_bookmark()
new_bookmark_2.add_bookmark_start(name='test_2', id=2)
new_bookmark_2.add_bookmark_end(id=2)
doc.save('tab.docx')
<w:document>
  <w:body>
    <w:p>
      <w:r>
        <w:t>test</w:t>
      </w:r>
    </w:p>
    <w:sectPr w:rsidRPr="0006063C" w:rsidR="00FC693F" w:rsidSect="00034616">
      <w:pgSz w:w="12240" w:h="15840" />
      <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0" />
      <w:cols w:space="720" />
      <w:docGrid w:linePitch="360" />
    </w:sectPr>
    <w:bookmarkStart w:name="test_1" w:id="1" />
    <w:bookmarkEnd w:id="1" />
    <w:bookmarkStart w:name="test_2" w:id="2" />
    <w:bookmarkEnd w:id="1" />
  </w:body>
</w:document>

I hope I have some time tomorrow to figure it out, I'm thinking that the bookmark should be within a paragraph instance.

regards, Ben

scanny commented 6 years ago

Okay, so here's a little about what I think the code structure will look like. These sorts of things (proxy objects) all look the same in many respects. The term proxy object is just our name for objects like Paragraph and Document that "wrap" (proxy) an XML element, providing API calls that raise the abstraction level from the lxml/oxml level up to something more suitable as a developer interface.

Let's start with the calling protocol I think we're after. This might bear more discussion, but this will do for a start:

>>> document = Document()
>>> bookmark = document.start_bookmark(name='test')
>>> document.add_paragraph('test bookmark body')
<Paragraph object at 0x...>
>>> document.end_bookmark(bookmark)

I've explicitly named the parameters, just for clarity. I expect there's no good reason not to allow them to be identified by position, e.g. .start_bookmark('test') should be allowed. Later there will also need to be Paragraph.start_bookmark() and Paragraph.end_bookmark() and perhaps others, all of which can be mixed and matched (start in any, end in same or different ). But this is a good first step, much of which will be reusable or adaptable for the other cases.

So we'd want the code above to produce this XML:

<w:document>
  <w:body>
    <w:bookmarkStart w:name="test" w:id="1"/>
    <w:p>
      <w:r>
        <w:t>test</w:t>
      </w:r>
    </w:p>
    <w:bookmarkEnd w:id="1"/>
    <w:sectPr w:rsidRPr="0006063C" w:rsidR="00FC693F" w:rsidSect="00034616">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>

The first challenge is to get the bookmarkStart and End elements to appear above the sentinel w:sectPr element. It looks like that can be done by adding this line to CT_Body in docx.oxml.document:

bookmarkStart = ZeroOrMore('w:bookmarkStart', successors=('w:sectPr',))

The implementation in docx.document._Body would be something like this::

def start_bookmark(self, name):
    bookmarkStart = self._body.add_bookmarkStart(name)
    return Bookmark(bookmarkStart)

This leaves some of the work to other objects. In particular, the bookmark id must be resolved and assigned by CT_Body.add_bookmark(). There is already an .add_bookmark() method on CT_Body, placed there by the metaclass as a consequence of the ZeroOrMore() call above. That will need to be overridden to give it that further functionality. An alternative would be to use a different name like .add_bookmarkStart_with_next_id() or something, but I typically just override. I'd have to look around for another element class that does that to what all needs to go in there.

In addition, I expect there needs to be a ._next_available_bookmark_id helper property on CT_Body that can take care of that little chore. .add_bookmarkStart() would automatically assign the next available bookmark id and then it would be available from the bookmarkStart element that's passed to the Bookmark constructor.

This may be a good place to mention that a proxy object (like Bookmark) should have no state of its own. All state is in the XML. So, for example, if you wanted to add a property Bookmark.id, when called, it would go to the XML to find out and return that value. There would be no ._id instance variable hanging around to get out of sync with the XML. Its implementation would be something like return self._element.id.

Note that .end_bookmark() is on Document, and that another method of the same name will need to be on Paragraph and anywhere else a bookmark can end. Just because it starts in Document (under the w:body element) doesn't mean it has to end there. But in either case, a bookmark can't end itself. It needs reference to a context element (e.g. w:body, w:paragraph, etc.) and so the most natural interface I think is on the proxy object for that element (e.g. Body or Paragraph respectively). I suppose an interface like Bookmark.start_on(paragraph) and Bookmark.end_on(document) is possible, but it's not screaming out to me so far as a superior design.

Does that give you something to noodle on for next steps?

scanny commented 6 years ago

@scanny, I've studied up on the rebasing, and I think I got it figured out. It's a really neat trick! This makes collaboration a lot easier. Normally, I was always 'worried' about the number of commits for a certain fix, but with this rebasing you can combine / and reorder and hence restructure your work afterworks.

It is a neat trick, isn't it! :)

I call when I'm getting the commits cleaned up "grooming" the commit history/sequence. It's not always worthwhile in every situation, but when you're collaborating it's a really nice way to make your commit(s) reflect a clear, single focus and be separately mergeable.

Benjamin-T commented 6 years ago

So I've succeeded in getting the elements above the sentinel object, but am now stuck on the item id's.

my initial thought is making the Bookmarks object available within the _Body object. This allows for getting 'new' id's. However, I'm still trying to figure out a neat solution for identifying bookmarkStarts without bookmarkEnds.

current approach:


The docx.document.py code:

class Bookmark(ElementProxy):
    def __init__(self, doc_element):
        super(Bookmark, self).__init__(doc_element)
        self._element = doc_element

    @property
    def _id(self):
        return self._element.bmrk_id

    @property
    def _name(self):
        return self._element.name
class _Body(BlockItemContainer):
    """
    Proxy for ``<w:body>`` element in this document, having primarily a
    container role.
    """

    def __init__(self, body_elm, parent):
        super(_Body, self).__init__(body_elm, parent)
        self._body = body_elm
        self._parent = parent

    def clear_content(self):
        """
        Return this |_Body| instance after clearing it of all content.
        Section properties for the main document story, if present, are
        preserved.
        """
        self._body.clear_content()
        return self

    def add_bookmarkStart(self, name):
        bookmarkStart = self._body._add_bookmarkStart()
        bookmarkStart.name = name
        bookmarkStart.bmrk_id = self._parent._Bookmarks._next_id
        return Bookmark(bookmarkStart)

    def add_bookmarkEnd(self):
        open_id = self._parent._Bookmarks._unclosed_bookmark
        bookmarkEnd = self._body._add_bookmarkEnd()
        bookmarkEnd.bmrk_id = open_id
        return Bookmark(bookmarkEnd)

and the Bookmarks class:

class Bookmarks(Sequence):
    def __init__(self, document_elm):
        super(Bookmarks, self).__init__()
        self._document = self._element = document_elm

    def __iter__(self):
        for bookmarkStart in self._bookmarkStarts:
            yield Bookmark(bookmarkStart)

    def __getitem__(self, idx):
        bookmarkStart = self._bookmarkStarts[idx]
        return Bookmark(bookmarkStart)

    def __len__(self):
        return len(self._bookmarkStarts)

    def get(self, name, default=None):
        for bookmarkStart in self._bookmarkStarts:
            if bookmarkStart.name == name:
                return Bookmark(bookmarkStart)
        return default

    @property
    def _bookmarkStarts(self):
        return self._document.xpath('.//w:bookmarkStart')

    @property
    def _bookmarkEnds(self):
        return self._document.xpath('.//w:bookmarkEnd')

    @property
    def _next_id(self):
        return str(len(self._bookmarkStarts))
Benjamin-T commented 6 years ago

Hmm, I'm not sure about this solution. This checks for open bookmarks, but if there are two bookmarks open there is no way of choosing one of the two. Therefore, we might need to go for a add_bookmark_end(name='name_of_specific_bookmark')

This is the current code outline:

doc = Document()
bmark_obj = doc.bookmarks()
test = doc.start_bookmark('test_one')
par = doc.add_paragraph()
par.add_run('test')
doc.end_bookmark()
doc.start_bookmark('test_two')
par_2 = doc.add_paragraph()
par_2.add_run('test_line_2')
doc.end_bookmark()
doc.save('test.docx')

results in:

<w:document>
  <w:body>
    <w:bookmarkStart w:name="test_one" w:id="1" />
    <w:p>
      <w:r>
        <w:t>test</w:t>
      </w:r>
    </w:p>
    <w:bookmarkEnd w:id="1" />
    <w:bookmarkStart w:name="test_two" w:id="2" />
    <w:p>
      <w:r>
        <w:t>test_line_2</w:t>
      </w:r>
    </w:p>
    <w:bookmarkEnd w:id="2" />
    <w:sectPr w:rsidRPr="0006063C" w:rsidR="00FC693F" w:rsidSect="00034616">
      <w:pgSz w:w="12240" w:h="15840" />
      <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0" />
      <w:cols w:space="720" />
      <w:docGrid w:linePitch="360" />
    </w:sectPr>
  </w:body>
</w:document>
scanny commented 6 years ago

We should probably keep all the conversation in the pull request going forward, so it's all in one piece. I'll respond to these here but after that, let's do any new ones in the Conversations tab of the pull request :)

scanny commented 6 years ago

Okay, I don't completely understand your code for the "finding the ids" bit, but I'm thinking the solution is simpler than you think, basically that we use XPath to find the ids when we need them (and that we don't store them anywhere outside the XML).

I'm also not sure we need an "unclosed" test. I'm open to argumentation to the contrary, but I'm not seeing it yet. It looks to me like there are a certain number of w:bookmarkStart elements in the XML, each one represents a bookmark, and a bookmark can be either open or closed. We don't need to determine whether it's open or not unless someone cares to ask or if we need to do an operation that only works on a closed bookmark. We can certainly have an .is_closed property on Bookmark if we need it (I don't see yet that we do though) and we may very well want a ._bookmarkEnd property on Bookmark, although I haven't seen the use case for that yet (I suspect it's there).

I do think we're on the wrong side of YAGNI here, working with methods and properties we don't have a clear need for yet. At times like these, I always return to the use-case/protocol part of the analysis and narrow my focus down to something entirely concrete. Basically answer the question "As a developer using python-docx, I need to be able to {do what exactly?}"

Anyway, back to the "finding the bookmarkEnd elements" question.

So say we have a bookmarkStart and we want to find its matching bookmarkEnd. First of all, we should only do this in the oxml layer (not the proxy layer) and we'd do it something like this:

class CT_BookmarkStart(...):

    @property
    def matching_bookmarkEnd(self):
        """Return first `w:bookmarkEnd` element having matching id, None if not present."""
        root_element = self.getroottree().getroot() 
        matching_bookmarkEnds = root_element.xpath(
            './/w:bookmarkEnd[@w:id=\'%d\']' % self.id
        )
        if not matching_bookmarkEnds:
            return None
        return matching_bookmarkEnds[0]

If you're available for a call or a Google hangout, it might be a faster way to work through some of these initial concepts, let me know :)