Feature: Add Bookmark - Githubissues

python-openxml / python-docx

Create and modify Word documents with Python

MIT License

4.61k stars 1.13k forks source link

Feature: Add Bookmark #425

Open Benjamin-T opened 7 years ago

Benjamin-T commented 7 years ago

As multiple issues have stated before, it would be great if we the captions supported in python-docx. Previous issues already should temporary fixes to similar problems like the IndexedEntry (#137) and the figure caption in (#359). Also bookmarks have been investigated: (#109) As I understood from: feature analysis: section columns and Feature: Paragraph.add_hyperlink() (wip) Is that there are a few mandatory steps to take before a pull request is accepted:

Feature analysis:
1. XML schema analysis
2. Candidate API (preferably based on the VBA API
3. Specimen XML
Tests:
1. Acceptance Test
2. Unittests

Finally: Focus is key, one feature should only by a single functionality, to quote @scanny:

No code gets merged without its corresponding unit test.

No unit test gets written without a failing acceptance test.

Each commit is one atomic piece of functionality, generally a single property or method.

Now I've collected the available issues and pull request my question is, how to go from here. I would assume that we start at the beginning: Feature analysis

Adding the caption to a paragraph isn't directly the issue, the main goal is to create a feature which can be used together with a crossreference. Therefore the number of references or bookmarks that are created should be stored in an object. This object should be available to the caption method, in order to be support a sort of 'guessed figure/table number'. What I mean by this is best explained with an example:

The proposed solution in #359 works. Only he user first has to create a print view to force a fieldcode refresh to make the numbers visible. Therefore, it would be nice if the caption method could already place a "guessed" number. (Comparable as how Word behaves)

scanny commented 7 years ago

Hi @Benjamin-T, yes, creating an analysis document (aka. enhancement proposal) is the right first step, and that is independently committable whether you go on to implement or not.

To do that, you should fork this repo and create a new branch based on master called feature/captions (or whatever is descriptive and short, but starts with feature/...). Add the analysis document and create a pull request whenever you've gotten it far enough along. I'm happy to provide feedback.

Benjamin-T commented 7 years ago

Thanks for the reply @scanny Upon inspection of the code, I think there is a more fundamental feature required in order to make the captions and crossreferences work. They all seem to use a bookmark.

Therefore I would propose first to develop the bookmark code, and subsequently implement the captions and references.

From the xml analysis below I would propose to add two xml elements; the bookmarkStart and bookmarkEnd.
The functionality should be available in a run.

Question: How do I add such a xml element? Should it be placed in the

class CT_R(BaseOxmlElement):

object?

Interesting link: openxml standard page 1032 - Bookmarks Quote of above document about the bookmarks:

Bookmarks

Within a WordprocessingML document, bookmarks refer to arbitrary regions of content which are bounded and have a unique name associated with them. Because bookmarks are a legacy word processing function which predates the concepts of XML and wellformedness, they can start and end at any location within a document's contents and therefore must use the "cross-structure" annotation format described in §2.13.2. [Example: Consider the following WordprocessingML markup for two paragraphs, each reading Example Text, where a bookmark has been added spanning the second word in paragraph one and the first word in paragraph two. The bookmarkStart and bookmarkEnd elements (§2.13.6.2; §2.13.6.1) specify the location where the bookmark starts and ends, but cannot contain it using a single tag because it spans part of two paragraphs. However, the two tags are part of one group because the id attribute value specifies 0 for both. end example] The example:

XML:

<w:p>
   <w:r>
     <w:t>Example</w:t>
   </w:r>
   <w:bookmarkStart w:id="0" w:name="sampleBookmark" />
   <w:r>
     <w:t xml:space="preserve"> text.</w:t>
   </w:r>
</w:p>
<w:p>
  <w:r>
    <w:t>Example</w:t>
  </w:r>
  <w:bookmarkEnd w:id="0" />
  <w:r>
    <w:t xml:space="preserve"> text.</w:t>
  </w:r>
</w:p>

bookmarkStart

This element specifies the start of a bookmark within a WordprocessingML document. This start marker is matched with the appropriately paired end marker by matching the value of the id attribute from the associated bookmarkEnd element. If no bookmarkEnd element exists subsequent to this element in document order with a matching id attribute value, then this element is ignored and no bookmark is present in the document with this name. If a bookmark begins and ends within a single table, it is possible for that bookmark to cover discontiguous parts of that table which are logically related (e.g. a single column in a table). This type of placement for a bookmark is accomplished (and described in detail) on the colFirst and colLast attributes on this element. [Example: Consider a document with a bookmark which spans half of paragraph one, and part of paragraph two. The bookmarkStart element specifies the start of the region for the testing123 bookmark. This element is then linked to the bookmarkEnd element which also has an id attribute value of 0. end example]

bookmarkEnd

This element specifies the end of a bookmark within a WordprocessingML document. This end marker is matched with the appropriately paired start marker by matching the value of the id attribute from the associated bookmarkStart element. If no bookmarkStart element exists prior to this element in document order with a matching id attribute value, then this element is ignored and no bookmark is present in the document with this name. [Example: Consider a document with a bookmark which spans half of paragraph one, and part of paragraph two. The bookmarkEnd element specifies the end of the region for the bookmark whose bookmarkStart element has an id attribute value of 0. In this case, this refers to the testing123 bookmark. end example]

The following WordprocessingML illustrates an example of content which fufills this constraint:


<w:p>
   <w:r>
     <w:t xml:space="preserve">This is sentence one.</w:t>
   </w:r>
   <w:bookmarkStart w:id="0" w:name="testing123"/>
   <w:r>
     <w:t>This is sentence two.</w:t>
   </w:r>
</w:p>
<w:p>
   <w:r>
     <w:t xml:space="preserve">This </w:t>
   </w:r>
   <w:bookmarkEnd w:id="0"/>
   <w:r>
     <w:t>is sentence three.</w:t>
   </w:r>
</w:p>

Benjamin-T commented 7 years ago

Initial proposal

I would place the method 'add_bookmark()' in the file: docx.text.run.py and the xml object in the docx.oxml.run.py

docx.oxml.run

the bookmarkStart element:

from docx.oxml import OxmlElement
def add_bookmark_start(self, name='test_bookmark'):
    bmrk = OxmlElement('w:bookmarkStart')
    bmrk.set(qn('w:id'), '1')
    bmrk.set(qn('w:name'), name)
    return bmrk

The bookmarkEnd element:
def add_bookmark_end(self):
    bmrk = OxmlElement('w:bookmarkEnd')
    bmrk.set(qn('w:id'), '1')
    return bmrk

Challenges:

There should be a link between the start and end object? At least the 'id' should be identical for each individual bookmark.
A bookmark name needs to given

docx.text.run

In the docx.text.run.py file the following method should be added:

def add_bookmark(self, name='test_bookmark'):
    self._r.append(self._r.add_bookmark_start(name=name))
    self._r.append(self._r.add_bookmark_end())

Simple test code:

import docx
test = docx.Document()
par = test.add_paragraph()
test_run = par.add_run()
test_run.add_text('Test')
test_run.add_bookmark()
test.save('Test_bookmark.docx')

Benjamin-T commented 7 years ago

Feature

I've never worked with behave, but this is my first shot at a feature text:

The feature text:

 Feature: Add bookmark to a run
  In order to add a bookmark to my run
  As a developer using python-docx
  I need a way to add a bookmark to a run
   Scenario: Add a bookmark
    Given a run
     When I add a bookmark
     Then the bookmark appears in that location in the run.

Question: How do i actually run the tests, and do I have to submit the feature including the updated steps\text.py file?

After some googling, I understand running tests is done like this:

Functional tests:

Go to directory docx\features
run behave from commandline

Unittests:

Go to directory docx\tests
run pytest from commandline. (I do get some errors when running pytests, I'm running windows 10, anybody else getting errors?)

To Do:

[ ] Investigate the behavior of current code and get comments
[ ] Create a behave test for the add_bookmark method
[ ] Create a unittest testing all the underlying implications of the bookmark

scanny commented 7 years ago

Regarding how to add an element, in python-docx, each new element type is represented by its own custom element class.

Each custom element class is named after the XML Schema type definition it represents. For example, a <w:r> (run) element is defined in the schema by definition named CT_R ("CT_" standing for "complex type"), so its custom element class is named CT_R.
Each tag name is mapped to its custom element class in pptx/oxml/__init__.py. More than one tag name can be mapped to each custom element class.
The custom element class is located with the other element classes it coheres most strongly with. This is a matter of judgement but I rarely find myself with much doubt about where one should go (or where to look for it later).
Each custom element class subclasses BaseOxmlElement. This is the metaclass hierarchy that allows standard properties and methods to be defined declaratively, for example with ZeroOrOne() or OptionalAttribute(). There are a lot of examples of these throughout the pptx.oxml subpackage, let me know with any specific questions after exploring some of those to see the patterns.
The metaclass provides standard methods and properties such as get_or_add_{x}(), _remove_{x}() and {x} (where x is the name of the class variable to which the results of say ZeroOrOne(...) is assigned to in a custom element class definition. More on this later if you need it.
Custom element classes are not tested in isolation (except in certain unusual cases); rather they are tested integrated with the code that calls them by providing starting XML and expected XML. You'll see many examples of this in the unit tests. Look for calls to element() and xml() in the unit test fixtures; these are setting up before and after element/XML objects respectively.

scanny commented 7 years ago

Regarding the API/protocol for this, we'll need to answer a couple XML Schema-related questions, but the call to add a bookmark will definitely not be on a Run object (a bookmark element is a peer to a run, not its child). It could be on a Paragraph object, but may need to go higher, depending on what the possible "parent" elements of a bookmark start and end are.

I'd be inclined (but still in brainstorming mode) to something like:

bookmark = paragraph.start_bookmark('my bookmark name')
# ... add more runs, maybe more paragraphs
same_or_later_paragraph.end_bookmark(bookmark)

A couple notes:

I'm thinking the object named "bookmark" above is actually something like a BookmarkMemo object, or something like that. I don't believe it will actually be a bookmark. I think we will have bookmarks later, perhaps something like Document.bookmarks, where you want to look one up by name or something and then do things with it. But that's what analysis is for, we'll see how that develops as the rest of the bookmarking API emerges.
I'll bet the bookmark IDs have to be unique, but the user shouldn't have to keep track of those. We'll just assign the next available unique one on the .start_bookmark() call. One virtue of the .end_bookmark(bookmark) interface is the ID is available from bookmark without the caller ever having to know what it is.
You'll need to search the XML Schema to see what the possible parents of bookmark elements are. I'm willing to bet it's not just paragraphs.

scanny commented 7 years ago

Regarding acceptance tests, these can only exercise the API, and therefore can only test what the API allows us to detect.

A consequence of this is we need the "read" API before (usually just before) we need the "write" API.

So you probably want to be thinking of how you find bookmarks in a document and what one would want to do with them once found.

This is about the time I start taking a close look at the Microsoft Visual Basic API for Word to see how they're handled there. A Google search on 'vba word bookmark' is also a worthwhile exercise, to get a sense for the VBA calling protocol for bookmarks. This will often uncover unexpected complexities that would be very expensive to discover late in the process.

https://msdn.microsoft.com/en-us/VBA/Word-VBA/articles/object-model-word-vba-reference

Benjamin-T commented 7 years ago

First of all, thanks for the detailed response @scanny. I did not give the general implementation much thought, but I think it makes this project a bit more challenging. I've briefly looked into the XML reference, and you're right, there are a lot of different parent objects.

WordprocessingML - documentation

Apparently, the bookmark XML predates the concepts of the XML: Within a WordprocessingML document, bookmarks refer to arbitrary regions of content which are bounded and have a unique name associated with them. Because bookmarks are a legacy word processing function which predates the concepts of XML and wellformedness, they can start and end at any location within a document's contents and therefore must use the "cross-structure" annotation format described in §2.13.2. This makes this certainly an even more interesting challenge, the only thing that is important is that a bookmarkStart element is joined with a bookmarkEnd element with an equal 'id'-attribute value.

They present the following example, of a bookmark spanning a paragraph:

<w:p>
  <w:r>
    <w:t>Example</w:t>
  </w:r>
  <w:bookmarkStart w:id="0" w:name="sampleBookmark" />
  <w:r>
    <w:t xml:space="preserve"> text.</w:t>
  </w:r>
</w:p>
<w:p>
  <w:r>
    <w:t>Example</w:t>
  </w:r>
  <w:bookmarkEnd w:id="0" />
  <w:r>
    <w:t xml:space="preserve"> text.</w:t>
  </w:r>
</w:p>

It seems that therefore the element should be available at 'any' point. The document lists the possible parent objects as well. Its a large list at first, but a lot of it isn't implemented (yet) in the docx package. I've listed the possible parents, and grouped them a bit, sorted them based on the level and likelihood of implementation. - Ironically, the run object is practically the only object in the XML structure that isn't a possible parent of the bookmark object.

Element	Name	Element Included?
body	body	yes
p	paragraph	yes
tbl	Table object	yes
tc	Table Cell	yes
tr	Table Row	yes
endnote	Endnote Content	no
footnote	footnote content	no
ftr	footer content	no
hdr	header content	no
hyperlink	hyperlink	no
fldSimple	Simple field, determined by fieldcode	no
comment	comment	no
del	deleted run content	no
ins	inserted run content (like tracked changes)	no
moveFrom	Move Source Run Content (i.e. tracked changes)	no
moveTo	Move Destination Run Content (i.e. tracked changes)	no
customXml	cell level custom xml	no
customXml	row level custom xml	no
customXml	inline-level custom xml	no
customXml	block level custom xml	no
deg	math sign, degree	no
den	denominator (math)	no
e	specified base arg. of math func	no
lim	math, function name limit	no
num	math, numerator	no
oMath	math, Office Math	no
fName	math, function apply obj. like sin, cos	no
sub	Subscript (Pre-Sub-Superscript)	no?
sup	Superscript (Superscript function)	no?
docPartBody	Contents of Glossary Document Entry	no
txbxContent	Rich Text Box Content Container	no
smartTag	specifies presence of a tag around inline structureslno
sdtContent	Block-Level Structured Document Tag Content	no?
sdtContent	Cell-Level Structured Document Tag Content	no?
sdtContent	Row-Level Structured Document Tag Content	no?
sdtContent	Inline-Level Structured Document Tag Content	no?
rt	Phonetic Guide Text	no
rubyBase	Phonetic Guide Base Text	no

Requirements:

For now just thought of two, probably more.

Must have an unique name
Must have a matching id-attribute.

Custom Element Class

For the bookmark to work, we would need two custom element classes: CT_BookmarkStart and CT_BookmarkEnd.

Benjamin-T commented 7 years ago

I've renamed the feature to add_bookmark only. As it looks like it's a better idea to first fully focus on the correct implementation of the bookmark object, and afterwards see how things like captions and cross references interact with the bookmarks.

More detailed VBA function analysis

Like you recommended, I've investigated the VBA bookmark documentation. This provided some more background and insight about what kind of behavior and functionality is expected from the bookmark object.

Bookmarks

A collection of bookmark object that represent the bookmarks in the specified selection, range, or document. Also, the object returns the bookmark corresponding to the index, which is either a key or a number. The numbered index can be used, and the named indices can apparently be sorted alphabetically.
There is also a set of predefined bookmarks - not sure whether these can be accessed by the package.

Bookmark methods:

There are three methods described:

Copy - copy method of the bookmark, requires a new name as input, which will subsequently be the name of the copied bookmark.
Delete - Deletes the bookmark from the bookmarks object.
Select - Selects the part spanned by the bookmark. (is this even possible to implement in this package?) The select method returns a 'Selection' object, which in turn can be used to apply certain formatter.

Bookmark properties:

There is a total of ten properties:

Application - Returns the word application
Column - Boolean, True if specified bookmark is a table column.
Creator - Integer indicating the application which was used to create the object.
Empty - True if specified bookmark is empty
End - Returns or sets the ending character position of a selection, range, or bookmark. Read/write link
Name - Returns the name of the bookmark
Parent - Returns the parent object of the specified bookmark object.
Range - Returns a Range object that represents the portion of a document that's contained in the specified object.
Start - Returns or sets the starting character position of a bookmark. Read/write
StoryType - Returns the story type for the specified range, selection, or bookmark. Read-only

Properties Included or Omitted

I think features 1, 3, 10 can be omitted, and I'm unsure about 8. This leaves features, 2, 4, 5, 6, 7, 8, 9. Which is already a nice list.

Correct me if I'm wrong, but can I conclude from this that there should be some Container object for bookmarks?

Benjamin-T commented 7 years ago

Bookmarks - Object

Continuing on the bookmarks 'container' object, it has the following properties and methods:

Properties

Application - The word application
Count - The total number of items in the bookmarks collections (read only)
Creator - Application integer
Sorting - Sort by location, or sort alphabetically (It should only be a mapping; " This property doesn't affect the order of Bookmark objects in the Bookmarks collection." click
Parent - Returns the object corresponding to the specifeid bookmarks collection (unclear what is exactly meant by this, does it return the parent object of a single bookmark of from all the bookmarks?)
ShowHidden - True if hidden bookmarks are included in the bookmarks collection, Read/Write boolean
Methods:
Add - Returns a Bookmark object that represents a bookmark added to a range .Add( Name, Range) Name: The name of the bookmark. The name cannot be more than 40 characters or include more than one word. Range: The range of text marked by the bookmark. A bookmark can be set to a collapsed range (the insertion point)
Exists - Determines whether the specified bookmark exists. Returns True if the bookmark exists.
Item - Returns an individual bookmark object in a collection (takes an index as input)

scanny commented 7 years ago

Okay, regarding the parentage aspect, I'm thinking the key parent elements to start with are w:body and w:p. Maybe the table ones come in later, I'll have to noodle that a bit more, but table cells contain paragraphs, so you'd already be able to put one inside a cell, and I can't think of any super-common uses for a bookmarked cell, row, or table at the moment.

One possibility is to implement the bookmark functionality as a mixin, so it can be added to multiple "locations" without duplicating the code. That would also provide a uniform interface, regardless of the parent object. Maybe we call that BookmarkParentMixin or maybe something a little better. That approach would also make it a lot easier to add different "parents" later on.

scanny commented 7 years ago

Regarding the MS-API for bookmarks:

Good call on limiting scope here to bookmarks. That will be plenty of scope for the time being I think :)
It looks like the MS-API makes a Bookmarks collection available on all of Document, Range, and Selection. I think we only need Document for now. We don't have the concept of Range or Selection in python-docx. (Range we may end up having at some point, Selection probably not because it corresponds to the state of a running Word application.)
I don't see an immediate need for a .copy() method. I'm open to opposing views, but I don't see the immediate use case.
I would place .delete() or perhaps .remove() in the "maybe later" category, unless you see a clear use case for it. It doesn't seem like it would be hard to implement, but I think we want to jealously limit scope for the time being. These things have a tendency to grow beyond one's ambition to complete them :)
A .select() method doesn't make any sense for python-pptx for the "state in running application" reason I mentioned earlier.
Of the ten properties, I would focus only on .name, possibly also including .parent, and maybe .empty if we think anyone cares about those two. .Application, .Creator, .End, .Range, and .Start all refer to concepts that don't exist in python-docx. .StoryType might be possible but I just don't see an active use for that. So my list shortens to 6, with 4 and 7 as possibly nice-to-haves.
The lack of a range concept in python-docx though, leaves use without any interface to get to the "contents" of a bookmark. That probably bears some thought. We could implement Bookmark.paragraphs, although we'd have to work out what to do with a "partial" first or last paragraph. This one should go on the open noodle list.
There definitely needs to be a Bookmarks object, available, I'm thinking, on Document.bookmarks. I'm inclined to believe it would need hybrid semantics, like some combination of sequence and mapping (list and dict). It would need to be iterable, it would have a length (support .len()), and I'm thinking indexed access (e.g. bookmarks[7]) would have sequence semantics. Key lookup (by name) could be with a .get() method, with the same interface as that method on a dict (e.g. optional second parameter to specify default return value). We could think about a "lookup by ID" method if we thought that would be needed, maybe .get_by_id(id). Along those lines, the analysis should determine whether bookmark names are enforced unique and what the lookup behavior is when two have the same name. I think it's safe to assume bookmark IDs are unique.

scanny commented 7 years ago

Regarding the MS-API for the Bookmarks collection:

.Application and .Creator are not applicable. .Creator could possibly be used, like somehow assigning an identifier specific to python-docx, but we've never done that before and I don't see the compelling use case for it so far.
Bookmarks.Count maps to len(Bookmarks), of course.
On sorting, I'm not seeing the immediate need for that. I'd be inclined to just skip that as someone who needed it could easily do it themselves. Bookmark objects would already appear in document order; if they wanted them by name that would be two or three lines of client code for the caller.
.Parent we can skip as the parent is always the Document object so far, which they just used to get the Bookmarks object.
.ShowHidden I don't think we can do because we don't track hidden parts of the document (and I'm not exactly sure what hidden status actually is).

As for the methods:

.Add() we can't do because we don't have ranges. We'll have the .start_bookmark() and .end_bookmark() methods folks can use for this job (or whatever we end up deciding on).
.Exists() - I'm not sure of the semantics for this one, but if it's just finding out if a bookmark with a certain name exists, folks can do that easily enough with something like:
```
bookmark = Bookmarks.get(name)
if bookmark is None:
  # do False thing
else:
  # do True thing
```
.Item() is just indexed access.

So in short, I think we're good with a Bookmarks object that has the behaviors I outlined in the last bullet of the previous comment.

Benjamin-T commented 7 years ago

Allright, summarizing: If I understand correctly we're going to develop a Bookmarks object, which will contain a bookmark object.

the Bookmark object will initially only contain the following properties / attributes:

Bookmark - Object

Attributes / properties

Name - Returns the name of the bookmark
Start - Returns or sets the starting character position of a bookmark. Read/write
End - Returns or sets the ending character position of a selection, range, or bookmark. Read/write link

These could be added in the 'future'.
Empty - True if specified bookmark is empty
Parent - Returns the parent object of the specified bookmark object.
Column - Boolean, True if specified bookmark is a table column.

The empty property could come in handy when one wants to refer to the text in the bookmark. Both the parent and column properties can be included when the bookmark is extended to include the table, since this introduces a possible differen parent? (in its current form, it will always be the document, but in the case of the table, it will be the table object, right? )

Methods / setters

start_bookmark
end_bookmark
bookmark text - not sure where to put this, but the bookmark can contain a specific text.

Bookmarks - Object

The bookmarks object will be placed within the document object.

Attributes / properties:

Count property len()
Exists might be required as duplicated bookmarks can result in unwanted behaviors.

methods

add_bookmark - returns a bookmark object which has the startstart_bookmark and end_bookmark methods.
three get-methods:
- get_by_index
- get_by_name
- get_by_id() -- not sure whether word edits this, and what kind of added value this method has.

I'm currently trying to develop some kind of basic version of the above.

Benjamin-T commented 7 years ago

Without thinking about the python-docx package too much, I had something like this in mind. I'm familiar with a lot of the super() and metaclass concepts, therefore implementing this will be a challenge ;)

from docx.oxml.ns import qn
from docx.oxml import OxmlElement

class Bookmark(object):
    def __init__(self):
        self.bmrk_start = OxmlElement('w:bookmarkStart')
        self.bmrk_end = OxmlElement('w:bookmarkEnd')
        self.name = None
        self.id = None

    def set_id(self, id):
        self.id = id
        self.bmrk_start.set(qn('w:id'), str(id))
        self.bmrk_end.set(qn('w:id'), str(id))
        return self

    def set_name(self, name):
        self.name = name
        return self

    def start_bookmark(self):
        return self.bmrk_start

    def end_bookmark(self):
        return self.bmrk_en

class Bookmarks(object):
    def __init__(self):
        self.bookmarks = []
        self.mapping = []

    def __iter__(self):
        for bookmark in self.bookmarks:
            yield bookmark

    @property
    def count(self):
        return len(self.bookmarks)

    def add_bookmark(self, name):
        idx = self.count + 1
        bm = Bookmark().set_id(idx)
        bm.set_name(name)
        self.bookmarks.append(bm)
        self.mapping.append(name)
        return bm

    def get_bookmark(self, name):
        if name in self.mapping:
            return self.mapping.index(name)

    def item(self, idx):
        return self.bookmarks[idx]

scanny commented 7 years ago

Okay, so lets start with your proposed interface for Bookmark.

I don't see how you're going to implement a .start or .end property, at least not with the same semantics as those in the MS Bookmark object. These property values are character positions in the document, meaning an integer count of the document characters before that location (e.g. 1042). This would require exhaustively visiting each prior run to determine the number of characters it contains. It would also require analysis to figure out which runs should be skipped, like maybe they're hidden or deleted or whatever and maybe those characters count or maybe they don't. Also you'd need to work out how multi-byte characters are counted.

Implementation aside though, I'm not sure what anyone would do with those values if they were available. I think these two are clearly on the "skip" list.
I'm not sure I buy your argument about having a .empty property, but it does make me think it could be very useful to have a .text property. This might be enough of a first capability to support the common use cases we can foresee without getting into the whole .Range challenge.
Overall, I think this is the moment we really need to hold tightly to the YAGNI principle (you aren't going to need it). Rather we should be driven by concrete use cases and how they can be supported. Whenever we utter the phrase "would be handy" or "would be nice to have" we need to be suspect. If we keep things simple, we can always come back later to elaborate things. If we bite off too much it's very likely to bog down and stall.
With that in mind, I think the appropriate step here is to explore the protocol, which is basically the calling sequence to get something useful done. Here's an example for dropping a bookmark at the beginning of a document:
```
>>> document = Document()
>>> bookmark = document.start_bookmark('Planted Flag')
>>> document.add_paragraph('We wuz here!')
>>> document.end_bookmark(bookmark)
>>> bookmark.bookmark_id
1
>>> bookmark.name
'Planted Flag'
>>> bookmark.text
'We wuz here!'
```
Although we've pushed the captions and so on out of scope for this implementation, perhaps they're good use cases we can use to explore the proposed interfaces we're working on here.

scanny commented 7 years ago

Regarding implementation details, let's start at the "outside" with Document.bookmarks:

class Document(...):

    # ---other properties and methods ...

    @lazyproperty
    def bookmarks(self):
        return Bookmarks(self._element)

I don't know if you're familiar with the @lazyproperty decorator. It's not an official one, but I do find that I use it in all my projects. It's defined in docx.shared or someplace like that. It borrows from functional programming, like Haskell, that this property is a fixed (immutable) value, but that value is only calculated on first use (hence the term "lazy"). This has a number of virtues, but an important one is that it provides a way to avoid doing "real work" in the constructor (__init__() method). In this case notice there's no self.bookmarks = [] or whatever in the constructor.

Also very important here is that we avoid primitive obsession, meaning implementing a bookmarks collection using a primitive object like list (or dict). There are a lot of good reasons for this that Google will look up for you by that term.
The Bookmarks object needs a reference to the <w:document> element so it can locate and iterate over all the bookmark elements, etc. That's what self._element is.

So on to the Bookmarks class:

from collections import Sequence

class Bookmarks(Sequence):

    def __init__(self, document_elm):
        super(Bookmarks, self).__init__()
        self._document = self._element = self.document_elm

    def __iter__(self):
        for bookmarkStart in self._bookmarkStarts:
            yield Bookmark(bookmarkStart)

    def __getitem__(self, idx):
        bookmarkStart = self._bookmarkStarts[idx]
        return Bookmark(bookmarkStart)

    def __len__(self):
        return len(self._bookmarkStarts)

    def get(self, name, default=None):
        for bookmarkStart in self._bookmarkStarts:
            if bookmarkStart.name == name:
                return Bookmark(bookmarkStart)
        return default

    @property
    def _bookmarkStarts(self):
        return self._document.xpath('.//w:bookmarkStart')

Bookmarks subclasses (built-in) Sequence, which gives it many of the nice-to-have sequence behaviors like slicing, reversed, contains, etc.
The Sequence base class requires that we implement .__getitem__() and .__len__(). I've added __iter__() just because the default implementation isn't terrifically performant.
We won't have a .count() method, just because it's more pythonic to use the built-in len() function and we need the .__len__() method anyway, although the pythonic argument is compelling enough for me by itself in this case.
I'm not seeing why we would put an .add_bookmark() method on Bookmarks. We have the start and end bookmark methods on Document and I don't immediately see another way to add a bookmark, unless we have something like .bookmark_paragraph(paragraph) or .bookmark_run(run) or something. This probably bears further discussion as the start and end interface only works at the end of the document (when generating) and some folks are likely to want to insert a bookmark in the middle somewhere.
I've used .get() instead of .get_bookmark(). .get() is a standared dict interface method and we're giving it the same behaviors here, so better to stick to convention (principle of least surprise).
The .item(n) method is particular to Visual Basic. Here in Python we just use bookmarks[n] to accomplish the same thing. This functionality is provided by the .__getitem__() magic method.

I'm out of time right now, let's leave it there and let me know how you digest this chunk, then we can get into the Bookmark class and what the oxml layer beneath it looks like. I think you'll find that pretty elegant and easy to understand once you see it more closely.

Benjamin-T commented 6 years ago

First of all, thanks for all the ideas and input. I really feel like I'm learning a lot here. And I really appreciate you taking the time to explain me some of the concepts.

I know its a bit too soon, but I really wanted to give it a go, so I've tried to create the CT_BookmarkStart and CT_bookmarkEnd elements. I was not sure where to put them, so I created the bookmarks.py file.

If you could please take a look and see if I'm on the right track? I really struggled with understanding how these CT_.... objects were supposed to be created. So I ended up with the ones I've commited. The code works, but that's not the only goal here ;)

github-link

With regard to the Bookmarks object, I really think the first try you posted is an elegant way to create a list-like object with a lot of nice functionalities.
I haven't taken a look at the lazyproperty decorator yet.
I also haven't tried to actually implement your first Bookmarks object. Maybe I get around doing this upcoming week.

scanny commented 6 years ago

If you want to submit code for review, a pull request (PR) is the best way to do it. You might need to to a little research first. The key thing is to be able to make and submit changes without creating a new pull request each time. Basically, the way you do this is creating a new branch, based on master (on your fork) and making that branch the base of the PR. When you make changes, you rebase that branch to suit, and the PR is automatically updated (when you push that branch).

If you're not familiar with rebasing with Git you'll need to study up; it's an essential skill for this kind of interaction.

Then I can make comments directly on the code changes you submitted; you can then make changes, rebase the branch and push, and submit the revised code.

Regarding your first stab:

I think bookmark.py is fine for now. We might change it to fields.py or something later, depending on what else ends up in there, but a renaming is easily done once things become more clear.
You'll want the XML Schema extract in the analysis document before you do too much custom element class definition. The custom element class is essentially an implementation of the XML Schema definition for that/those XML elements. I can't tell if it's correct without reference to the related XML Schema excerpts.
The class names should exactly match their XML Schema counterparts. So it's CT_BookmarkStart, not CT_bookmarkStart.
You don't need set_x methods on a custom element class. The metaclass adds the properties and methods you need just by you adding the x = OneOrMore('p:x', ...) etc. class variables. So if you want to set the id value, you just sah bookmarkStart.id = 42. This is alot like how SQLAlchemy and Django work in their declarative aspects and represents the magic of MetaClasses.
I'm not sure what you intended with the _bookmark() methods, but those don't need to be there. At this point, neither of those two classes should have any properties or methods other than those they get from the metaclass.

I'd recommend you focus on getting the pull request bit working and show me where you are on the analysis document, in particular the XML Schema excerpts, and then we can come back to these two definitions.

Btw, there's nothing wrong (in my mind) with jumping into the implementation, just to get straight in your mind how it's all going to work. I do that all the time (I call it 'spiking'). The only problem is thinking you're done once you have it working. That's where most folks stop and that's why they don't get a merge. From there you are in a good position to develop the acceptance and unit tests, flesh out the documentation required, and form it into a series of coherent commits that can be confidently committed without a lot of rework. All those things are required to actually add the feature without degrading the existing code base.

Benjamin-T commented 6 years ago

Allright! thanks for the feedback, I guess i have some studying to do.

I got rid of the set methods, however, the CT_BookmarkEnd and CT_BookmarkStart names do not rename to 'w:bookmarkEnd' and 'w:bookmarkStart'. But I dont understand why and how I can change this?

class CT_BookmarkStart(BaseOxmlElement):
    """
    ``<w:bookmarkStart>`` element
    """
    bookmark = OneOrMore('w:bookmarkStart', ST_String)
    name = RequiredAttribute('w:name', ST_String)
    bmrk_id = RequiredAttribute('w:id', ST_RelationshipId)

    def add_bookmark_start(self, id, name='test_bookmark'):
        bmrk = CT_BookmarkStart()
        bmrk.bookmark = 'w:bookmarkStart'
        bmrk.bmrk_id = str(id)
        bmrk.name = name
        self.append(bmrk)
        return bmrk

This leads to the following xml:

    <w:p>
      <w:r>
        <w:t>Hello</w:t>
      </w:r>
      <CT_BookmarkStart w:id="2" w:name="text_to_be_bookmarked" />
      <w:r>
        <w:t>Hello</w:t>
      </w:r>
      <CT_BookmarkEnd w:id="2" />
    </w:p>

Clearly Im overlooking something...

Benjamin-T commented 6 years ago

After hours of puzzeling and camparing code, I got it working! the 'catch' was 'registering' the bookmarkStart elements as xml elements using the self._add_bookmarkstart() method in the paragraph object.

I'also created a pull request like you asked; I purposly committed the 'removed obsolete code' commit to check if the pull request is automatically updated with the new code. Let me know if I need to change some settings.

the \oxml\paragraph.py file:

class CT_P(BaseOxmlElement):
    """
    ``<w:p>`` element, containing the properties and text for a paragraph.
    """
    pPr = ZeroOrOne('w:pPr')
    r = ZeroOrMore('w:r')
    bookmarkStart = ZeroOrMore('w:bookmarkStart')
    bookmarkEnd = ZeroOrMore('w:bookmarkEnd')

the text\paragraph.py file:

from __future__ import (
    absolute_import, division, print_function, unicode_literals
)

from ..enum.style import WD_STYLE_TYPE
from .parfmt import ParagraphFormat
from .run import Run
from ..shared import Parented

from docx.shared import ElementProxy
from docx.oxml.bookmark import CT_BookmarkStart, CT_BookmarkEnd 

class Paragraph(Parented):
    """
    Proxy object wrapping ``<w:p>`` element.
    """
    def __init__(self, p, parent):
        super(Paragraph, self).__init__(parent)
        self._p = self._element = p

    def add_bookmark_start(self, id, name):
        bmrk = self._p._add_bookmarkStart()
        bmrk.name = name
        bmrk.bmrk_id = str(id)
        return self._p.append(bmrk) 

    def add_bookmark_end(self, id):
        bmrk = self._p._add_bookmarkEnd()
        bmrk.bmrk_id =str(id)
        return self._p.append(bmrk)

and the elements:

class CT_BookmarkStart(BaseOxmlElement):
    """
    ``<w:bookmarkStart>`` element
    """
    name = RequiredAttribute('w:name', ST_String)
    bmrk_id = RequiredAttribute('w:id', ST_RelationshipId)

class CT_BookmarkEnd(BaseOxmlElement):
    """
    The ``<w:bookmarkEnd>`` element
    """
    bmrk_id = RequiredAttribute('w:id', ST_RelationshipId)

Benjamin-T commented 6 years ago

Since I finally got a half decent version working, I'll start focussing more on the development of the analysis documentation. Also I just found out that 'location' of the feature is within the CT_Body and not in the CT_P I've currently placed it.

The reason I placed it in the paragraph type is that the bookmarks are placed in a paragraph. Therefore i assumed that that would be the logical location.

This is interesting; the element type of the bookmarkEnd element is CT_MarkupRange

        <xsd:element name="bookmarkStart"               type="CT_Bookmark"/>
        <xsd:element name="bookmarkEnd"                 type="CT_MarkupRange"/>

  <xsd:complexType name="CT_Body">  <!-- denormalized -->
    <xsd:sequence>
      <xsd:choice minOccurs="0" maxOccurs="unbounded">
        <xsd:element name="p"                           type="CT_P"/>
        <xsd:element name="tbl"                         type="CT_Tbl"/>
        <xsd:element name="sdt"                         type="CT_SdtBlock"/>
        <xsd:element name="customXml"                   type="CT_CustomXmlBlock"/>
        <xsd:element name="altChunk"                    type="CT_AltChunk"/>
        <xsd:element name="proofErr"                    type="CT_ProofErr"/>
        <xsd:element name="permStart"                   type="CT_PermStart"/>
        <xsd:element name="permEnd"                     type="CT_Perm"/>
        <xsd:element name="bookmarkStart"               type="CT_Bookmark"/>
        <xsd:element name="bookmarkEnd"                 type="CT_MarkupRange"/>
        <xsd:element name="moveFromRangeStart"          type="CT_MoveBookmark"/>
        <xsd:element name="moveFromRangeEnd"            type="CT_MarkupRange"/>
        <xsd:element name="moveToRangeStart"            type="CT_MoveBookmark"/>
        <xsd:element name="moveToRangeEnd"              type="CT_MarkupRange"/>
        <xsd:element name="commentRangeStart"           type="CT_MarkupRange"/>
        <xsd:element name="commentRangeEnd"             type="CT_MarkupRange"/>
        <xsd:element name="customXmlInsRangeStart"      type="CT_TrackChange"/>
        <xsd:element name="customXmlInsRangeEnd"        type="CT_Markup"/>
        <xsd:element name="customXmlDelRangeStart"      type="CT_TrackChange"/>
        <xsd:element name="customXmlDelRangeEnd"        type="CT_Markup"/>
        <xsd:element name="customXmlMoveFromRangeStart" type="CT_TrackChange"/>
        <xsd:element name="customXmlMoveFromRangeEnd"   type="CT_Markup"/>
        <xsd:element name="customXmlMoveToRangeStart"   type="CT_TrackChange"/>
        <xsd:element name="customXmlMoveToRangeEnd"     type="CT_Markup"/>
        <xsd:element name="ins"                         type="CT_RunTrackChange"/>
        <xsd:element name="del"                         type="CT_RunTrackChange"/>
        <xsd:element name="moveFrom"                    type="CT_RunTrackChange"/>
        <xsd:element name="moveTo"                      type="CT_RunTrackChange"/>
        <xsd:element ref="m:oMathPara"/>
        <xsd:element ref="m:oMath"/>
      </xsd:choice>
      <xsd:element name="sectPr" minOccurs="0" maxOccurs="1" type="CT_SectPr"/>
    </xsd:sequence>
  </xsd:complexType>

Benjamin-T commented 6 years ago

from the wml.xsd excerpts:

  <xsd:complexType name="CT_BookmarkRange">
    <xsd:complexContent>
      <xsd:extension base="CT_MarkupRange">
        <xsd:attribute name="colFirst" type="ST_DecimalNumber" use="optional">
          <xsd:annotation>
            <xsd:documentation>First Table Column Covered By Bookmark</xsd:documentation>
          </xsd:annotation>
        </xsd:attribute>
        <xsd:attribute name="colLast" type="ST_DecimalNumber" use="optional">
          <xsd:annotation>
            <xsd:documentation>Last Table Column Covered By Bookmark</xsd:documentation>
          </xsd:annotation>
        </xsd:attribute>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
  <xsd:complexType name="CT_Bookmark">
    <xsd:complexContent>
      <xsd:extension base="CT_BookmarkRange">
        <xsd:attribute name="name" type="ST_String" use="required">
          <xsd:annotation>
            <xsd:documentation>Bookmark Name</xsd:documentation>
          </xsd:annotation>
        </xsd:attribute>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <xsd:group name="EG_RangeMarkupElements">
    <xsd:choice>
      <xsd:element name="bookmarkStart" type="CT_Bookmark">
        <xsd:annotation>
          <xsd:documentation>Bookmark Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="bookmarkEnd" type="CT_MarkupRange">
        <xsd:annotation>
          <xsd:documentation>Bookmark End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="moveFromRangeStart" type="CT_MoveBookmark">
        <xsd:annotation>
          <xsd:documentation>Move Source Location Container - Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="moveFromRangeEnd" type="CT_MarkupRange">
        <xsd:annotation>
          <xsd:documentation>Move Source Location Container - End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="moveToRangeStart" type="CT_MoveBookmark">
        <xsd:annotation>
          <xsd:documentation>Move Destination Location Container - Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="moveToRangeEnd" type="CT_MarkupRange">
        <xsd:annotation>
          <xsd:documentation>Move Destination Location Container - End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="commentRangeStart" type="CT_MarkupRange">
        <xsd:annotation>
          <xsd:documentation>Comment Anchor Range Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="commentRangeEnd" type="CT_MarkupRange">
        <xsd:annotation>
          <xsd:documentation>Comment Anchor Range End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlInsRangeStart" type="CT_TrackChange">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Insertion Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlInsRangeEnd" type="CT_Markup">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Insertion End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlDelRangeStart" type="CT_TrackChange">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Deletion Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlDelRangeEnd" type="CT_Markup">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Deletion End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlMoveFromRangeStart" type="CT_TrackChange">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Move Source Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlMoveFromRangeEnd" type="CT_Markup">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Move Source End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlMoveToRangeStart" type="CT_TrackChange">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Move Destination Location Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="customXmlMoveToRangeEnd" type="CT_Markup">
        <xsd:annotation>
          <xsd:documentation>Custom XML Markup Move Destination Location End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
    </xsd:choice>
  </xsd:group>

scanny commented 6 years ago

Hmm, this is interesting. I haven't come across annotation and documentation strings in the schema before. Where did you pull these from?

The schema files I use are in the code tree here (and in the directory just above): https://github.com/python-openxml/python-docx/blob/master/ref/xsd/wml.xsd

scanny commented 6 years ago

Also I just found out that 'location' of the feature is within the CT_Body and not in the CT_P I've currently placed it.

The reason I placed it in the paragraph type is that the bookmarks are placed in a paragraph. Therefore i assumed that that would be the logical location.

Not sure what you mean by the "feature". This term is usually used here to indicate the API methods and properties added to implement some useful behavior.

But bookmarks appear to be part of both the body and paragraph objects and some others as well. So I'm not sure there's a clear "locale" for these, unlike most other objects in the code base.

A bookmark can be added to the body (document from the API perspective) just as well as inside a paragraph, and it's possible I expect that one can start in one place and end in the other.

So something to noodle on :)

Benjamin-T commented 6 years ago

Hmm, this is interesting. I haven't come across annotation and documentation strings in the schema before. Where did you pull these from? The schema files I use are in the code tree here (and in the directory just above): https://github.com/python-openxml/python-docx/blob/master/ref/xsd/wml.xsd

this is a really interesting document:

xml-reference

scanny commented 6 years ago

Ah, of course, the ECMA spec, I had forgotten about that. It's been a while, so I'm not sure of the current status, but when I started the project I was using the ECMA spec until I started puzzling over why certain elements weren't in there or were named differently.

The short story (at the time at least) is that Microsoft uses the ISO spec (ISO/IEC 29500).

Anyway, the ECMA spec might be useful for reference, but I believe the ISO spec is still the definitive one. I'll be looking for XML Schema excerpts from the ISO spec/schemas.

Benjamin-T commented 6 years ago

@Scanny, I've studied up on the rebasing, and I think I got it figured out. It's a really neat trick! This makes collaboration a lot easier. Normally, I was always 'worried' about the number of commits for a certain fix, but with this rebasing you can combine / and reorder and hence restructure your work afterworks.

Having said that, I've taken a stab at the feature documentation. See pull request

With regard to the objects, we still had to figure out how to link the two parts of a Bookmark (bookmarkStart and bookmarkEnd) together, if there is like a paragraph in between.

My suggestion would be to make two ways to add a bookmark;

# This adds a paragraph containing the start and stop of the bookmark
document.add_bookmark(name='test', inplace=True)

# This places a bookmark start:
bookmark_obj_1 = document.add_bookmark(name='test_1')
bookmark_obj_1.start_bookmark()

# Start some new run, with text which maybe contains another bookmark:
par = document.add_paragraph()
bookmark_obj_2 = par.add_bookmark(name'test_2')
bookmark.start_bookmark()
# In this case, the bookmark contains the text, which could be referred to
par.add_run('sometext')
bookmark_obj_2.end_bookmark()

# Add the end of the first bookmark:
bookmark_obj_1.end_bookmark()

This does mean that one has to correctly place the bookmark object to be able to succesfully add the end the bookmark. I'm not entirely sure about the exact implementation. I think it has some challenges;

The bookmark object needs to 'know' in which paragraph or run it needs to be placed.
The user needs to keep track of the exact object which contains the info about the specific bookmark. It might be more elegant to be able to end the bookmark by calling document.end_bookmark(name='test') and this places the corresponding <w:bookmarkEnd> element.

I'll try to get an example up and running.

Benjamin-T commented 6 years ago

So, I've had my first struggles with git, but somehow managed to figure it out. I think it had something to do with the fact that I use both Sourcetree and the git shell commands. I think i forgot to fetch from all remotes and that is where the fun started..

What I've tried to do, is implement the Bookmarks and Bookmark object. And make them available in the document class. Currently they do generate a bookmark, but this bookmark is placed at the end of the Body.

using this code:

doc = Document()
asdf = doc.bookmark()
new_bookmark = asdf.add_bookmark()
new_bookmark.add_bookmark_start(name='test_1', id=1)
par = doc.add_paragraph()
par.add_run('test')
new_bookmark.add_bookmark_end(id=1)
new_bookmark_2 = asdf.add_bookmark()
new_bookmark_2.add_bookmark_start(name='test_2', id=2)
new_bookmark_2.add_bookmark_end(id=2)
doc.save('tab.docx')

<w:document>
  <w:body>
    <w:p>
      <w:r>
        <w:t>test</w:t>
      </w:r>
    </w:p>
    <w:sectPr w:rsidRPr="0006063C" w:rsidR="00FC693F" w:rsidSect="00034616">
      <w:pgSz w:w="12240" w:h="15840" />
      <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0" />
      <w:cols w:space="720" />
      <w:docGrid w:linePitch="360" />
    </w:sectPr>
    <w:bookmarkStart w:name="test_1" w:id="1" />
    <w:bookmarkEnd w:id="1" />
    <w:bookmarkStart w:name="test_2" w:id="2" />
    <w:bookmarkEnd w:id="1" />
  </w:body>
</w:document>

I hope I have some time tomorrow to figure it out, I'm thinking that the bookmark should be within a paragraph instance.

regards, Ben

scanny commented 6 years ago

Okay, so here's a little about what I think the code structure will look like. These sorts of things (proxy objects) all look the same in many respects. The term proxy object is just our name for objects like Paragraph and Document that "wrap" (proxy) an XML element, providing API calls that raise the abstraction level from the lxml/oxml level up to something more suitable as a developer interface.

Let's start with the calling protocol I think we're after. This might bear more discussion, but this will do for a start:

>>> document = Document()
>>> bookmark = document.start_bookmark(name='test')
>>> document.add_paragraph('test bookmark body')
<Paragraph object at 0x...>
>>> document.end_bookmark(bookmark)

I've explicitly named the parameters, just for clarity. I expect there's no good reason not to allow them to be identified by position, e.g. .start_bookmark('test') should be allowed. Later there will also need to be Paragraph.start_bookmark() and Paragraph.end_bookmark() and perhaps others, all of which can be mixed and matched (start in any, end in same or different ). But this is a good first step, much of which will be reusable or adaptable for the other cases.

So we'd want the code above to produce this XML:

<w:document>
  <w:body>
    <w:bookmarkStart w:name="test" w:id="1"/>
    <w:p>
      <w:r>
        <w:t>test</w:t>
      </w:r>
    </w:p>
    <w:bookmarkEnd w:id="1"/>
    <w:sectPr w:rsidRPr="0006063C" w:rsidR="00FC693F" w:rsidSect="00034616">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>

The first challenge is to get the bookmarkStart and End elements to appear above the sentinel w:sectPr element. It looks like that can be done by adding this line to CT_Body in docx.oxml.document:

bookmarkStart = ZeroOrMore('w:bookmarkStart', successors=('w:sectPr',))

The implementation in docx.document._Body would be something like this::

def start_bookmark(self, name):
    bookmarkStart = self._body.add_bookmarkStart(name)
    return Bookmark(bookmarkStart)

This leaves some of the work to other objects. In particular, the bookmark id must be resolved and assigned by CT_Body.add_bookmark(). There is already an .add_bookmark() method on CT_Body, placed there by the metaclass as a consequence of the ZeroOrMore() call above. That will need to be overridden to give it that further functionality. An alternative would be to use a different name like .add_bookmarkStart_with_next_id() or something, but I typically just override. I'd have to look around for another element class that does that to what all needs to go in there.

In addition, I expect there needs to be a ._next_available_bookmark_id helper property on CT_Body that can take care of that little chore. .add_bookmarkStart() would automatically assign the next available bookmark id and then it would be available from the bookmarkStart element that's passed to the Bookmark constructor.

This may be a good place to mention that a proxy object (like Bookmark) should have no state of its own. All state is in the XML. So, for example, if you wanted to add a property Bookmark.id, when called, it would go to the XML to find out and return that value. There would be no ._id instance variable hanging around to get out of sync with the XML. Its implementation would be something like return self._element.id.

Note that .end_bookmark() is on Document, and that another method of the same name will need to be on Paragraph and anywhere else a bookmark can end. Just because it starts in Document (under the w:body element) doesn't mean it has to end there. But in either case, a bookmark can't end itself. It needs reference to a context element (e.g. w:body, w:paragraph, etc.) and so the most natural interface I think is on the proxy object for that element (e.g. Body or Paragraph respectively). I suppose an interface like Bookmark.start_on(paragraph) and Bookmark.end_on(document) is possible, but it's not screaming out to me so far as a superior design.

Does that give you something to noodle on for next steps?

scanny commented 6 years ago

@scanny, I've studied up on the rebasing, and I think I got it figured out. It's a really neat trick! This makes collaboration a lot easier. Normally, I was always 'worried' about the number of commits for a certain fix, but with this rebasing you can combine / and reorder and hence restructure your work afterworks.

It is a neat trick, isn't it! :)

I call when I'm getting the commits cleaned up "grooming" the commit history/sequence. It's not always worthwhile in every situation, but when you're collaborating it's a really nice way to make your commit(s) reflect a clear, single focus and be separately mergeable.

Benjamin-T commented 6 years ago

So I've succeeded in getting the elements above the sentinel object, but am now stuck on the item id's.

my initial thought is making the Bookmarks object available within the _Body object. This allows for getting 'new' id's. However, I'm still trying to figure out a neat solution for identifying bookmarkStarts without bookmarkEnds.

current approach:

Add a property to the Bookmarks object: _next_id. (see below, the Bookmarks class)

Struggling with finding the 'open' bookmarks: This should be a property or method of the Bookmarks class as well.

@property
def _unclosed_bookmarks(self):
closed = set(Bookmark(bmrk)._id for bmrk in self._bookmarkEnds)
open_b = set(Bookmark(bmrk)._id for bmrk in self._bookmarkStarts)
check = open_b - closed
if len(self._bookmarkStarts) == 0:
        raise ValueError('BookmarkEnd requested without a bookmarkStart present')
if len(check) == 0:
    raise ValueError('No open bookmark found')
else:
    return check.pop()

This should work, however, if there are multiple open bookmarks, there are more id's and then there is no way of identifying the right one without specifying the corresponding bookmark name.

What are your thoughts on this?

The docx.document.py code:

class Bookmark(ElementProxy):
    def __init__(self, doc_element):
        super(Bookmark, self).__init__(doc_element)
        self._element = doc_element

    @property
    def _id(self):
        return self._element.bmrk_id

    @property
    def _name(self):
        return self._element.name

class _Body(BlockItemContainer):
    """
    Proxy for ``<w:body>`` element in this document, having primarily a
    container role.
    """

    def __init__(self, body_elm, parent):
        super(_Body, self).__init__(body_elm, parent)
        self._body = body_elm
        self._parent = parent

    def clear_content(self):
        """
        Return this |_Body| instance after clearing it of all content.
        Section properties for the main document story, if present, are
        preserved.
        """
        self._body.clear_content()
        return self

    def add_bookmarkStart(self, name):
        bookmarkStart = self._body._add_bookmarkStart()
        bookmarkStart.name = name
        bookmarkStart.bmrk_id = self._parent._Bookmarks._next_id
        return Bookmark(bookmarkStart)

    def add_bookmarkEnd(self):
        open_id = self._parent._Bookmarks._unclosed_bookmark
        bookmarkEnd = self._body._add_bookmarkEnd()
        bookmarkEnd.bmrk_id = open_id
        return Bookmark(bookmarkEnd)

and the Bookmarks class:

class Bookmarks(Sequence):
    def __init__(self, document_elm):
        super(Bookmarks, self).__init__()
        self._document = self._element = document_elm

    def __iter__(self):
        for bookmarkStart in self._bookmarkStarts:
            yield Bookmark(bookmarkStart)

    def __getitem__(self, idx):
        bookmarkStart = self._bookmarkStarts[idx]
        return Bookmark(bookmarkStart)

    def __len__(self):
        return len(self._bookmarkStarts)

    def get(self, name, default=None):
        for bookmarkStart in self._bookmarkStarts:
            if bookmarkStart.name == name:
                return Bookmark(bookmarkStart)
        return default

    @property
    def _bookmarkStarts(self):
        return self._document.xpath('.//w:bookmarkStart')

    @property
    def _bookmarkEnds(self):
        return self._document.xpath('.//w:bookmarkEnd')

    @property
    def _next_id(self):
        return str(len(self._bookmarkStarts))

Benjamin-T commented 6 years ago

Hmm, I'm not sure about this solution. This checks for open bookmarks, but if there are two bookmarks open there is no way of choosing one of the two. Therefore, we might need to go for a add_bookmark_end(name='name_of_specific_bookmark')

This is the current code outline:

doc = Document()
bmark_obj = doc.bookmarks()
test = doc.start_bookmark('test_one')
par = doc.add_paragraph()
par.add_run('test')
doc.end_bookmark()
doc.start_bookmark('test_two')
par_2 = doc.add_paragraph()
par_2.add_run('test_line_2')
doc.end_bookmark()
doc.save('test.docx')

results in:

<w:document>
  <w:body>
    <w:bookmarkStart w:name="test_one" w:id="1" />
    <w:p>
      <w:r>
        <w:t>test</w:t>
      </w:r>
    </w:p>
    <w:bookmarkEnd w:id="1" />
    <w:bookmarkStart w:name="test_two" w:id="2" />
    <w:p>
      <w:r>
        <w:t>test_line_2</w:t>
      </w:r>
    </w:p>
    <w:bookmarkEnd w:id="2" />
    <w:sectPr w:rsidRPr="0006063C" w:rsidR="00FC693F" w:rsidSect="00034616">
      <w:pgSz w:w="12240" w:h="15840" />
      <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0" />
      <w:cols w:space="720" />
      <w:docGrid w:linePitch="360" />
    </w:sectPr>
  </w:body>
</w:document>

scanny commented 6 years ago

We should probably keep all the conversation in the pull request going forward, so it's all in one piece. I'll respond to these here but after that, let's do any new ones in the Conversations tab of the pull request :)

scanny commented 6 years ago

Okay, I don't completely understand your code for the "finding the ids" bit, but I'm thinking the solution is simpler than you think, basically that we use XPath to find the ids when we need them (and that we don't store them anywhere outside the XML).

I'm also not sure we need an "unclosed" test. I'm open to argumentation to the contrary, but I'm not seeing it yet. It looks to me like there are a certain number of w:bookmarkStart elements in the XML, each one represents a bookmark, and a bookmark can be either open or closed. We don't need to determine whether it's open or not unless someone cares to ask or if we need to do an operation that only works on a closed bookmark. We can certainly have an .is_closed property on Bookmark if we need it (I don't see yet that we do though) and we may very well want a ._bookmarkEnd property on Bookmark, although I haven't seen the use case for that yet (I suspect it's there).

I do think we're on the wrong side of YAGNI here, working with methods and properties we don't have a clear need for yet. At times like these, I always return to the use-case/protocol part of the analysis and narrow my focus down to something entirely concrete. Basically answer the question "As a developer using python-docx, I need to be able to {do what exactly?}"

Anyway, back to the "finding the bookmarkEnd elements" question.

So say we have a bookmarkStart and we want to find its matching bookmarkEnd. First of all, we should only do this in the oxml layer (not the proxy layer) and we'd do it something like this:

class CT_BookmarkStart(...):

    @property
    def matching_bookmarkEnd(self):
        """Return first `w:bookmarkEnd` element having matching id, None if not present."""
        root_element = self.getroottree().getroot() 
        matching_bookmarkEnds = root_element.xpath(
            './/w:bookmarkEnd[@w:id=\'%d\']' % self.id
        )
        if not matching_bookmarkEnds:
            return None
        return matching_bookmarkEnds[0]

If you're available for a call or a Google hangout, it might be a faster way to work through some of these initial concepts, let me know :)

python-openxml / python-docx

Feature: Add Bookmark #425

Bookmarks

XML:

bookmarkStart

bookmarkEnd

Initial proposal

docx.oxml.run

docx.text.run

Simple test code:

Feature

Functional tests:

Unittests:

To Do:

WordprocessingML - documentation

Requirements:

Custom Element Class

More detailed VBA function analysis

Bookmarks

Bookmark methods:

Bookmark properties:

Properties Included or Omitted

Bookmarks - Object

Properties

Methods:

Bookmark - Object

Attributes / properties

End - Returns or sets the ending character position of a selection, range, or bookmark. Read/write link

Methods / setters

Bookmarks - Object

Attributes / properties:

methods