Open Benjamin-T opened 7 years ago
Hi @Benjamin-T, yes, creating an analysis document (aka. enhancement proposal) is the right first step, and that is independently committable whether you go on to implement or not.
To do that, you should fork this repo and create a new branch based on master
called feature/captions
(or whatever is descriptive and short, but starts with feature/...
). Add the analysis document and create a pull request whenever you've gotten it far enough along. I'm happy to provide feedback.
Thanks for the reply @scanny Upon inspection of the code, I think there is a more fundamental feature required in order to make the captions and crossreferences work. They all seem to use a bookmark.
Therefore I would propose first to develop the bookmark code, and subsequently implement the captions and references.
Question: How do I add such a xml element? Should it be placed in the
class CT_R(BaseOxmlElement):
object?
Interesting link: openxml standard page 1032 - Bookmarks Quote of above document about the bookmarks:
Within a WordprocessingML document, bookmarks refer to arbitrary regions of content which are bounded and have a unique name associated with them. Because bookmarks are a legacy word processing function which predates the concepts of XML and wellformedness, they can start and end at any location within a document's contents and therefore must use the "cross-structure" annotation format described in §2.13.2. [Example: Consider the following WordprocessingML markup for two paragraphs, each reading Example Text, where a bookmark has been added spanning the second word in paragraph one and the first word in paragraph two. The bookmarkStart and bookmarkEnd elements (§2.13.6.2; §2.13.6.1) specify the location where the bookmark starts and ends, but cannot contain it using a single tag because it spans part of two paragraphs. However, the two tags are part of one group because the id attribute value specifies 0 for both. end example] The example:
<w:p>
<w:r>
<w:t>Example</w:t>
</w:r>
<w:bookmarkStart w:id="0" w:name="sampleBookmark" />
<w:r>
<w:t xml:space="preserve"> text.</w:t>
</w:r>
</w:p>
<w:p>
<w:r>
<w:t>Example</w:t>
</w:r>
<w:bookmarkEnd w:id="0" />
<w:r>
<w:t xml:space="preserve"> text.</w:t>
</w:r>
</w:p>
This element specifies the start of a bookmark within a WordprocessingML document. This start marker is matched with the appropriately paired end marker by matching the value of the id attribute from the associated bookmarkEnd element. If no bookmarkEnd element exists subsequent to this element in document order with a matching id attribute value, then this element is ignored and no bookmark is present in the document with this name. If a bookmark begins and ends within a single table, it is possible for that bookmark to cover discontiguous parts of that table which are logically related (e.g. a single column in a table). This type of placement for a bookmark is accomplished (and described in detail) on the colFirst and colLast attributes on this element. [Example: Consider a document with a bookmark which spans half of paragraph one, and part of paragraph two. The bookmarkStart element specifies the start of the region for the testing123 bookmark. This element is then linked to the bookmarkEnd element which also has an id attribute value of 0. end example]
This element specifies the end of a bookmark within a WordprocessingML document. This end marker is matched with the appropriately paired start marker by matching the value of the id attribute from the associated bookmarkStart element. If no bookmarkStart element exists prior to this element in document order with a matching id attribute value, then this element is ignored and no bookmark is present in the document with this name. [Example: Consider a document with a bookmark which spans half of paragraph one, and part of paragraph two. The bookmarkEnd element specifies the end of the region for the bookmark whose bookmarkStart element has an id attribute value of 0. In this case, this refers to the testing123 bookmark. end example]
The following WordprocessingML illustrates an example of content which fufills this constraint:
<w:p>
<w:r>
<w:t xml:space="preserve">This is sentence one.</w:t>
</w:r>
<w:bookmarkStart w:id="0" w:name="testing123"/>
<w:r>
<w:t>This is sentence two.</w:t>
</w:r>
</w:p>
<w:p>
<w:r>
<w:t xml:space="preserve">This </w:t>
</w:r>
<w:bookmarkEnd w:id="0"/>
<w:r>
<w:t>is sentence three.</w:t>
</w:r>
</w:p>
I would place the method 'add_bookmark()' in the
file:
docx.text.run.py
and the xml object in the docx.oxml.run.py
the bookmarkStart element:
from docx.oxml import OxmlElement
def add_bookmark_start(self, name='test_bookmark'):
bmrk = OxmlElement('w:bookmarkStart')
bmrk.set(qn('w:id'), '1')
bmrk.set(qn('w:name'), name)
return bmrk
The bookmarkEnd element:
def add_bookmark_end(self):
bmrk = OxmlElement('w:bookmarkEnd')
bmrk.set(qn('w:id'), '1')
return bmrk
Challenges:
In the docx.text.run.py
file the following method should be added:
def add_bookmark(self, name='test_bookmark'):
self._r.append(self._r.add_bookmark_start(name=name))
self._r.append(self._r.add_bookmark_end())
import docx
test = docx.Document()
par = test.add_paragraph()
test_run = par.add_run()
test_run.add_text('Test')
test_run.add_bookmark()
test.save('Test_bookmark.docx')
I've never worked with behave, but this is my first shot at a feature text:
The feature text:
Feature: Add bookmark to a run
In order to add a bookmark to my run
As a developer using python-docx
I need a way to add a bookmark to a run
Scenario: Add a bookmark
Given a run
When I add a bookmark
Then the bookmark appears in that location in the run.
After some googling, I understand running tests is done like this:
Regarding how to add an element, in python-docx
, each new element type is represented by its own custom element class.
Each custom element class is named after the XML Schema type definition it represents. For example, a <w:r>
(run) element is defined in the schema by definition named CT_R
("CT_" standing for "complex type"), so its custom element class is named CT_R
.
Each tag name is mapped to its custom element class in pptx/oxml/__init__.py
. More than one tag name can be mapped to each custom element class.
The custom element class is located with the other element classes it coheres most strongly with. This is a matter of judgement but I rarely find myself with much doubt about where one should go (or where to look for it later).
Each custom element class subclasses BaseOxmlElement
. This is the metaclass hierarchy that allows standard properties and methods to be defined declaratively, for example with ZeroOrOne()
or OptionalAttribute()
. There are a lot of examples of these throughout the pptx.oxml
subpackage, let me know with any specific questions after exploring some of those to see the patterns.
The metaclass provides standard methods and properties such as get_or_add_{x}()
, _remove_{x}()
and {x} (where x
is the name of the class variable to which the results of say ZeroOrOne(...)
is assigned to in a custom element class definition. More on this later if you need it.
Custom element classes are not tested in isolation (except in certain unusual cases); rather they are tested integrated with the code that calls them by providing starting XML and expected XML. You'll see many examples of this in the unit tests. Look for calls to element()
and xml()
in the unit test fixtures; these are setting up before and after element/XML objects respectively.
Regarding the API/protocol for this, we'll need to answer a couple XML Schema-related questions, but the call to add a bookmark will definitely not be on a Run
object (a bookmark element is a peer to a run, not its child). It could be on a Paragraph
object, but may need to go higher, depending on what the possible "parent" elements of a bookmark start and end are.
I'd be inclined (but still in brainstorming mode) to something like:
bookmark = paragraph.start_bookmark('my bookmark name')
# ... add more runs, maybe more paragraphs
same_or_later_paragraph.end_bookmark(bookmark)
A couple notes:
I'm thinking the object named "bookmark" above is actually something like a BookmarkMemo
object, or something like that. I don't believe it will actually be a bookmark. I think we will have bookmarks later, perhaps something like Document.bookmarks
, where you want to look one up by name or something and then do things with it. But that's what analysis is for, we'll see how that develops as the rest of the bookmarking API emerges.
I'll bet the bookmark IDs have to be unique, but the user shouldn't have to keep track of those. We'll just assign the next available unique one on the .start_bookmark()
call. One virtue of the .end_bookmark(bookmark)
interface is the ID is available from bookmark
without the caller ever having to know what it is.
You'll need to search the XML Schema to see what the possible parents of bookmark elements are. I'm willing to bet it's not just paragraphs.
Regarding acceptance tests, these can only exercise the API, and therefore can only test what the API allows us to detect.
A consequence of this is we need the "read" API before (usually just before) we need the "write" API.
So you probably want to be thinking of how you find bookmarks in a document and what one would want to do with them once found.
This is about the time I start taking a close look at the Microsoft Visual Basic API for Word to see how they're handled there. A Google search on 'vba word bookmark' is also a worthwhile exercise, to get a sense for the VBA calling protocol for bookmarks. This will often uncover unexpected complexities that would be very expensive to discover late in the process.
https://msdn.microsoft.com/en-us/VBA/Word-VBA/articles/object-model-word-vba-reference
First of all, thanks for the detailed response @scanny. I did not give the general implementation much thought, but I think it makes this project a bit more challenging. I've briefly looked into the XML reference, and you're right, there are a lot of different parent objects.
Apparently, the bookmark XML predates the concepts of the XML:
Within a WordprocessingML document, bookmarks refer to arbitrary regions of content which are bounded and have a unique name associated with them. Because bookmarks are a legacy word processing function which predates the concepts of XML and wellformedness, they can start and end at any location within a document's contents and therefore must use the "cross-structure" annotation format described in §2.13.2.
This makes this certainly an even more interesting challenge, the only thing that is important is that a bookmarkStart element is joined with a bookmarkEnd element with an equal 'id'-attribute value.
They present the following example, of a bookmark spanning a paragraph:
<w:p>
<w:r>
<w:t>Example</w:t>
</w:r>
<w:bookmarkStart w:id="0" w:name="sampleBookmark" />
<w:r>
<w:t xml:space="preserve"> text.</w:t>
</w:r>
</w:p>
<w:p>
<w:r>
<w:t>Example</w:t>
</w:r>
<w:bookmarkEnd w:id="0" />
<w:r>
<w:t xml:space="preserve"> text.</w:t>
</w:r>
</w:p>
It seems that therefore the element should be available at 'any' point. The document lists the possible parent objects as well. Its a large list at first, but a lot of it isn't implemented (yet) in the docx package. I've listed the possible parents, and grouped them a bit, sorted them based on the level and likelihood of implementation. - Ironically, the run object is practically the only object in the XML structure that isn't a possible parent of the bookmark object.
Element | Name | Element Included? |
---|---|---|
body | body | yes |
p | paragraph | yes |
tbl | Table object | yes |
tc | Table Cell | yes |
tr | Table Row | yes |
endnote | Endnote Content | no |
footnote | footnote content | no |
ftr | footer content | no |
hdr | header content | no |
hyperlink | hyperlink | no |
fldSimple | Simple field, determined by fieldcode | no |
comment | comment | no |
del | deleted run content | no |
ins | inserted run content (like tracked changes) | no |
moveFrom | Move Source Run Content (i.e. tracked changes) | no |
moveTo | Move Destination Run Content (i.e. tracked changes) | no |
customXml | cell level custom xml | no |
customXml | row level custom xml | no |
customXml | inline-level custom xml | no |
customXml | block level custom xml | no |
deg | math sign, degree | no |
den | denominator (math) | no |
e | specified base arg. of math func | no |
lim | math, function name limit | no |
num | math, numerator | no |
oMath | math, Office Math | no |
fName | math, function apply obj. like sin, cos | no |
sub | Subscript (Pre-Sub-Superscript) | no? |
sup | Superscript (Superscript function) | no? |
docPartBody | Contents of Glossary Document Entry | no |
txbxContent | Rich Text Box Content Container | no |
smartTag | specifies presence of a tag around inline structureslno | |
sdtContent | Block-Level Structured Document Tag Content | no? |
sdtContent | Cell-Level Structured Document Tag Content | no? |
sdtContent | Row-Level Structured Document Tag Content | no? |
sdtContent | Inline-Level Structured Document Tag Content | no? |
rt | Phonetic Guide Text | no |
rubyBase | Phonetic Guide Base Text | no |
For now just thought of two, probably more.
For the bookmark to work, we would need two custom element classes:
CT_BookmarkStart
and CT_BookmarkEnd
.
I've renamed the feature to add_bookmark only. As it looks like it's a better idea to first fully focus on the correct implementation of the bookmark object, and afterwards see how things like captions and cross references interact with the bookmarks.
Like you recommended, I've investigated the VBA bookmark documentation. This provided some more background and insight about what kind of behavior and functionality is expected from the bookmark object.
A collection of bookmark object that represent the bookmarks in the specified selection, range, or document. Also, the object returns the bookmark corresponding to the index, which is either a key or a number. The numbered index can be used, and the named indices can apparently be sorted alphabetically.
There is also a set of predefined bookmarks - not sure whether these can be accessed by the package.
There are three methods described:
There is a total of ten properties:
I think features 1, 3, 10 can be omitted, and I'm unsure about 8. This leaves features, 2, 4, 5, 6, 7, 8, 9. Which is already a nice list.
Correct me if I'm wrong, but can I conclude from this that there should be some Container object for bookmarks?
Continuing on the bookmarks 'container' object, it has the following properties and methods:
.Add( Name, Range)
Name: The name of the bookmark. The name cannot be more than 40 characters or include more than one word.
Range: The range of text marked by the bookmark. A bookmark can be set to a collapsed range (the insertion point)Okay, regarding the parentage aspect, I'm thinking the key parent elements to start with are w:body
and w:p
. Maybe the table ones come in later, I'll have to noodle that a bit more, but table cells contain paragraphs, so you'd already be able to put one inside a cell, and I can't think of any super-common uses for a bookmarked cell, row, or table at the moment.
One possibility is to implement the bookmark functionality as a mixin, so it can be added to multiple "locations" without duplicating the code. That would also provide a uniform interface, regardless of the parent object. Maybe we call that BookmarkParentMixin
or maybe something a little better. That approach would also make it a lot easier to add different "parents" later on.
Regarding the MS-API for bookmarks:
Bookmarks
collection available on all of Document
, Range
, and Selection
. I think we only need Document
for now. We don't have the concept of Range
or Selection
in python-docx
. (Range
we may end up having at some point, Selection
probably not because it corresponds to the state of a running Word application.).copy()
method. I'm open to opposing views, but I don't see the immediate use case..delete()
or perhaps .remove()
in the "maybe later" category, unless you see a clear use case for it. It doesn't seem like it would be hard to implement, but I think we want to jealously limit scope for the time being. These things have a tendency to grow beyond one's ambition to complete them :).select()
method doesn't make any sense for python-pptx
for the "state in running application" reason I mentioned earlier..name
, possibly also including .parent
, and maybe .empty
if we think anyone cares about those two. .Application
, .Creator
, .End
, .Range
, and .Start
all refer to concepts that don't exist in python-docx
. .StoryType
might be possible but I just don't see an active use for that. So my list shortens to 6, with 4 and 7 as possibly nice-to-haves.python-docx
though, leaves use without any interface to get to the "contents" of a bookmark. That probably bears some thought. We could implement Bookmark.paragraphs
, although we'd have to work out what to do with a "partial" first or last paragraph. This one should go on the open noodle list.Bookmarks
object, available, I'm thinking, on Document.bookmarks
. I'm inclined to believe it would need hybrid semantics, like some combination of sequence and mapping (list and dict). It would need to be iterable, it would have a length (support .len()
), and I'm thinking indexed access (e.g. bookmarks[7]) would have sequence semantics. Key lookup (by name) could be with a .get()
method, with the same interface as that method on a dict (e.g. optional second parameter to specify default return value). We could think about a "lookup by ID" method if we thought that would be needed, maybe .get_by_id(id)
. Along those lines, the analysis should determine whether bookmark names are enforced unique and what the lookup behavior is when two have the same name. I think it's safe to assume bookmark IDs are unique.Regarding the MS-API for the Bookmarks collection:
.Application
and .Creator
are not applicable. .Creator
could possibly be used, like somehow assigning an identifier specific to python-docx
, but we've never done that before and I don't see the compelling use case for it so far.Bookmarks.Count
maps to len(Bookmarks)
, of course..Parent
we can skip as the parent is always the Document
object so far, which they just used to get the Bookmarks
object..ShowHidden
I don't think we can do because we don't track hidden parts of the document (and I'm not exactly sure what hidden status actually is).As for the methods:
.Add()
we can't do because we don't have ranges. We'll have the .start_bookmark()
and .end_bookmark()
methods folks can use for this job (or whatever we end up deciding on)..Exists()
- I'm not sure of the semantics for this one, but if it's just finding out if a bookmark with a certain name exists, folks can do that easily enough with something like:
bookmark = Bookmarks.get(name)
if bookmark is None:
# do False thing
else:
# do True thing
.Item()
is just indexed access.So in short, I think we're good with a Bookmarks
object that has the behaviors I outlined in the last bullet of the previous comment.
Allright, summarizing: If I understand correctly we're going to develop a Bookmarks
object, which will contain a bookmark object.
the Bookmark
object will initially only contain the following properties / attributes:
These could be added in the 'future'.
The empty
property could come in handy when one wants to refer to the text in the bookmark.
Both the parent and column properties can be included when the bookmark is extended to include the table, since this introduces a possible differen parent? (in its current form, it will always be the document, but in the case of the table, it will be the table object, right? )
start_bookmark
end_bookmark
bookmark text
- not sure where to put this, but the bookmark can contain a specific text.The bookmarks object will be placed within the document
object.
add_bookmark
- returns a bookmark object which has the startstart_bookmark
and end_bookmark
methods. I'm currently trying to develop some kind of basic version of the above.
Without thinking about the python-docx package too much, I had something like this in mind. I'm familiar with a lot of the super()
and metaclass concepts, therefore implementing this will be a challenge ;)
from docx.oxml.ns import qn
from docx.oxml import OxmlElement
class Bookmark(object):
def __init__(self):
self.bmrk_start = OxmlElement('w:bookmarkStart')
self.bmrk_end = OxmlElement('w:bookmarkEnd')
self.name = None
self.id = None
def set_id(self, id):
self.id = id
self.bmrk_start.set(qn('w:id'), str(id))
self.bmrk_end.set(qn('w:id'), str(id))
return self
def set_name(self, name):
self.name = name
return self
def start_bookmark(self):
return self.bmrk_start
def end_bookmark(self):
return self.bmrk_en
class Bookmarks(object):
def __init__(self):
self.bookmarks = []
self.mapping = []
def __iter__(self):
for bookmark in self.bookmarks:
yield bookmark
@property
def count(self):
return len(self.bookmarks)
def add_bookmark(self, name):
idx = self.count + 1
bm = Bookmark().set_id(idx)
bm.set_name(name)
self.bookmarks.append(bm)
self.mapping.append(name)
return bm
def get_bookmark(self, name):
if name in self.mapping:
return self.mapping.index(name)
def item(self, idx):
return self.bookmarks[idx]
Okay, so lets start with your proposed interface for Bookmark
.
I don't see how you're going to implement a .start
or .end
property, at least not with the same semantics as those in the MS Bookmark
object. These property values are character positions in the document, meaning an integer count of the document characters before that location (e.g. 1042). This would require exhaustively visiting each prior run to determine the number of characters it contains. It would also require analysis to figure out which runs should be skipped, like maybe they're hidden or deleted or whatever and maybe those characters count or maybe they don't. Also you'd need to work out how multi-byte characters are counted.
Implementation aside though, I'm not sure what anyone would do with those values if they were available. I think these two are clearly on the "skip" list.
I'm not sure I buy your argument about having a .empty
property, but it does make me think it could be very useful to have a .text
property. This might be enough of a first capability to support the common use cases we can foresee without getting into the whole .Range
challenge.
Overall, I think this is the moment we really need to hold tightly to the YAGNI principle (you aren't going to need it). Rather we should be driven by concrete use cases and how they can be supported. Whenever we utter the phrase "would be handy" or "would be nice to have" we need to be suspect. If we keep things simple, we can always come back later to elaborate things. If we bite off too much it's very likely to bog down and stall.
With that in mind, I think the appropriate step here is to explore the protocol, which is basically the calling sequence to get something useful done. Here's an example for dropping a bookmark at the beginning of a document:
>>> document = Document()
>>> bookmark = document.start_bookmark('Planted Flag')
>>> document.add_paragraph('We wuz here!')
>>> document.end_bookmark(bookmark)
>>> bookmark.bookmark_id
1
>>> bookmark.name
'Planted Flag'
>>> bookmark.text
'We wuz here!'
Although we've pushed the captions and so on out of scope for this implementation, perhaps they're good use cases we can use to explore the proposed interfaces we're working on here.
Regarding implementation details, let's start at the "outside" with Document.bookmarks
:
class Document(...):
# ---other properties and methods ...
@lazyproperty
def bookmarks(self):
return Bookmarks(self._element)
I don't know if you're familiar with the @lazyproperty
decorator. It's not an official one, but I do find that I use it in all my projects. It's defined in docx.shared
or someplace like that. It borrows from functional programming, like Haskell, that this property is a fixed (immutable) value, but that value is only calculated on first use (hence the term "lazy"). This has a number of virtues, but an important one is that it provides a way to avoid doing "real work" in the constructor (__init__()
method). In this case notice there's no self.bookmarks = []
or whatever in the constructor.
Also very important here is that we avoid primitive obsession, meaning implementing a bookmarks collection using a primitive object like list
(or dict
). There are a lot of good reasons for this that Google will look up for you by that term.
The Bookmarks
object needs a reference to the <w:document>
element so it can locate and iterate over all the bookmark elements, etc. That's what self._element
is.
So on to the Bookmarks
class:
from collections import Sequence
class Bookmarks(Sequence):
def __init__(self, document_elm):
super(Bookmarks, self).__init__()
self._document = self._element = self.document_elm
def __iter__(self):
for bookmarkStart in self._bookmarkStarts:
yield Bookmark(bookmarkStart)
def __getitem__(self, idx):
bookmarkStart = self._bookmarkStarts[idx]
return Bookmark(bookmarkStart)
def __len__(self):
return len(self._bookmarkStarts)
def get(self, name, default=None):
for bookmarkStart in self._bookmarkStarts:
if bookmarkStart.name == name:
return Bookmark(bookmarkStart)
return default
@property
def _bookmarkStarts(self):
return self._document.xpath('.//w:bookmarkStart')
Bookmarks
subclasses (built-in) Sequence
, which gives it many of the nice-to-have sequence behaviors like slicing, reversed, contains, etc.Sequence
base class requires that we implement .__getitem__()
and .__len__()
. I've added __iter__()
just because the default implementation isn't terrifically performant..count()
method, just because it's more pythonic to use the built-in len()
function and we need the .__len__()
method anyway, although the pythonic argument is compelling enough for me by itself in this case..add_bookmark()
method on Bookmarks
. We have the start and end bookmark methods on Document and I don't immediately see another way to add a bookmark, unless we have something like .bookmark_paragraph(paragraph)
or .bookmark_run(run)
or something. This probably bears further discussion as the start and end interface only works at the end of the document (when generating) and some folks are likely to want to insert a bookmark in the middle somewhere..get()
instead of .get_bookmark()
. .get()
is a standared dict
interface method and we're giving it the same behaviors here, so better to stick to convention (principle of least surprise)..item(n)
method is particular to Visual Basic. Here in Python we just use bookmarks[n]
to accomplish the same thing. This functionality is provided by the .__getitem__()
magic method.I'm out of time right now, let's leave it there and let me know how you digest this chunk, then we can get into the Bookmark
class and what the oxml
layer beneath it looks like. I think you'll find that pretty elegant and easy to understand once you see it more closely.
First of all, thanks for all the ideas and input. I really feel like I'm learning a lot here. And I really appreciate you taking the time to explain me some of the concepts.
I know its a bit too soon, but I really wanted to give it a go, so I've tried to create the CT_BookmarkStart
and CT_bookmarkEnd
elements. I was not sure where to put them, so I created the bookmarks.py
file.
If you could please take a look and see if I'm on the right track? I really struggled with understanding how these CT_....
objects were supposed to be created. So I ended up with the ones I've commited. The code works, but that's not the only goal here ;)
With regard to the Bookmarks object, I really think the first try you posted is an elegant way to create a list-like
object with a lot of nice functionalities.
I haven't taken a look at the lazyproperty decorator yet.
I also haven't tried to actually implement your first Bookmarks object. Maybe I get around doing this upcoming week.
If you want to submit code for review, a pull request (PR) is the best way to do it. You might need to to a little research first. The key thing is to be able to make and submit changes without creating a new pull request each time. Basically, the way you do this is creating a new branch, based on master (on your fork) and making that branch the base of the PR. When you make changes, you rebase that branch to suit, and the PR is automatically updated (when you push that branch).
If you're not familiar with rebasing with Git you'll need to study up; it's an essential skill for this kind of interaction.
Then I can make comments directly on the code changes you submitted; you can then make changes, rebase the branch and push, and submit the revised code.
Regarding your first stab:
I think bookmark.py
is fine for now. We might change it to fields.py
or something later, depending on what else ends up in there, but a renaming is easily done once things become more clear.
You'll want the XML Schema extract in the analysis document before you do too much custom element class definition. The custom element class is essentially an implementation of the XML Schema definition for that/those XML elements. I can't tell if it's correct without reference to the related XML Schema excerpts.
The class names should exactly match their XML Schema counterparts. So it's CT_BookmarkStart
, not CT_bookmarkStart
.
You don't need set_x
methods on a custom element class. The metaclass adds the properties and methods you need just by you adding the x = OneOrMore('p:x', ...)
etc. class variables. So if you want to set the id
value, you just sah bookmarkStart.id = 42
. This is alot like how SQLAlchemy and Django work in their declarative aspects and represents the magic of MetaClasses.
I'm not sure what you intended with the _bookmark()
methods, but those don't need to be there. At this point, neither of those two classes should have any properties or methods other than those they get from the metaclass.
I'd recommend you focus on getting the pull request bit working and show me where you are on the analysis document, in particular the XML Schema excerpts, and then we can come back to these two definitions.
Btw, there's nothing wrong (in my mind) with jumping into the implementation, just to get straight in your mind how it's all going to work. I do that all the time (I call it 'spiking'). The only problem is thinking you're done once you have it working. That's where most folks stop and that's why they don't get a merge. From there you are in a good position to develop the acceptance and unit tests, flesh out the documentation required, and form it into a series of coherent commits that can be confidently committed without a lot of rework. All those things are required to actually add the feature without degrading the existing code base.
Allright! thanks for the feedback, I guess i have some studying to do.
I got rid of the set methods, however, the CT_BookmarkEnd and CT_BookmarkStart names do not rename to 'w:bookmarkEnd' and 'w:bookmarkStart'. But I dont understand why and how I can change this?
class CT_BookmarkStart(BaseOxmlElement):
"""
``<w:bookmarkStart>`` element
"""
bookmark = OneOrMore('w:bookmarkStart', ST_String)
name = RequiredAttribute('w:name', ST_String)
bmrk_id = RequiredAttribute('w:id', ST_RelationshipId)
def add_bookmark_start(self, id, name='test_bookmark'):
bmrk = CT_BookmarkStart()
bmrk.bookmark = 'w:bookmarkStart'
bmrk.bmrk_id = str(id)
bmrk.name = name
self.append(bmrk)
return bmrk
This leads to the following xml:
<w:p>
<w:r>
<w:t>Hello</w:t>
</w:r>
<CT_BookmarkStart w:id="2" w:name="text_to_be_bookmarked" />
<w:r>
<w:t>Hello</w:t>
</w:r>
<CT_BookmarkEnd w:id="2" />
</w:p>
Clearly Im overlooking something...
After hours of puzzeling and camparing code, I got it working!
the 'catch' was 'registering' the bookmarkStart elements as xml elements using the self._add_bookmarkstart()
method in the paragraph object.
I'also created a pull request like you asked; I purposly committed the 'removed obsolete code' commit to check if the pull request is automatically updated with the new code. Let me know if I need to change some settings.
the \oxml\paragraph.py
file:
class CT_P(BaseOxmlElement):
"""
``<w:p>`` element, containing the properties and text for a paragraph.
"""
pPr = ZeroOrOne('w:pPr')
r = ZeroOrMore('w:r')
bookmarkStart = ZeroOrMore('w:bookmarkStart')
bookmarkEnd = ZeroOrMore('w:bookmarkEnd')
the text\paragraph.py
file:
from __future__ import (
absolute_import, division, print_function, unicode_literals
)
from ..enum.style import WD_STYLE_TYPE
from .parfmt import ParagraphFormat
from .run import Run
from ..shared import Parented
from docx.shared import ElementProxy
from docx.oxml.bookmark import CT_BookmarkStart, CT_BookmarkEnd
class Paragraph(Parented):
"""
Proxy object wrapping ``<w:p>`` element.
"""
def __init__(self, p, parent):
super(Paragraph, self).__init__(parent)
self._p = self._element = p
def add_bookmark_start(self, id, name):
bmrk = self._p._add_bookmarkStart()
bmrk.name = name
bmrk.bmrk_id = str(id)
return self._p.append(bmrk)
def add_bookmark_end(self, id):
bmrk = self._p._add_bookmarkEnd()
bmrk.bmrk_id =str(id)
return self._p.append(bmrk)
and the elements:
class CT_BookmarkStart(BaseOxmlElement):
"""
``<w:bookmarkStart>`` element
"""
name = RequiredAttribute('w:name', ST_String)
bmrk_id = RequiredAttribute('w:id', ST_RelationshipId)
class CT_BookmarkEnd(BaseOxmlElement):
"""
The ``<w:bookmarkEnd>`` element
"""
bmrk_id = RequiredAttribute('w:id', ST_RelationshipId)
Since I finally got a half decent version working, I'll start focussing more on the development of the analysis documentation. Also I just found out that 'location' of the feature is within the CT_Body
and not in the CT_P
I've currently placed it.
The reason I placed it in the paragraph type is that the bookmarks are placed in a paragraph. Therefore i assumed that that would be the logical location.
This is interesting;
the element type of the bookmarkEnd element is CT_MarkupRange
<xsd:element name="bookmarkStart" type="CT_Bookmark"/>
<xsd:element name="bookmarkEnd" type="CT_MarkupRange"/>
<xsd:complexType name="CT_Body"> <!-- denormalized -->
<xsd:sequence>
<xsd:choice minOccurs="0" maxOccurs="unbounded">
<xsd:element name="p" type="CT_P"/>
<xsd:element name="tbl" type="CT_Tbl"/>
<xsd:element name="sdt" type="CT_SdtBlock"/>
<xsd:element name="customXml" type="CT_CustomXmlBlock"/>
<xsd:element name="altChunk" type="CT_AltChunk"/>
<xsd:element name="proofErr" type="CT_ProofErr"/>
<xsd:element name="permStart" type="CT_PermStart"/>
<xsd:element name="permEnd" type="CT_Perm"/>
<xsd:element name="bookmarkStart" type="CT_Bookmark"/>
<xsd:element name="bookmarkEnd" type="CT_MarkupRange"/>
<xsd:element name="moveFromRangeStart" type="CT_MoveBookmark"/>
<xsd:element name="moveFromRangeEnd" type="CT_MarkupRange"/>
<xsd:element name="moveToRangeStart" type="CT_MoveBookmark"/>
<xsd:element name="moveToRangeEnd" type="CT_MarkupRange"/>
<xsd:element name="commentRangeStart" type="CT_MarkupRange"/>
<xsd:element name="commentRangeEnd" type="CT_MarkupRange"/>
<xsd:element name="customXmlInsRangeStart" type="CT_TrackChange"/>
<xsd:element name="customXmlInsRangeEnd" type="CT_Markup"/>
<xsd:element name="customXmlDelRangeStart" type="CT_TrackChange"/>
<xsd:element name="customXmlDelRangeEnd" type="CT_Markup"/>
<xsd:element name="customXmlMoveFromRangeStart" type="CT_TrackChange"/>
<xsd:element name="customXmlMoveFromRangeEnd" type="CT_Markup"/>
<xsd:element name="customXmlMoveToRangeStart" type="CT_TrackChange"/>
<xsd:element name="customXmlMoveToRangeEnd" type="CT_Markup"/>
<xsd:element name="ins" type="CT_RunTrackChange"/>
<xsd:element name="del" type="CT_RunTrackChange"/>
<xsd:element name="moveFrom" type="CT_RunTrackChange"/>
<xsd:element name="moveTo" type="CT_RunTrackChange"/>
<xsd:element ref="m:oMathPara"/>
<xsd:element ref="m:oMath"/>
</xsd:choice>
<xsd:element name="sectPr" minOccurs="0" maxOccurs="1" type="CT_SectPr"/>
</xsd:sequence>
</xsd:complexType>
from the wml.xsd excerpts:
<xsd:complexType name="CT_BookmarkRange">
<xsd:complexContent>
<xsd:extension base="CT_MarkupRange">
<xsd:attribute name="colFirst" type="ST_DecimalNumber" use="optional">
<xsd:annotation>
<xsd:documentation>First Table Column Covered By Bookmark</xsd:documentation>
</xsd:annotation>
</xsd:attribute>
<xsd:attribute name="colLast" type="ST_DecimalNumber" use="optional">
<xsd:annotation>
<xsd:documentation>Last Table Column Covered By Bookmark</xsd:documentation>
</xsd:annotation>
</xsd:attribute>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:complexType name="CT_Bookmark">
<xsd:complexContent>
<xsd:extension base="CT_BookmarkRange">
<xsd:attribute name="name" type="ST_String" use="required">
<xsd:annotation>
<xsd:documentation>Bookmark Name</xsd:documentation>
</xsd:annotation>
</xsd:attribute>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
<xsd:group name="EG_RangeMarkupElements">
<xsd:choice>
<xsd:element name="bookmarkStart" type="CT_Bookmark">
<xsd:annotation>
<xsd:documentation>Bookmark Start</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="bookmarkEnd" type="CT_MarkupRange">
<xsd:annotation>
<xsd:documentation>Bookmark End</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="moveFromRangeStart" type="CT_MoveBookmark">
<xsd:annotation>
<xsd:documentation>Move Source Location Container - Start</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="moveFromRangeEnd" type="CT_MarkupRange">
<xsd:annotation>
<xsd:documentation>Move Source Location Container - End</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="moveToRangeStart" type="CT_MoveBookmark">
<xsd:annotation>
<xsd:documentation>Move Destination Location Container - Start</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="moveToRangeEnd" type="CT_MarkupRange">
<xsd:annotation>
<xsd:documentation>Move Destination Location Container - End</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="commentRangeStart" type="CT_MarkupRange">
<xsd:annotation>
<xsd:documentation>Comment Anchor Range Start</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="commentRangeEnd" type="CT_MarkupRange">
<xsd:annotation>
<xsd:documentation>Comment Anchor Range End</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="customXmlInsRangeStart" type="CT_TrackChange">
<xsd:annotation>
<xsd:documentation>Custom XML Markup Insertion Start</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="customXmlInsRangeEnd" type="CT_Markup">
<xsd:annotation>
<xsd:documentation>Custom XML Markup Insertion End</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="customXmlDelRangeStart" type="CT_TrackChange">
<xsd:annotation>
<xsd:documentation>Custom XML Markup Deletion Start</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="customXmlDelRangeEnd" type="CT_Markup">
<xsd:annotation>
<xsd:documentation>Custom XML Markup Deletion End</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="customXmlMoveFromRangeStart" type="CT_TrackChange">
<xsd:annotation>
<xsd:documentation>Custom XML Markup Move Source Start</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="customXmlMoveFromRangeEnd" type="CT_Markup">
<xsd:annotation>
<xsd:documentation>Custom XML Markup Move Source End</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="customXmlMoveToRangeStart" type="CT_TrackChange">
<xsd:annotation>
<xsd:documentation>Custom XML Markup Move Destination Location Start</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:element name="customXmlMoveToRangeEnd" type="CT_Markup">
<xsd:annotation>
<xsd:documentation>Custom XML Markup Move Destination Location End</xsd:documentation>
</xsd:annotation>
</xsd:element>
</xsd:choice>
</xsd:group>
Hmm, this is interesting. I haven't come across annotation and documentation strings in the schema before. Where did you pull these from?
The schema files I use are in the code tree here (and in the directory just above): https://github.com/python-openxml/python-docx/blob/master/ref/xsd/wml.xsd
Also I just found out that 'location' of the feature is within the CT_Body and not in the CT_P I've currently placed it.
The reason I placed it in the paragraph type is that the bookmarks are placed in a paragraph. Therefore i assumed that that would be the logical location.
Not sure what you mean by the "feature". This term is usually used here to indicate the API methods and properties added to implement some useful behavior.
But bookmarks appear to be part of both the body and paragraph objects and some others as well. So I'm not sure there's a clear "locale" for these, unlike most other objects in the code base.
A bookmark can be added to the body (document
from the API perspective) just as well as inside a paragraph, and it's possible I expect that one can start in one place and end in the other.
So something to noodle on :)
Hmm, this is interesting. I haven't come across annotation and documentation strings in the schema before. Where did you pull these from? The schema files I use are in the code tree here (and in the directory just above): https://github.com/python-openxml/python-docx/blob/master/ref/xsd/wml.xsd
this is a really interesting document:
Ah, of course, the ECMA spec, I had forgotten about that. It's been a while, so I'm not sure of the current status, but when I started the project I was using the ECMA spec until I started puzzling over why certain elements weren't in there or were named differently.
The short story (at the time at least) is that Microsoft uses the ISO spec (ISO/IEC 29500).
Anyway, the ECMA spec might be useful for reference, but I believe the ISO spec is still the definitive one. I'll be looking for XML Schema excerpts from the ISO spec/schemas.
@Scanny, I've studied up on the rebasing, and I think I got it figured out. It's a really neat trick! This makes collaboration a lot easier. Normally, I was always 'worried' about the number of commits for a certain fix, but with this rebasing you can combine / and reorder and hence restructure your work afterworks.
Having said that, I've taken a stab at the feature documentation. See pull request
With regard to the objects, we still had to figure out how to link the two parts of a Bookmark (bookmarkStart and bookmarkEnd) together, if there is like a paragraph in between.
My suggestion would be to make two ways to add a bookmark;
# This adds a paragraph containing the start and stop of the bookmark
document.add_bookmark(name='test', inplace=True)
# This places a bookmark start:
bookmark_obj_1 = document.add_bookmark(name='test_1')
bookmark_obj_1.start_bookmark()
# Start some new run, with text which maybe contains another bookmark:
par = document.add_paragraph()
bookmark_obj_2 = par.add_bookmark(name'test_2')
bookmark.start_bookmark()
# In this case, the bookmark contains the text, which could be referred to
par.add_run('sometext')
bookmark_obj_2.end_bookmark()
# Add the end of the first bookmark:
bookmark_obj_1.end_bookmark()
This does mean that one has to correctly place the bookmark object to be able to succesfully add the end the bookmark. I'm not entirely sure about the exact implementation. I think it has some challenges;
document.end_bookmark(name='test')
and this places the corresponding <w:bookmarkEnd>
element. I'll try to get an example up and running.
So, I've had my first struggles with git, but somehow managed to figure it out. I think it had something to do with the fact that I use both Sourcetree and the git shell commands. I think i forgot to fetch from all remotes and that is where the fun started..
What I've tried to do, is implement the Bookmarks and Bookmark object. And make them available in the document class. Currently they do generate a bookmark, but this bookmark is placed at the end of the Body.
using this code:
doc = Document()
asdf = doc.bookmark()
new_bookmark = asdf.add_bookmark()
new_bookmark.add_bookmark_start(name='test_1', id=1)
par = doc.add_paragraph()
par.add_run('test')
new_bookmark.add_bookmark_end(id=1)
new_bookmark_2 = asdf.add_bookmark()
new_bookmark_2.add_bookmark_start(name='test_2', id=2)
new_bookmark_2.add_bookmark_end(id=2)
doc.save('tab.docx')
<w:document>
<w:body>
<w:p>
<w:r>
<w:t>test</w:t>
</w:r>
</w:p>
<w:sectPr w:rsidRPr="0006063C" w:rsidR="00FC693F" w:rsidSect="00034616">
<w:pgSz w:w="12240" w:h="15840" />
<w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0" />
<w:cols w:space="720" />
<w:docGrid w:linePitch="360" />
</w:sectPr>
<w:bookmarkStart w:name="test_1" w:id="1" />
<w:bookmarkEnd w:id="1" />
<w:bookmarkStart w:name="test_2" w:id="2" />
<w:bookmarkEnd w:id="1" />
</w:body>
</w:document>
I hope I have some time tomorrow to figure it out, I'm thinking that the bookmark should be within a paragraph instance.
regards, Ben
Okay, so here's a little about what I think the code structure will look like. These sorts of things (proxy objects) all look the same in many respects. The term proxy object is just our name for objects like Paragraph
and Document
that "wrap" (proxy) an XML element, providing API calls that raise the abstraction level from the lxml/oxml
level up to something more suitable as a developer interface.
Let's start with the calling protocol I think we're after. This might bear more discussion, but this will do for a start:
>>> document = Document()
>>> bookmark = document.start_bookmark(name='test')
>>> document.add_paragraph('test bookmark body')
<Paragraph object at 0x...>
>>> document.end_bookmark(bookmark)
I've explicitly named the parameters, just for clarity. I expect there's no good reason not to allow them to be identified by position, e.g. .start_bookmark('test')
should be allowed. Later there will also need to be Paragraph.start_bookmark()
and Paragraph.end_bookmark()
and perhaps others, all of which can be mixed and matched (start in any, end in same or different ). But this is a good first step, much of which will be reusable or adaptable for the other cases.
So we'd want the code above to produce this XML:
<w:document>
<w:body>
<w:bookmarkStart w:name="test" w:id="1"/>
<w:p>
<w:r>
<w:t>test</w:t>
</w:r>
</w:p>
<w:bookmarkEnd w:id="1"/>
<w:sectPr w:rsidRPr="0006063C" w:rsidR="00FC693F" w:rsidSect="00034616">
<w:pgSz w:w="12240" w:h="15840"/>
<w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>
<w:cols w:space="720"/>
<w:docGrid w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>
The first challenge is to get the bookmarkStart and End elements to appear above the sentinel w:sectPr
element. It looks like that can be done by adding this line to CT_Body
in docx.oxml.document
:
bookmarkStart = ZeroOrMore('w:bookmarkStart', successors=('w:sectPr',))
The implementation in docx.document._Body
would be something like this::
def start_bookmark(self, name):
bookmarkStart = self._body.add_bookmarkStart(name)
return Bookmark(bookmarkStart)
This leaves some of the work to other objects. In particular, the bookmark id must be resolved and assigned by CT_Body.add_bookmark()
. There is already an .add_bookmark()
method on CT_Body
, placed there by the metaclass as a consequence of the ZeroOrMore()
call above. That will need to be overridden to give it that further functionality. An alternative would be to use a different name like .add_bookmarkStart_with_next_id()
or something, but I typically just override. I'd have to look around for another element class that does that to what all needs to go in there.
In addition, I expect there needs to be a ._next_available_bookmark_id
helper property on CT_Body
that can take care of that little chore. .add_bookmarkStart()
would automatically assign the next available bookmark id and then it would be available from the bookmarkStart
element that's passed to the Bookmark
constructor.
This may be a good place to mention that a proxy object (like Bookmark
) should have no state of its own. All state is in the XML. So, for example, if you wanted to add a property Bookmark.id
, when called, it would go to the XML to find out and return that value. There would be no ._id
instance variable hanging around to get out of sync with the XML. Its implementation would be something like return self._element.id
.
Note that .end_bookmark()
is on Document
, and that another method of the same name will need to be on Paragraph
and anywhere else a bookmark can end. Just because it starts in Document
(under the w:body
element) doesn't mean it has to end there. But in either case, a bookmark can't end itself. It needs reference to a context element (e.g. w:body
, w:paragraph
, etc.) and so the most natural interface I think is on the proxy object for that element (e.g. Body
or Paragraph
respectively). I suppose an interface like Bookmark.start_on(paragraph)
and Bookmark.end_on(document)
is possible, but it's not screaming out to me so far as a superior design.
Does that give you something to noodle on for next steps?
@scanny, I've studied up on the rebasing, and I think I got it figured out. It's a really neat trick! This makes collaboration a lot easier. Normally, I was always 'worried' about the number of commits for a certain fix, but with this rebasing you can combine / and reorder and hence restructure your work afterworks.
It is a neat trick, isn't it! :)
I call when I'm getting the commits cleaned up "grooming" the commit history/sequence. It's not always worthwhile in every situation, but when you're collaborating it's a really nice way to make your commit(s) reflect a clear, single focus and be separately mergeable.
So I've succeeded in getting the elements above the sentinel object, but am now stuck on the item id's.
my initial thought is making the Bookmarks object available within the _Body object. This allows for getting 'new' id's. However, I'm still trying to figure out a neat solution for identifying bookmarkStarts without bookmarkEnds.
current approach:
_next_id
. (see below, the Bookmarks class)Struggling with finding the 'open' bookmarks: This should be a property or method of the Bookmarks class as well.
@property
def _unclosed_bookmarks(self):
closed = set(Bookmark(bmrk)._id for bmrk in self._bookmarkEnds)
open_b = set(Bookmark(bmrk)._id for bmrk in self._bookmarkStarts)
check = open_b - closed
if len(self._bookmarkStarts) == 0:
raise ValueError('BookmarkEnd requested without a bookmarkStart present')
if len(check) == 0:
raise ValueError('No open bookmark found')
else:
return check.pop()
This should work, however, if there are multiple open bookmarks, there are more id's and then there is no way of identifying the right one without specifying the corresponding bookmark name.
What are your thoughts on this?
The docx.document.py code:
class Bookmark(ElementProxy):
def __init__(self, doc_element):
super(Bookmark, self).__init__(doc_element)
self._element = doc_element
@property
def _id(self):
return self._element.bmrk_id
@property
def _name(self):
return self._element.name
class _Body(BlockItemContainer):
"""
Proxy for ``<w:body>`` element in this document, having primarily a
container role.
"""
def __init__(self, body_elm, parent):
super(_Body, self).__init__(body_elm, parent)
self._body = body_elm
self._parent = parent
def clear_content(self):
"""
Return this |_Body| instance after clearing it of all content.
Section properties for the main document story, if present, are
preserved.
"""
self._body.clear_content()
return self
def add_bookmarkStart(self, name):
bookmarkStart = self._body._add_bookmarkStart()
bookmarkStart.name = name
bookmarkStart.bmrk_id = self._parent._Bookmarks._next_id
return Bookmark(bookmarkStart)
def add_bookmarkEnd(self):
open_id = self._parent._Bookmarks._unclosed_bookmark
bookmarkEnd = self._body._add_bookmarkEnd()
bookmarkEnd.bmrk_id = open_id
return Bookmark(bookmarkEnd)
and the Bookmarks class:
class Bookmarks(Sequence):
def __init__(self, document_elm):
super(Bookmarks, self).__init__()
self._document = self._element = document_elm
def __iter__(self):
for bookmarkStart in self._bookmarkStarts:
yield Bookmark(bookmarkStart)
def __getitem__(self, idx):
bookmarkStart = self._bookmarkStarts[idx]
return Bookmark(bookmarkStart)
def __len__(self):
return len(self._bookmarkStarts)
def get(self, name, default=None):
for bookmarkStart in self._bookmarkStarts:
if bookmarkStart.name == name:
return Bookmark(bookmarkStart)
return default
@property
def _bookmarkStarts(self):
return self._document.xpath('.//w:bookmarkStart')
@property
def _bookmarkEnds(self):
return self._document.xpath('.//w:bookmarkEnd')
@property
def _next_id(self):
return str(len(self._bookmarkStarts))
Hmm, I'm not sure about this solution. This checks for open bookmarks, but if there are two bookmarks open there is no way of choosing one of the two. Therefore, we might need to go for a add_bookmark_end(name='name_of_specific_bookmark')
This is the current code outline:
doc = Document()
bmark_obj = doc.bookmarks()
test = doc.start_bookmark('test_one')
par = doc.add_paragraph()
par.add_run('test')
doc.end_bookmark()
doc.start_bookmark('test_two')
par_2 = doc.add_paragraph()
par_2.add_run('test_line_2')
doc.end_bookmark()
doc.save('test.docx')
results in:
<w:document>
<w:body>
<w:bookmarkStart w:name="test_one" w:id="1" />
<w:p>
<w:r>
<w:t>test</w:t>
</w:r>
</w:p>
<w:bookmarkEnd w:id="1" />
<w:bookmarkStart w:name="test_two" w:id="2" />
<w:p>
<w:r>
<w:t>test_line_2</w:t>
</w:r>
</w:p>
<w:bookmarkEnd w:id="2" />
<w:sectPr w:rsidRPr="0006063C" w:rsidR="00FC693F" w:rsidSect="00034616">
<w:pgSz w:w="12240" w:h="15840" />
<w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0" />
<w:cols w:space="720" />
<w:docGrid w:linePitch="360" />
</w:sectPr>
</w:body>
</w:document>
We should probably keep all the conversation in the pull request going forward, so it's all in one piece. I'll respond to these here but after that, let's do any new ones in the Conversations tab of the pull request :)
Okay, I don't completely understand your code for the "finding the ids" bit, but I'm thinking the solution is simpler than you think, basically that we use XPath to find the ids when we need them (and that we don't store them anywhere outside the XML).
I'm also not sure we need an "unclosed" test. I'm open to argumentation to the contrary, but I'm not seeing it yet. It looks to me like there are a certain number of w:bookmarkStart
elements in the XML, each one represents a bookmark, and a bookmark can be either open or closed. We don't need to determine whether it's open or not unless someone cares to ask or if we need to do an operation that only works on a closed bookmark. We can certainly have an .is_closed
property on Bookmark
if we need it (I don't see yet that we do though) and we may very well want a ._bookmarkEnd
property on Bookmark
, although I haven't seen the use case for that yet (I suspect it's there).
I do think we're on the wrong side of YAGNI here, working with methods and properties we don't have a clear need for yet. At times like these, I always return to the use-case/protocol part of the analysis and narrow my focus down to something entirely concrete. Basically answer the question "As a developer using python-docx, I need to be able to {do what exactly?}"
Anyway, back to the "finding the bookmarkEnd elements" question.
So say we have a bookmarkStart
and we want to find its matching bookmarkEnd
. First of all, we should only do this in the oxml layer (not the proxy layer) and we'd do it something like this:
class CT_BookmarkStart(...):
@property
def matching_bookmarkEnd(self):
"""Return first `w:bookmarkEnd` element having matching id, None if not present."""
root_element = self.getroottree().getroot()
matching_bookmarkEnds = root_element.xpath(
'.//w:bookmarkEnd[@w:id=\'%d\']' % self.id
)
if not matching_bookmarkEnds:
return None
return matching_bookmarkEnds[0]
If you're available for a call or a Google hangout, it might be a faster way to work through some of these initial concepts, let me know :)
As multiple issues have stated before, it would be great if we the captions supported in python-docx. Previous issues already should temporary fixes to similar problems like the IndexedEntry (#137) and the figure caption in (#359). Also bookmarks have been investigated: (#109) As I understood from: feature analysis: section columns and Feature: Paragraph.add_hyperlink() (wip) Is that there are a few mandatory steps to take before a pull request is accepted:
Finally: Focus is key, one feature should only by a single functionality, to quote @scanny:
Now I've collected the available issues and pull request my question is, how to go from here. I would assume that we start at the beginning: Feature analysis
Adding the caption to a paragraph isn't directly the issue, the main goal is to create a feature which can be used together with a crossreference. Therefore the number of references or bookmarks that are created should be stored in an object. This object should be available to the caption method, in order to be support a sort of 'guessed figure/table number'. What I mean by this is best explained with an example:
The proposed solution in #359 works. Only he user first has to create a print view to force a fieldcode refresh to make the numbers visible. Therefore, it would be nice if the caption method could already place a "guessed" number. (Comparable as how Word behaves)