Closed by rmzelle 6 years ago
As far as we have seen, citations divided over multiple w:instrText elements share a common w:p ancestor. So this seems a good hypothesis to work with (improvements or adjustments can always be made later).
There is some documentation on the docx format, and especially on fields in Word:
but I haven't looked at it closely.
Thanks! That first link is very helpful. Looks like it would be best to look for
<w:r>
<w:fldChar w:fldCharType="begin"/>
</w:r>
and extract the contents from subsequent
<w:r>
<w:instrText xml:space="preserve"> DATE </w:instrText>
</w:r>
nodes until we hit
<w:r>
<w:fldChar w:fldCharType="end"/>
</w:r>
(given that the basic structure of a complex field seems to be
<w:r>
<w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
<w:instrText xml:space="preserve"> DATE </w:instrText>
</w:r>
<w:r>
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r>
<w:t>12/31/2005</w:t>
</w:r>
<w:r>
<w:fldChar w:fldCharType="end"/>
</w:r>
(with one or more w:instrText elements))
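A minimal sketch of that scan in Python, in case it helps the discussion. It walks the XML in document order, starts collecting at a "begin" w:fldChar, and joins every w:instrText it sees until the matching "end". The function name field_instructions and the SAMPLE fragment (the DATE field from above, split over two runs) are made up for illustration, and nested fields are assumed not to occur.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
FLD_CHAR = f"{{{W}}}fldChar"
FLD_CHAR_TYPE = f"{{{W}}}fldCharType"
INSTR_TEXT = f"{{{W}}}instrText"

# Hypothetical sample: one complex field whose instruction is split in two.
SAMPLE = f"""<w:p xmlns:w="{W}">
<w:r><w:fldChar w:fldCharType="begin"/></w:r>
<w:r><w:instrText xml:space="preserve"> DA</w:instrText></w:r>
<w:r><w:instrText xml:space="preserve">TE </w:instrText></w:r>
<w:r><w:fldChar w:fldCharType="separate"/></w:r>
<w:r><w:t>12/31/2005</w:t></w:r>
<w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>"""


def field_instructions(root):
    """Return the joined instruction text of each complex field under root."""
    fields = []
    parts = None  # None while we are outside a field
    for el in root.iter():  # document order
        if el.tag == FLD_CHAR:
            kind = el.get(FLD_CHAR_TYPE)
            if kind == "begin":
                parts = []  # assumption: fields are not nested
            elif kind == "end" and parts is not None:
                fields.append("".join(parts))
                parts = None
        elif el.tag == INSTR_TEXT and parts is not None:
            parts.append(el.text or "")
    return fields


print(field_instructions(ET.fromstring(SAMPLE)))
```

Because the scan only relies on document order, it makes no assumption about which element the runs are nested in.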
But grouping w:instrText elements that have the same grandparent (always w:p, as far as we can tell), like you're suggesting, might work just as well.
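For comparison, a rough Python sketch of the grouping idea: for each w:p, join the text of the w:instrText elements that have that paragraph as grandparent. The name instructions_by_paragraph and the SAMPLE fragment are hypothetical. The sketch also assumes at most one field per paragraph, since two fields in the same paragraph would be merged into one string.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P = f"{{{W}}}p"
RUN_INSTR_PATH = f"{{{W}}}r/{{{W}}}instrText"  # w:p -> w:r -> w:instrText

# Hypothetical sample: the first paragraph holds a split instruction.
SAMPLE = f"""<w:body xmlns:w="{W}">
<w:p>
<w:r><w:instrText xml:space="preserve"> DA</w:instrText></w:r>
<w:r><w:instrText xml:space="preserve">TE </w:instrText></w:r>
</w:p>
<w:p>
<w:r><w:instrText xml:space="preserve"> PAGE </w:instrText></w:r>
</w:p>
</w:body>"""


def instructions_by_paragraph(root):
    """Join the w:instrText pieces that share the same w:p grandparent."""
    grouped = []
    for paragraph in root.iter(P):
        pieces = [t.text or "" for t in paragraph.findall(RUN_INSTR_PATH)]
        if pieces:  # skip paragraphs without any field instruction
            grouped.append("".join(pieces))
    return grouped


print(instructions_by_paragraph(ET.fromstring(SAMPLE)))
```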
I just added support for these split w:instrText elements with a solution that doesn't assume any particular grandparent element. (I just locate the initial w:r element and visit its siblings until I come across the closing w:r element.)
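To illustrate the sibling walk, here is a hedged Python sketch. It is not the code that was added to the project. ElementTree has no next-sibling accessor, so the sketch slices the parent's child list instead. The helper names fld_type and collect_from_begin_runs and the SAMPLE fragment are made up for the example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
FLD_CHAR = f"{{{W}}}fldChar"
FLD_CHAR_TYPE = f"{{{W}}}fldCharType"
INSTR_TEXT = f"{{{W}}}instrText"

# Hypothetical sample: a DATE field whose instruction is split over two runs.
SAMPLE = f"""<w:p xmlns:w="{W}">
<w:r><w:fldChar w:fldCharType="begin"/></w:r>
<w:r><w:instrText xml:space="preserve"> DA</w:instrText></w:r>
<w:r><w:instrText xml:space="preserve">TE </w:instrText></w:r>
<w:r><w:fldChar w:fldCharType="separate"/></w:r>
<w:r><w:t>12/31/2005</w:t></w:r>
<w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>"""


def fld_type(run):
    """Return the fldCharType of a run's direct w:fldChar child, or None."""
    fld_char = run.find(FLD_CHAR)
    return None if fld_char is None else fld_char.get(FLD_CHAR_TYPE)


def collect_from_begin_runs(root):
    """For every 'begin' run, visit its following siblings up to the 'end' run."""
    results = []
    for parent in root.iter():  # any element may hold the runs
        children = list(parent)
        for i, run in enumerate(children):
            if fld_type(run) != "begin":
                continue
            parts = []
            for sibling in children[i + 1:]:
                if fld_type(sibling) == "end":
                    results.append("".join(parts))
                    break
                parts.extend(t.text or "" for t in sibling.iter(INSTR_TEXT))
            # A field without a closing run among its siblings is dropped.
    return results


print(collect_from_begin_runs(ET.fromstring(SAMPLE)))
```

The walk never checks what the parent element is, so it behaves the same whether the runs sit in a w:p or in some other container.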
Per https://github.com/rmzelle/ref-extractor/pull/19#issuecomment-354619109 and https://github.com/rmzelle/ref-extractor/pull/19#issuecomment-354631709. Currently these citations aren't extracted.
It's not really clear to me when citations are split over multiple w:instrText elements. Also, for in-text author-date citations, it's not clear how to determine which w:instrText elements belong together. In the documents I've seen, split elements share a w:p ancestor, but I'm not 100% sure this is always the case (e.g. for citations within figure or table captions). For footnotes and endnotes there always seems to be a w:footnote or w:endnote ancestor, respectively, one level higher, which makes things easier, but for author-date citations everything falls within a shared w:body.