rmzelle / ref-extractor

Reference Extractor - Extract Zotero/Mendeley references from Microsoft Word files
https://rintze.zelle.me/ref-extractor/
MIT License
328 stars, 20 forks

Handle citations divided over multiple w:instrText elements #21

Closed: rmzelle closed this 6 years ago

rmzelle commented 6 years ago

Per https://github.com/rmzelle/ref-extractor/pull/19#issuecomment-354619109 and https://github.com/rmzelle/ref-extractor/pull/19#issuecomment-354631709. Currently these citations aren't extracted.

It's not really clear to me when citations get split over multiple w:instrText elements. Also, for in-text author-date citations, it's not clear how one can determine which w:instrText elements belong together. In the documents I've seen, split elements share a w:p element as ancestor, but I'm not 100% sure this is always the case (e.g. for citations within figure or table captions). For footnotes and endnotes it looks like there is always a w:footnote or w:endnote ancestor, respectively, one level higher, which makes things easier, but for author-date citations everything falls within a shared w:body.

zuphilip commented 6 years ago

As far as we have seen, citations divided over multiple w:instrText elements share a common w:p ancestor. So this seems like a good hypothesis to work with (further improvements or adjustments can always be made later).

There is some documentation about the docx format, and especially about fields in Word:

but I haven't looked closely at them.

rmzelle commented 6 years ago

Thanks! That first link is very helpful. Looks like it would be best to look for

<w:r>
  <w:fldChar w:fldCharType="begin"/>
</w:r>

and extract the contents from subsequent

<w:r>
  <w:instrText xml:space="preserve"> DATE </w:instrText>
</w:r>

nodes until we hit

<w:r>
  <w:fldChar w:fldCharType="end"/>
</w:r>

(given that the basic structure of a complex field seems to be

<w:r>
  <w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
  <w:instrText xml:space="preserve"> DATE </w:instrText>
</w:r>
<w:r>
  <w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r>
  <w:t>12/31/2005</w:t>
</w:r>
<w:r>
  <w:fldChar w:fldCharType="end"/>
</w:r>

(with one or more w:instrText elements))

rmzelle commented 6 years ago

But grouping w:instrText elements that have the same grandparent (always w:p, as far as we can tell), like you're suggesting, might work just as well.
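To make the grandparent idea concrete, here is a rough sketch in Python (not the project's code). The function name `instructions_by_paragraph` and the sample XML are made up for illustration. It joins every w:instrText whose grandparent is the same w:p.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def instructions_by_paragraph(root):
    """Join all w:instrText texts that share the same w:p grandparent."""
    grouped = []
    for para in root.iter(f"{{{W}}}p"):
        # "w:r/w:instrText" only matches instrText two levels below the paragraph.
        texts = [t.text or "" for t in para.findall("w:r/w:instrText", NS)]
        if texts:
            grouped.append("".join(texts))
    return grouped

# Hypothetical paragraph holding two fields, the first split over two w:instrText elements.
SAMPLE = f"""
<w:body xmlns:w="{W}"><w:p>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText>FIRST </w:instrText></w:r>
  <w:r><w:instrText>CITATION</w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText>SECOND CITATION</w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p></w:body>
"""

print(instructions_by_paragraph(ET.fromstring(SAMPLE.strip())))
```

One caveat this sample shows: with two fields in the same paragraph, the grouping returns a single merged string, so the result would still need to be split into individual citations afterwards.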

rmzelle commented 6 years ago

I just added support for these split w:instrText elements with a solution that doesn't assume any particular grandparent element. (I just locate the initial w:r element and visit its siblings until I come across the closing w:r element.)
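For anyone reading along, the sibling walk can be sketched roughly as follows. This is Python for illustration only, not the code that was committed. The name `extract_field_instructions` and the sample field contents are assumptions, and nested fields are deliberately ignored.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
FLD_CHAR_TYPE = f"{{{W}}}fldCharType"

def extract_field_instructions(root):
    """Return one joined instruction string per complex field (no nesting support)."""
    fields = []
    for parent in root.iter():
        parts = None  # None means we are not currently inside a field
        # Direct w:r children of one parent are siblings, visited in document order.
        for run in parent.findall("w:r", NS):
            fld_char = run.find("w:fldChar", NS)
            kind = fld_char.get(FLD_CHAR_TYPE) if fld_char is not None else None
            if kind == "begin":
                parts = []
            elif kind == "end" and parts is not None:
                fields.append("".join(parts))
                parts = None
            elif parts is not None:
                parts.extend(t.text or "" for t in run.findall("w:instrText", NS))
    return fields

# Hypothetical citation field whose instruction is split over two w:instrText elements.
SAMPLE = f"""
<w:document xmlns:w="{W}"><w:body><w:p>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> ADDIN ZOTERO_ITEM </w:instrText></w:r>
  <w:r><w:instrText xml:space="preserve">CSL_CITATION {{"citationID":"abc"}} </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>(Doe 2017)</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p></w:body></w:document>
"""

for instruction in extract_field_instructions(ET.fromstring(SAMPLE.strip())):
    print(instruction.strip())
```

Because the walk only relies on the begin and end markers among sibling runs, it does not matter whether the shared parent is a w:p or something else, and two fields in the same paragraph stay separate.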