jgm / pandoc

Universal markup converter
https://pandoc.org

Read docx references, and convert into pandoc citations #4140

Open jkr opened 6 years ago

jkr commented 6 years ago

Docx has a built in reference tracker, which keeps the references in an xml file in the document (and can change citation format on the fly). The docx reader could read these and convert them into pandoc citations.

Ideally, if we're going docx->markdown, we'd expect the citations to be similarly built into the markdown file. @jgm, do you think it would work to have this conversion produce yaml-block citation entries?

jgm commented 6 years ago

Jesse Rosenthal [Dec 11 17 13:45]:

Ideally, if we're going docx->markdown, we'd expect the citations to be similarly built into the markdown file. @jgm, do you think it would work to have this conversion produce yaml-block citation entries?

Sure, this could be done. It's just a matter of adding a references field to the metadata (or adding items to an existing references field).

jkr commented 6 years ago

Okay -- I'll take a look at how bearable the xml format is. Although I wonder how many people use this, vs zotero or endnote or whatever.

I'll also take a look around and see (a) how much people seem to use this feature, and (b) how easy it is to export citations into zotero/endnote/etc. If it looks useful, it might be worth considering putting it in the writer (sort of as a docx analogue to --biblatex/--natbib).

iandol commented 6 years ago

Hm, I know of no academics who use the Word Citations feature, it is very basic (manual entry) but quite elegantly integrated in Word, Endnote is sadly the king of the hill for most people. Being able to retain reference data from the reader into YAML metadata would be neat, but I suspect the number of users would be exceedingly small :-)

I do wonder if Endnote uses the same XML store for its travelling library (it can store refs in a Word doc). If that were the case, being able to import the references would become a much more useful feature! But I imagine Endnote deliberately separates them; I'll have a look at a sample docx and see...

iandol commented 6 years ago

Endnote places all reference data in document.xml directly as escaped xml:

<w:bookmarkStart w:id="0" w:name="_GoBack"/>
<w:r><w:fldChar w:fldCharType="begin"/></w:r>
<w:r><w:instrText xml:space="preserve"> ADDIN EN.CITE &lt;EndNote&gt;&lt;Cite&gt;&lt;Author&gt;Zeki&lt;/Author&gt;&lt;/Cite&gt;&lt;/EndNote&gt;</w:instrText></w:r>
<w:r><w:fldChar w:fldCharType="separate"/></w:r>
<w:r><w:rPr><w:noProof/></w:rPr>
<w:t>(Zeki and Shipp 1989)</w:t></w:r>
<w:r><w:fldChar w:fldCharType="end"/></w:r>
<w:bookmarkEnd w:id="0"/>
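
For illustration only, here is a minimal Python sketch of how such an EndNote field could be decoded. It assumes the whole `ADDIN EN.CITE` payload sits in a single `w:instrText` run, as in the snippet above; real documents may split it over several runs or nest more data inside `Cite`. The names `endnote_cites` and `FRAGMENT` are hypothetical. Because the XML parser already unescapes `&lt;` and `&gt;`, the payload can simply be parsed a second time as XML.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Shortened stand-in for word/document.xml, based on the snippet above.
FRAGMENT = (
    '<w:p xmlns:w="' + W + '"><w:r><w:instrText xml:space="preserve">'
    " ADDIN EN.CITE &lt;EndNote&gt;&lt;Cite&gt;&lt;Author&gt;Zeki"
    "&lt;/Author&gt;&lt;/Cite&gt;&lt;/EndNote&gt;</w:instrText></w:r></w:p>"
)


def endnote_cites(document_xml):
    """Return one dict per <Cite> found in ADDIN EN.CITE field instructions."""
    root = ET.fromstring(document_xml)
    cites = []
    for instr in root.iter("{" + W + "}instrText"):
        _, found, payload = (instr.text or "").partition("ADDIN EN.CITE")
        payload = payload.strip()
        # Skip other fields and anything that is not an embedded XML payload.
        if not found or not payload.startswith("<"):
            continue
        inner = ET.fromstring(payload)
        for cite in inner.iter("Cite"):
            # Only direct children with plain text are kept in this sketch.
            cites.append({child.tag: (child.text or "") for child in cite})
    return cites


print(endnote_cites(FRAGMENT))
```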

zaxebo1 commented 6 years ago

iandol> Hm, I know of no academics who use the Word Citations feature, it is very basic (manual entry) but quite elegantly integrated in Word, Endnote is sadly the king of the hill for most people.

On the contrary, in my region I have seen MS Word users relying heavily on Word's citations feature, and only one using Endnote so far. Hence, it would be a really useful feature.

jkr commented 6 years ago

It's definitely on my to-do list.

tstenner commented 6 years ago

See also https://github.com/jgm/pandoc-citeproc/issues/323

lizdenhup commented 4 years ago

Hello! Has there been any further development in this area? I have come across several manuscripts which make use of Microsoft Word's built-in reference management tool.

For my use case, I would like to go from docx -> markdown and have Pandoc extract the references which were made using Microsoft's reference management tool.

jooyoungseo commented 4 years ago

I was also wondering if this issue could be addressed in the near future.

tarleb commented 4 years ago

Not quite the same, but possibly helpful: https://rintze.zelle.me/ref-extractor/

jonjamesjr commented 3 years ago

I'm wondering if the reverse is also possible -- going from markdown to docx and writing citations as native Word citations? This has the advantage that future users of the docx could (e.g.) add their own references, or change citation style, all from within Word.

My use case: I'm trying to help a friend tidy and format her PhD thesis. I want to go docx -> md, then script whatever tidying is needed, then return md -> docx so she has a useable docx she can continue editing ahead of submission.

pletcher commented 3 years ago

I also wanted to add that this would be helpful for the drafting process. A lot of publications and advisors expect docx, and after a while it often makes sense to pass a document back and forth with track changes turned on. Being able to go back and forth more easily would solve the issue of getting "stuck" in docx-land.

tstenner commented 3 years ago

FWIW, Zotero (and also Mendeley) can include the bibliography data in the document as bookmarks + custom properties.

A single citation has two parts. The citation text in the document.xml:

<w:r>
    <w:t xml:space="preserve">A previous study</w:t>
</w:r>
<w:r>
    <w:t xml:space="preserve"/>
</w:r>
<w:bookmarkStart w:id="5" w:name="ZOTERO_BREF_Tce3DUyb3IiZ"/>
<w:r>
    <w:t>(Hartwigsen, Baumgaertner, et al., 2010)</w:t>
</w:r>
<w:bookmarkEnd w:id="5"/>
<w:r>
    <w:t xml:space="preserve">has shown that 10Hz rTMS of the left or right p</w:t>
</w:r>

and the bibliography data as custom properties (CSL JSON-formatted) in the custom.xml:

<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="343" name="ZOTERO_BREF_Tce3DUyb3IiZ_1">
    <vt:lpwstr>ZOTERO_ITEM CSL_CITATION {"citationID":"0Bf2LOgP","properties":{"formattedCitation":"(Hartwigsen, Baumgaertner, et al., 2010)","plainCitation":"(Hartwigsen, Baumgaertner, et al., 2010)","noteIndex":0},"citationItems":[{"id":520,"uris":["http://zotero.org/</vt:lpwstr>
</property>
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="344" name="ZOTERO_BREF_Tce3DUyb3IiZ_2">
    <vt:lpwstr>users/1707117/items/95WI3CQL"],"uri":["http://zotero.org/users/1707117/items/95WI3CQL"],"itemData":{"id":520,"type":"article-journal","title":"Phonological decisions require both the left and right supramarginal gyri","container-title":"Proc. Natl. Acad.</vt:lpwstr>
</property>
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="345" name="ZOTERO_BREF_Tce3DUyb3IiZ_3">
    <vt:lpwstr>Sci.","page":"16494–16499","volume":"107","issue":"38","DOI":"10.1073/pnas.1008121107","ISSN":"0027-8424","note":"PMID: 20807747","author":[{"family":"Hartwigsen","given":"Gesa"},{"family":"Baumgaertner","given":"Annette"},{"family":"Price","given":"Cathy</vt:lpwstr>
</property>
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="346" name="ZOTERO_BREF_Tce3DUyb3IiZ_4">
    <vt:lpwstr>J."},{"family":"Koehnke","given":"M"},{"family":"Ulmer","given":"S"},{"family":"Siebner","given":"Hartwig Roman"}],"issued":{"date-parts":[["2010",9]]}}}],"schema":"https://github.com/citation-style-language/schema/raw/master/csl-citation.json"}</vt:lpwstr>
</property>

So, recovering citations requires support for reading docx custom properties (#3034) and bookmarks (#6781). The resulting markdown would look like this:

---
ZOTERO_BREF_Tce3DUyb3IiZ_1:  'ZOTERO_ITEM CSL_CITATION {"citationID":"0Bf2LOgP","properties":{"formattedCitation":"(Hartwigsen, Baumgaertner, et al., 2010)","plainCitation":"(Hartwigsen, Baumgaertner, et al., 2010)","noteIndex":0},"citationItems":[{"id":520,"uris":["http://zotero.org/'
ZOTERO_BREF_Tce3DUyb3IiZ_2:  'users/1707117/items/95WI3CQL"],"uri":["http://zotero.org/users/1707117/items/95WI3CQL"],"itemData":{"id":520,"type":"article-journal","title":"Phonological decisions require both the left and right supramarginal gyri","container-title":"Proc. Natl. Acad.'
ZOTERO_BREF_Tce3DUyb3IiZ_3:  'Sci.","page":"16494–16499","volume":"107","issue":"38","DOI":"10.1073/pnas.1008121107","ISSN":"0027-8424","note":"PMID: 20807747","author":[{"family":"Hartwigsen","given":"Gesa"},{"family":"Baumgaertner","given":"Annette"},{"family":"Price","given":"Cathy'
ZOTERO_BREF_Tce3DUyb3IiZ_4:  'J."},{"family":"Koehnke","given":"M"},{"family":"Ulmer","given":"S"},{"family":"Siebner","given":"Hartwig Roman"}],"issued":{"date-parts":[["2010",9]]}}}],"schema":"https://github.com/citation-style-language/schema/raw/master/csl-citation.json"}'
...

A previous study [(Hartwigsen, Baumgaertner, et al., 2010)](#ZOTERO_BREF_Tce3DUyb3IiZ) has shown that 10Hz rTMS of the left or right p

After that, the rest could be done with a custom filter.
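
Such a filter or pre-processing script would first have to glue the numbered chunks back together, since the JSON is cut at arbitrary points. A minimal Python sketch, assuming the properties have already been read into a dict mapping property name to string value; the function name `reassemble_citations` is hypothetical:

```python
import json
import re

PREFIX = "ZOTERO_ITEM CSL_CITATION "
CHUNK_NAME = re.compile(r"(ZOTERO_BREF_\w+)_(\d+)")


def reassemble_citations(props):
    """Join ZOTERO_BREF_<id>_<n> chunks and decode the CSL citation JSON.

    Returns a dict keyed by bookmark name (ZOTERO_BREF_<id>).
    """
    chunks = {}
    for name, value in props.items():
        match = CHUNK_NAME.fullmatch(name)
        if match:
            bookmark, index = match.group(1), int(match.group(2))
            chunks.setdefault(bookmark, []).append((index, value))
    citations = {}
    for bookmark, parts in chunks.items():
        # Sort by the numeric suffix so that _10 comes after _9.
        joined = "".join(value for _, value in sorted(parts))
        if joined.startswith(PREFIX):
            citations[bookmark] = json.loads(joined[len(PREFIX):])
    return citations
```

The keys of the result match the bookmark names in document.xml, so a filter could look up `citationItems` for each linked span and emit a pandoc citation plus a `references` entry from `itemData`.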

agusmba commented 3 years ago

I've tested the native ooxml citations using pandoc's reference docx (see attached file at the end).

If you insert a citation, you get this in the text (document.xml):

            <w:sdt>
                <w:sdtPr>
                    <w:id w:val="692810013"/>
                    <w:citation/>
                </w:sdtPr>
                <w:sdtContent>
                    <w:r w:rsidR="006E4349">
                        <w:fldChar w:fldCharType="begin"/>
                    </w:r>
                    <w:r w:rsidR="006E4349">
                        <w:rPr>
                            <w:lang w:val="es-ES"/>
                        </w:rPr>
                        <w:instrText xml:space="preserve"> CITATION Aut21 \l 3082 </w:instrText>
                    </w:r>
                    <w:r w:rsidR="006E4349">
                        <w:fldChar w:fldCharType="separate"/>
                    </w:r>
                    <w:r w:rsidR="006E4349">
                        <w:rPr>
                            <w:noProof/>
                            <w:lang w:val="es-ES"/>
                        </w:rPr>
                        <w:t xml:space="preserve"></w:t>
                    </w:r>
                    <w:r w:rsidR="006E4349" w:rsidRPr="006E4349">
                        <w:rPr>
                            <w:noProof/>
                            <w:lang w:val="es-ES"/>
                        </w:rPr>
                        <w:t>[1]</w:t>
                    </w:r>
                    <w:r w:rsidR="006E4349">
                        <w:fldChar w:fldCharType="end"/>
                    </w:r>
                </w:sdtContent>
            </w:sdt>
        </w:p>
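
The field instruction is what ties the citation to its source: the key after `CITATION` (`Aut21`) matches a `b:Tag` in the bibliography part. A minimal Python sketch of splitting such an instruction into the tag and its switches follows. The function name `parse_citation_field` is hypothetical, and reading `\l 3082` as a locale identifier (it would fit the `es-ES` in `w:lang`) is an assumption, so the sketch keeps switches uninterpreted.

```python
import re


def parse_citation_field(instr):
    """Split a CITATION field instruction into its tag and raw switches.

    Returns None if the text is not a CITATION instruction.
    A switch that occurs twice overwrites the earlier value in this sketch.
    """
    tokens = re.findall(r'"[^"]*"|\S+', instr)
    if len(tokens) < 2 or tokens[0].upper() != "CITATION":
        return None
    switches = {}
    i = 2
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("\\"):
            # A switch takes the next token as its argument unless that
            # token is itself a switch.
            has_arg = i + 1 < len(tokens) and not tokens[i + 1].startswith("\\")
            switches[token] = tokens[i + 1].strip('"') if has_arg else None
            i += 2 if has_arg else 1
        else:
            i += 1
    return {"tag": tokens[1], "switches": switches}


print(parse_citation_field(r" CITATION Aut21 \l 3082 "))
```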

And additional xml files: customXml/item1.xml

<b:Sources xmlns:b="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" 
xmlns="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" SelectedStyle="\IEEE2006OfficeOnline.xsl" StyleName="IEEE" Version="2006">
<b:Source>
<b:Tag>Aut21</b:Tag>
<b:SourceType>DocumentFromInternetSite</b:SourceType>
<b:Guid>{D07F09F5-5F6E-41E1-A805-3722F7520E45}</b:Guid>
<b:Title>Website_name_value</b:Title>
<b:Year>2021</b:Year><b:Month>01</b:Month><b:Day>01</b:Day>
<b:YearAccessed>2021</b:YearAccessed><b:MonthAccessed>01</b:MonthAccessed><b:DayAccessed>17</b:DayAccessed>
<b:URL>https://www.example.com</b:URL>
<b:Author>
  <b:Author><b:NameList><b:Person><b:Last>Author_value</b:Last></b:Person></b:NameList></b:Author>
</b:Author><b:RefOrder>1</b:RefOrder>
</b:Source>
</b:Sources>

customXml/itemProps1.xml (7 lines, no special content):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{C6A3015E-D30B-444E-98E5-7A504E4E7FCF}"
    xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml">
    <ds:schemaRefs>
        <ds:schemaRef ds:uri="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"/>
    </ds:schemaRefs>
</ds:datastoreItem>

customXml/_rels/item1.xml.rels:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships
    xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXmlProps" Target="itemProps1.xml"/>
</Relationships>

If we add the bibliography smart field, we get the following in document.xml (collapsed due to length; only the text content is shown):

Bibliografía
BIBLIOGRAPHY
[1] Author_value, «Website_name_value,» 01 01 2021. [En línea]. Available: https://www.example.com. [Último acceso: 17 01 2021].

If we add additional bibliographic entries, they are added to customXml/item1.xml:

<b:Sources xmlns:b="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"
 xmlns="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" SelectedStyle="\IEEE2006OfficeOnline.xsl" StyleName="IEEE" Version="2006">
 <b:Source>
 <b:Tag>Aut21</b:Tag>
 <b:SourceType>DocumentFromInternetSite</b:SourceType>
 <b:Guid>{D07F09F5-5F6E-41E1-A805-3722F7520E45}</b:Guid>
 <b:Title>Website_name_value</b:Title><b:Year>2021</b:Year><b:Month>01</b:Month><b:Day>01</b:Day>
 <b:YearAccessed>2021</b:YearAccessed><b:MonthAccessed>01</b:MonthAccessed><b:DayAccessed>17</b:DayAccessed>
 <b:URL>https://www.example.com</b:URL>
 <b:Author><b:Author><b:NameList><b:Person><b:Last>Author_value</b:Last></b:Person></b:NameList></b:Author></b:Author>
 <b:RefOrder>1</b:RefOrder>
 </b:Source>
 <b:Source>
 <b:Tag>Boo21</b:Tag>
 <b:SourceType>Book</b:SourceType>
 <b:Guid>{93F2D6C8-FC81-44E3-91FF-8407E1703778}</b:Guid>
 <b:Title>Book_title_value</b:Title><b:Year>2021</b:Year>
 <b:Author><b:Author><b:NameList><b:Person><b:Last>Book_author_value</b:Last></b:Person></b:NameList></b:Author></b:Author>
 <b:City>City_value</b:City>
 <b:Publisher>Editorial_value</b:Publisher>
 <b:RefOrder>2</b:RefOrder>
 </b:Source>
 </b:Sources>
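
To make the mapping onto a `references` metadata field concrete, here is a minimal Python sketch that turns a `b:Sources` document like the one above into CSL-JSON-like dicts. It only handles the elements that appear in these samples. The function name `sources_to_references` and the two lookup tables are hypothetical and incomplete, not an established mapping.

```python
import xml.etree.ElementTree as ET

B = "http://schemas.openxmlformats.org/officeDocument/2006/bibliography"
NS = {"b": B}

# Hypothetical, incomplete mapping from Word source types to CSL types.
TYPE_MAP = {"Book": "book", "DocumentFromInternetSite": "webpage"}

# Word element name -> CSL variable, only for elements seen in the samples.
FIELD_MAP = {"Publisher": "publisher", "City": "publisher-place", "URL": "URL"}


def _text(node, name):
    """Text of the child element b:<name>, or '' if it is absent."""
    return (node.findtext("b:" + name, default="", namespaces=NS) or "").strip()


def sources_to_references(sources_xml):
    """Convert a b:Sources document into a list of CSL-JSON-like dicts."""
    root = ET.fromstring(sources_xml)
    refs = []
    for src in root.findall("b:Source", NS):
        ref = {
            "id": _text(src, "Tag"),
            "type": TYPE_MAP.get(_text(src, "SourceType"), "document"),
            "title": _text(src, "Title"),
        }
        people = src.findall("b:Author/b:Author/b:NameList/b:Person", NS)
        if people:
            ref["author"] = [{"family": _text(p, "Last")} for p in people]
        date = []
        for key in ("Year", "Month", "Day"):  # stop at the first missing part
            value = _text(src, key)
            if not value.isdigit():
                break
            date.append(int(value))
        if date:
            ref["issued"] = {"date-parts": [date]}
        for word_name, csl_name in FIELD_MAP.items():
            if _text(src, word_name):
                ref[csl_name] = _text(src, word_name)
        refs.append(ref)
    return refs
```

The XML could be read from the docx with `zipfile.ZipFile(path).read("customXml/item1.xml")`, although the item number may differ between documents. The `id` values are the same tags that the `CITATION` fields in document.xml refer to.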

custom-reference.docx

berserkwarwolf commented 3 years ago

How much would it cost to make this? I think I could turn this into a successful paid bounty.

tarleb commented 3 years ago

Software estimates are notoriously difficult, so take this with a thick crust of coarse salt: I'd guess it would take somewhere between 5 and 20 hours to get it coded and tested, depending on developer speed and experience. Typical software engineer rates differ wildly across regions and experience levels, roughly ranging from around $10 to $250 per hour. So, assuming the time estimate is correct, then this could land anywhere between $50 and $5000 with a median around $500.

And let's not forget about Hofstadter's Law: "It always takes longer than you expect, even when you take into account Hofstadter's Law".

Or in other words: I have no idea. Sorry. Maybe asking for quotes (e.g., here) could lead to better, reliable estimates.

charukiewicz commented 3 years ago

I'm just a user of Pandoc but I run a custom software development company and Haskell is one of our main areas of focus. I haven't looked into this specific request in great detail, but I can provide some general pricing guidelines:

Happy to answer any other questions on the matter.

bbernicker commented 2 years ago

@berserkwarwolf would you still be interested in doing a bounty for this? I would be willing to split it with you. Just email me at brendan@writebriefly.com and we can discuss.

KaptenLutman commented 9 months ago

@agusmba I could successfully write the docx bibliography to html but no hyperlinks are created from the main text to the bibliography entries. "Save as PDF" in Word doesn't honour the links either. However, I found a workaround:

  1. Copy the Word-generated bibliography (without numbers) inside the docx and remove the generated bibliography (a field).
  2. Insert a custom caption label "Reference" (or whatever) in front of the first reference and add brackets, e.g., [1]. I haven't found any way to define a label with brackets.
  3. Copy the first label to the other references, select them all, and press F9 (this will renumber them nicely).
  4. Use Word's cross-reference under reference type "Reference" to insert a citation. All references are fully visible in the menu.
  5. Add "]" manually at the citation (Word only copies what's in front of the label number).
  6. If references are deleted or added, renumber as in (3).
  7. Select all and press shift+F9 to toggle fields.
  8. To have pandoc create hyperlinks in the html output, globally replace "REF" in the fields by "HYPERLINK \l ". This creates links to other captions as well, for example, the built-in labels Figure, Equation, Table.
  9. Toggle all fields again, save docx, and run pandoc.

Steps 7-9 should be done after the document has been finalized.
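
The replacement in step 8 can be sketched in Python for a single field code string. This is only an illustration of the text substitution described above, not a tool that edits a docx; the function name `ref_to_hyperlink` is hypothetical, and it deliberately leaves other field types such as `PAGEREF` alone.

```python
import re

# Match REF only as the field keyword at the start of the field code.
_REF = re.compile(r"^(\s*)REF\b")


def ref_to_hyperlink(field_code):
    """Rewrite ' REF <bookmark> ...' as ' HYPERLINK \\l <bookmark> ...'."""
    return _REF.sub(lambda m: m.group(1) + "HYPERLINK \\l", field_code, count=1)


print(ref_to_hyperlink(r" REF _Ref123 \h "))
```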

agusmba commented 9 months ago

Yeah, I seem to recall that native Word bibliography didn't add hyperlinks like pandoc's current default does.

allefeld commented 1 week ago

It seems that in recent versions of Zotero, the complete bibliographic information is embedded. This tool, which works for Zotero and Mendeley, could be a starting point: https://rintze.zelle.me/ref-extractor/