Open jkr opened 6 years ago
+++ Jesse Rosenthal [Dec 11 17 13:45 ]:
Ideally, if we're going docx->markdown, we'd expect the citations to be similarly built into the markdown file. @jgm, do you think it would work to have this conversion produce yaml-block citation entries?
Sure, this could be done. It's just a matter of adding a `references` field to the metadata (or adding items to an existing `references` field).
Okay -- I'll take a look at how bearable the xml format is. Although I wonder how many people use this, vs zotero or endnote or whatever.
I'll also take a look around and see (a) how much people seem to use this feature, and (b) how easy it is to export citations into zotero/endnote/etc. If it looks useful, it might be worth considering putting it in the writer (sort of as a docx analogue to `--biblatex`/`--natbib`).
Hm, I know of no academics who use the Word Citations feature. It is very basic (manual entry) but quite elegantly integrated in Word; Endnote is sadly the king of the hill for most people. Being able to retain reference data from the reader in YAML metadata would be neat, but I suspect the number of users would be exceedingly small :-)
I do wonder if Endnote uses the same XML store for its travelling library (it can store refs in a Word doc). If so, being able to import the references would become a much more useful feature! But I imagine Endnote deliberately separates them; I'll have a look at a sample docx and see...
Endnote places all reference data in document.xml directly as escaped xml:
<w:bookmarkStart w:id="0" w:name="_GoBack"/>
<w:r><w:fldChar w:fldCharType="begin"/></w:r>
<w:r><w:instrText xml:space="preserve"> ADDIN EN.CITE &lt;EndNote&gt;&lt;Cite&gt;&lt;Author&gt;Zeki&lt;/Author&gt;&lt;/Cite&gt;&lt;/EndNote&gt;</w:instrText></w:r>
<w:r><w:fldChar w:fldCharType="separate"/></w:r>
<w:r><w:rPr><w:noProof/></w:rPr>
<w:t>(Zeki and Shipp 1989)</w:t></w:r>
<w:r><w:fldChar w:fldCharType="end"/></w:r>
<w:bookmarkEnd w:id="0"/>
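A minimal Python sketch of how the payload in the snippet above might be recovered, assuming the field instruction stores the EndNote XML entity-escaped as described. The function name `endnote_authors` and the shortened sample paragraph are illustrative only; nothing here is part of pandoc or EndNote.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Shortened version of the paragraph content shown in the thread (assumed wrapper <w:p>).
SAMPLE = f"""<w:p xmlns:w="{W}">
<w:r><w:fldChar w:fldCharType="begin"/></w:r>
<w:r><w:instrText xml:space="preserve"> ADDIN EN.CITE &lt;EndNote&gt;&lt;Cite&gt;&lt;Author&gt;Zeki&lt;/Author&gt;&lt;/Cite&gt;&lt;/EndNote&gt;</w:instrText></w:r>
<w:r><w:fldChar w:fldCharType="separate"/></w:r>
<w:r><w:t>(Zeki and Shipp 1989)</w:t></w:r>
<w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>"""


def endnote_authors(paragraph_xml):
    """Return the <Author> values found in an 'ADDIN EN.CITE' field instruction."""
    root = ET.fromstring(paragraph_xml)
    # The instruction may be split across several w:instrText runs, so join them.
    # The XML parser has already unescaped &lt; and &gt; at this point.
    instr = "".join(t.text or "" for t in root.iter(f"{{{W}}}instrText"))
    marker = "ADDIN EN.CITE"
    if marker not in instr:
        return []
    payload = instr.split(marker, 1)[1].strip()
    cites = ET.fromstring(payload)  # the payload is itself an XML document
    return [author.text for author in cites.iter("Author")]


print(endnote_authors(SAMPLE))
```

Only the `Author` element from the thread's sample is read here; real EndNote payloads presumably carry more fields, which this sketch does not attempt to model.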
iandol> Hm, I know of no academics who use the Word Citations feature, it is very basic (manual entry) but quite elegantly integrated in Word, Endnote is sadly the king of the hill for most people.
On the contrary, in my region I have seen MS Word users relying heavily on the Word citations feature, and only one using Endnote so far. Hence, it would be a really useful feature.
It's definitely on my to-do list.
Hello! Has there been any further development in this area? I have come across several manuscripts which make use of Microsoft Word's built-in reference management tool.
For my use case, I would like to go from docx -> markdown and have Pandoc extract the references which were made using Microsoft's reference management tool.
I was also wondering if this issue could be addressed in the near future.
Not quite the same, but possibly helpful: https://rintze.zelle.me/ref-extractor/
I'm wondering if the reverse is also possible -- going from markdown to docx and writing citations as native Word citations? This has the advantage that future users of the docx could (e.g.) add their own references, or change citation style, all from within Word.
My use case: I'm trying to help a friend tidy and format her PhD thesis. I want to go docx -> md, then script whatever tidying is needed, then return md -> docx so she has a useable docx she can continue editing ahead of submission.
I also wanted to add that this would be helpful for the drafting process. A lot of publications and advisors expect docx, and after a while it often makes sense to pass a document back and forth with track changes turned on. Being able to go back and forth more easily would solve the issue of getting "stuck" in docx-land.
FWIW, Zotero (and also Mendeley) can include the bibliography data in the document as bookmarks + custom properties.
A single citation has two parts. The citation text in `document.xml`:
<w:r>
<w:t xml:space="preserve">A previous study</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve"/>
</w:r>
<w:bookmarkStart w:id="5" w:name="ZOTERO_BREF_Tce3DUyb3IiZ"/>
<w:r>
<w:t>(Hartwigsen, Baumgaertner, et al., 2010)</w:t>
</w:r>
<w:bookmarkEnd w:id="5"/>
<w:r>
<w:t xml:space="preserve">has shown that 10Hz rTMS of the left or right p</w:t>
</w:r>
and the bibliography data as custom properties (CSL JSON-formatted) in `custom.xml`:
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="343" name="ZOTERO_BREF_Tce3DUyb3IiZ_1">
<vt:lpwstr>ZOTERO_ITEM CSL_CITATION {"citationID":"0Bf2LOgP","properties":{"formattedCitation":"(Hartwigsen, Baumgaertner, et al., 2010)","plainCitation":"(Hartwigsen, Baumgaertner, et al., 2010)","noteIndex":0},"citationItems":[{"id":520,"uris":["http://zotero.org/</vt:lpwstr>
</property>
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="344" name="ZOTERO_BREF_Tce3DUyb3IiZ_2">
<vt:lpwstr>users/1707117/items/95WI3CQL"],"uri":["http://zotero.org/users/1707117/items/95WI3CQL"],"itemData":{"id":520,"type":"article-journal","title":"Phonological decisions require both the left and right supramarginal gyri","container-title":"Proc. Natl. Acad.</vt:lpwstr>
</property>
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="345" name="ZOTERO_BREF_Tce3DUyb3IiZ_3">
<vt:lpwstr>Sci.","page":"16494–16499","volume":"107","issue":"38","DOI":"10.1073/pnas.1008121107","ISSN":"0027-8424","note":"PMID: 20807747","author":[{"family":"Hartwigsen","given":"Gesa"},{"family":"Baumgaertner","given":"Annette"},{"family":"Price","given":"Cathy</vt:lpwstr>
</property>
<property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="346" name="ZOTERO_BREF_Tce3DUyb3IiZ_4">
<vt:lpwstr>J."},{"family":"Koehnke","given":"M"},{"family":"Ulmer","given":"S"},{"family":"Siebner","given":"Hartwig Roman"}],"issued":{"date-parts":[["2010",9]]}}}],"schema":"https://github.com/citation-style-language/schema/raw/master/csl-citation.json"}</vt:lpwstr>
</property>
So, recovering citations requires support for reading docx custom properties (#3034) and bookmarks (#6781). The resulting markdown would look like this:
---
ZOTERO_BREF_Tce3DUyb3IiZ_1: 'ZOTERO_ITEM CSL_CITATION {"citationID":"0Bf2LOgP","properties":{"formattedCitation":"(Hartwigsen, Baumgaertner, et al., 2010)","plainCitation":"(Hartwigsen, Baumgaertner, et al., 2010)","noteIndex":0},"citationItems":[{"id":520,"uris":["http://zotero.org/'
ZOTERO_BREF_Tce3DUyb3IiZ_2: 'users/1707117/items/95WI3CQL"],"uri":["http://zotero.org/users/1707117/items/95WI3CQL"],"itemData":{"id":520,"type":"article-journal","title":"Phonological decisions require both the left and right supramarginal gyri","container-title":"Proc. Natl. Acad.'
ZOTERO_BREF_Tce3DUyb3IiZ_3: 'Sci.","page":"16494–16499","volume":"107","issue":"38","DOI":"10.1073/pnas.1008121107","ISSN":"0027-8424","note":"PMID: 20807747","author":[{"family":"Hartwigsen","given":"Gesa"},{"family":"Baumgaertner","given":"Annette"},{"family":"Price","given":"Cathy'
ZOTERO_BREF_Tce3DUyb3IiZ_4: 'J."},{"family":"Koehnke","given":"M"},{"family":"Ulmer","given":"S"},{"family":"Siebner","given":"Hartwig Roman"}],"issued":{"date-parts":[["2010",9]]}}}],"schema":"https://github.com/citation-style-language/schema/raw/master/csl-citation.json"}'
...
A previous study [(Hartwigsen, Baumgaertner, et al., 2010)](#ZOTERO_BREF_Tce3DUyb3IiZ) has shown that 10Hz rTMS of the left or right p
After that, the rest could be done with a custom filter.
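As a rough illustration of what such a filter would have to do with the chunked properties, here is a Python sketch. It assumes the custom properties are already available as a name-to-string mapping (as in the YAML keys above); the helper names `join_zotero_chunks` and `csl_items` and the shortened demo data are made up for this example.

```python
import json

PREFIX = "ZOTERO_ITEM CSL_CITATION "


def join_zotero_chunks(props):
    """Group NAME_1, NAME_2, ... values by NAME and join them in numeric order."""
    groups = {}
    for name, value in props.items():
        base, _, index = name.rpartition("_")
        if base.startswith("ZOTERO_BREF_") and index.isdigit():
            groups.setdefault(base, []).append((int(index), value))
    return {
        base: "".join(value for _, value in sorted(parts))
        for base, parts in groups.items()
    }


def csl_items(props):
    """Map each bookmark name to the CSL item data embedded in its citation."""
    items = {}
    for bookmark, text in join_zotero_chunks(props).items():
        if text.startswith(PREFIX):
            citation = json.loads(text[len(PREFIX):])
            items[bookmark] = [ci["itemData"] for ci in citation["citationItems"]]
    return items


# Shortened demo data; the split falls in the middle of a JSON key on purpose,
# and the chunks are listed out of order.
demo = {
    "ZOTERO_BREF_Tce3DUyb3IiZ_2": 'le":"Phonological decisions"}}]}',
    "ZOTERO_BREF_Tce3DUyb3IiZ_1": 'ZOTERO_ITEM CSL_CITATION {"citationItems":[{"id":520,"itemData":{"id":520,"tit',
}
print(csl_items(demo))
```

The bookmark name (`ZOTERO_BREF_Tce3DUyb3IiZ`) is the key that ties the reassembled CSL data back to the link target in the body text.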
I've tested the native ooxml citations using pandoc's reference docx (see attached file at the end).
If you insert a citation, you get this in the text (document.xml):
<w:sdt>
<w:sdtPr>
<w:id w:val="692810013"/>
<w:citation/>
</w:sdtPr>
<w:sdtContent>
<w:r w:rsidR="006E4349">
<w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r w:rsidR="006E4349">
<w:rPr>
<w:lang w:val="es-ES"/>
</w:rPr>
<w:instrText xml:space="preserve"> CITATION Aut21 \l 3082 </w:instrText>
</w:r>
<w:r w:rsidR="006E4349">
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidR="006E4349">
<w:rPr>
<w:noProof/>
<w:lang w:val="es-ES"/>
</w:rPr>
<w:t xml:space="preserve"></w:t>
</w:r>
<w:r w:rsidR="006E4349" w:rsidRPr="006E4349">
<w:rPr>
<w:noProof/>
<w:lang w:val="es-ES"/>
</w:rPr>
<w:t>[1]</w:t>
</w:r>
<w:r w:rsidR="006E4349">
<w:fldChar w:fldCharType="end"/>
</w:r>
</w:sdtContent>
</w:sdt>
</w:p>
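The part of the snippet above that a reader would need is the field instruction, ` CITATION Aut21 \l 3082 `. A hedged Python sketch of turning that string into a pandoc-style citation follows. The function names are hypothetical, and the switch handling is a simplification: a switch followed by a plain token is assumed to take it as its value, and quoted arguments are not handled.

```python
def parse_citation_field(instr):
    """Split a Word CITATION field instruction into its source tag and switches."""
    tokens = instr.split()
    if len(tokens) < 2 or tokens[0] != "CITATION":
        return None
    switches = {}
    rest = tokens[2:]
    i = 0
    while i < len(rest):
        if rest[i].startswith("\\"):
            # Assumption: a switch followed by a non-switch token takes it as its value.
            has_value = i + 1 < len(rest) and not rest[i + 1].startswith("\\")
            switches[rest[i]] = rest[i + 1] if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1
    return {"tag": tokens[1], "switches": switches}


def to_pandoc_citation(instr):
    """Render the field as a pandoc citation such as [@Aut21], or None if not a citation."""
    field = parse_citation_field(instr)
    return f"[@{field['tag']}]" if field else None


print(parse_citation_field(" CITATION Aut21 \\l 3082 "))
print(to_pandoc_citation(" CITATION Aut21 \\l 3082 "))
```

The tag (`Aut21`) is what links the field to the matching `b:Tag` in the bibliography part of the docx.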
And additional xml files, starting with customXml/item1.xml:
<b:Sources xmlns:b="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"
xmlns="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" SelectedStyle="\IEEE2006OfficeOnline.xsl" StyleName="IEEE" Version="2006">
<b:Source>
<b:Tag>Aut21</b:Tag>
<b:SourceType>DocumentFromInternetSite</b:SourceType>
<b:Guid>{D07F09F5-5F6E-41E1-A805-3722F7520E45}</b:Guid>
<b:Title>Website_name_value</b:Title>
<b:Year>2021</b:Year><b:Month>01</b:Month><b:Day>01</b:Day>
<b:YearAccessed>2021</b:YearAccessed><b:MonthAccessed>01</b:MonthAccessed><b:DayAccessed>17</b:DayAccessed>
<b:URL>https://www.example.com</b:URL>
<b:Author>
<b:Author><b:NameList><b:Person><b:Last>Author_value</b:Last></b:Person></b:NameList></b:Author>
</b:Author><b:RefOrder>1</b:RefOrder>
</b:Source>
</b:Sources>
customXml/itemProps1.xml (7 lines, no special content):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ds:datastoreItem ds:itemID="{C6A3015E-D30B-444E-98E5-7A504E4E7FCF}"
xmlns:ds="http://schemas.openxmlformats.org/officeDocument/2006/customXml">
<ds:schemaRefs>
<ds:schemaRef ds:uri="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"/>
</ds:schemaRefs>
</ds:datastoreItem>
customXml/_rels/item1.xml.rels:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships
xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/customXmlProps" Target="itemProps1.xml"/>
</Relationships>
If we add additional bibliographic entries, they are added to customXml/item1.xml:
<b:Sources xmlns:b="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"
xmlns="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" SelectedStyle="\IEEE2006OfficeOnline.xsl" StyleName="IEEE" Version="2006">
<b:Source>
<b:Tag>Aut21</b:Tag>
<b:SourceType>DocumentFromInternetSite</b:SourceType>
<b:Guid>{D07F09F5-5F6E-41E1-A805-3722F7520E45}</b:Guid>
<b:Title>Website_name_value</b:Title><b:Year>2021</b:Year><b:Month>01</b:Month><b:Day>01</b:Day>
<b:YearAccessed>2021</b:YearAccessed><b:MonthAccessed>01</b:MonthAccessed><b:DayAccessed>17</b:DayAccessed>
<b:URL>https://www.example.com</b:URL>
<b:Author><b:Author><b:NameList><b:Person><b:Last>Author_value</b:Last></b:Person></b:NameList></b:Author></b:Author>
<b:RefOrder>1</b:RefOrder>
</b:Source>
<b:Source>
<b:Tag>Boo21</b:Tag>
<b:SourceType>Book</b:SourceType>
<b:Guid>{93F2D6C8-FC81-44E3-91FF-8407E1703778}</b:Guid>
<b:Title>Book_title_value</b:Title><b:Year>2021</b:Year>
<b:Author><b:Author><b:NameList><b:Person><b:Last>Book_author_value</b:Last></b:Person></b:NameList></b:Author></b:Author>
<b:City>City_value</b:City>
<b:Publisher>Editorial_value</b:Publisher>
<b:RefOrder>2</b:RefOrder>
</b:Source>
</b:Sources>
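To make the mapping concrete, here is a Python sketch that reads a `b:Sources` document like the one above and produces CSL-style entries of the kind that could go into a `references` metadata field. The type mapping and the choice of CSL keys are assumptions for illustration, and `b:First` is not in the thread's sample (it is assumed from the Office bibliography schema); this is not how pandoc does or would do it.

```python
import json
import xml.etree.ElementTree as ET

B = "http://schemas.openxmlformats.org/officeDocument/2006/bibliography"
NS = {"b": B}

# Assumed, partial mapping from Word source types to CSL types.
TYPE_MAP = {"Book": "book", "DocumentFromInternetSite": "webpage"}

# Shortened copy of the second source from the thread.
SAMPLE = f"""<b:Sources xmlns:b="{B}">
<b:Source>
<b:Tag>Boo21</b:Tag>
<b:SourceType>Book</b:SourceType>
<b:Title>Book_title_value</b:Title><b:Year>2021</b:Year>
<b:Author><b:Author><b:NameList><b:Person><b:Last>Book_author_value</b:Last></b:Person></b:NameList></b:Author></b:Author>
<b:City>City_value</b:City>
<b:Publisher>Editorial_value</b:Publisher>
</b:Source>
</b:Sources>"""


def source_to_csl(src):
    """Convert one b:Source element into a CSL-style dict."""
    def text(tag):
        return src.findtext(f"b:{tag}", default=None, namespaces=NS)

    entry = {"id": text("Tag"), "type": TYPE_MAP.get(text("SourceType"), "document")}
    for csl_key, tag in (("title", "Title"), ("URL", "URL"),
                         ("publisher", "Publisher"), ("publisher-place", "City")):
        value = text(tag)
        if value:
            entry[csl_key] = value

    # Collect year, month, day in order, stopping at the first missing or non-numeric part.
    parts = []
    for tag in ("Year", "Month", "Day"):
        value = text(tag)
        if not value or not value.isdigit():
            break
        parts.append(int(value))
    if parts:
        entry["issued"] = {"date-parts": [parts]}

    authors = []
    for person in src.findall("b:Author/b:Author/b:NameList/b:Person", NS):
        name = {"family": person.findtext("b:Last", default="", namespaces=NS)}
        first = person.findtext("b:First", default=None, namespaces=NS)  # assumed element
        if first:
            name["given"] = first
        authors.append(name)
    if authors:
        entry["author"] = authors
    return entry


def sources_to_csl(xml_text):
    root = ET.fromstring(xml_text)
    return [source_to_csl(src) for src in root.findall("b:Source", NS)]


print(json.dumps(sources_to_csl(SAMPLE), indent=2))
```

Using `b:Tag` as the CSL `id` keeps the entries addressable by the same key that the `CITATION` fields in document.xml use.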
How much would it cost to make this? I think I could turn this into a successful paid bounty.
Software estimates are notoriously difficult, so take this with a thick crust of coarse salt: I'd guess it would take somewhere between 5 and 20 hours to get it coded and tested, depending on developer speed and experience. Typical software engineer rates differ wildly across regions and experience levels, roughly ranging from around $10 to $250 per hour. So, assuming the time estimate is correct, then this could land anywhere between $50 and $5000 with a median around $500.
And let's not forget about Hofstadter's Law: "It always takes longer than you expect, even when you take into account Hofstadter's Law".
Or in other words: I have no idea. Sorry. Maybe asking for quotes (e.g., here) could lead to better, reliable estimates.
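For what it's worth, the arithmetic behind the range above can be checked in a few lines of Python. Reading the "median around $500" as the geometric midpoint of the range is an interpretation made here, not something the estimate itself states.

```python
import math

hours = (5, 20)    # estimated effort range, in hours
rates = (10, 250)  # hourly rate range, in dollars

low = hours[0] * rates[0]          # 5 * 10 = 50
high = hours[1] * rates[1]         # 20 * 250 = 5000
midpoint = math.sqrt(low * high)   # geometric midpoint: sqrt(250000) = 500.0

print(low, high, midpoint)
```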
I'm just a user of Pandoc but I run a custom software development company and Haskell is one of our main areas of focus. I haven't looked into this specific request in great detail, but I can provide some general pricing guidelines:
Happy to answer any other questions on the matter.
@berserkwarwolf would you still be interested in doing a bounty for this? I would be willing to split it with you. Just email me at brendan@writebriefly.com and we can discuss.
@agusmba I could successfully write the docx bibliography to html but no hyperlinks are created from the main text to the bibliography entries. "Save as PDF" in Word doesn't honour the links either. However, I found a workaround:
Steps 8-10 should be done after the document has been finalized.
Yeah, I seem to recall that native Word bibliography didn't add hyperlinks like pandoc's current default does.
It seems in recent versions of Zotero, the complete bibliographic information is embedded. This tool which works for Zotero and Mendeley could be a starting point: https://rintze.zelle.me/ref-extractor/
Docx has a built-in reference tracker, which keeps the references in an xml file in the document (and can change citation format on the fly). The docx reader could read these and convert them into pandoc citations.
Ideally, if we're going docx->markdown, we'd expect the citations to be similarly built into the markdown file. @jgm, do you think it would work to have this conversion produce yaml-block citation entries?