jgm / pandoc

Universal markup converter
https://pandoc.org

Support Zotero citations in docx #7840

Closed tarleb closed 2 years ago

tarleb commented 2 years ago

Sample code:

        <w:p w14:paraId="2C4DA6EC" w14:textId="293040C8" w:rsidR="009E522F" w:rsidRDefault="002E5F17">
            <w:r>
                <w:fldChar w:fldCharType="begin"/>
            </w:r>
            <w:r>
                <w:instrText xml:space="preserve"> ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"AQwSemPs","properties":{"formattedCitation":"(Hawking, 2010)","plainCitation":"(Hawking, 2010)","noteIndex":0},"citationItems":[{"id":46,"uris":["http://zotero.org/users/40613/items/EAG35HWU"],"uri":["http://zotero.org/users/40613/items/EAG35HWU"],"itemData":{"id":46,"type":"article-journal","title":"Test article one","author":[{"family":"Hawking","given":"Stephen"}],"issued":{"date-parts":[["2010"]]}}}],"schema":"https://github.com/citation-style-language/schema/raw/master/csl-citation.json"} </w:instrText>
            </w:r>
            <w:r>
                <w:fldChar w:fldCharType="separate"/>
            </w:r>
            <w:r w:rsidRPr="002E5F17">
                <w:rPr>
                    <w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Calibri"/>
                </w:rPr>
                <w:t>(Hawking, 2010)</w:t>
            </w:r>
            <w:r>
                <w:fldChar w:fldCharType="end"/>
            </w:r>
        </w:p>

My understanding is that the XML element comes with a full CSL JSON entry, so we can parse that.
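
As a rough illustration of that parsing step (this is not pandoc's docx reader, which is written in Haskell), the Python sketch below collects the `w:instrText` runs that sit between the `begin` and `separate` field characters, strips the `ADDIN ZOTERO_ITEM CSL_CITATION` prefix, and decodes the rest as JSON. The function name `zotero_citations` is made up for the example, and the sample paragraph is a trimmed copy of the XML above.

```python
import json
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
PREFIX = "ADDIN ZOTERO_ITEM CSL_CITATION"


def zotero_citations(paragraph):
    """Yield the decoded CSL citation JSON of each Zotero field in a w:p element."""
    instr = None  # list of instruction fragments while inside a field code
    for node in paragraph.iter():
        if node.tag == f"{{{W}}}fldChar":
            kind = node.get(f"{{{W}}}fldCharType")
            if kind == "begin":
                instr = []
            elif kind == "separate" and instr is not None:
                # The field code may be split over several runs, so join first.
                code = "".join(instr).strip()
                if code.startswith(PREFIX):
                    yield json.loads(code[len(PREFIX):])
                instr = None
            elif kind == "end":
                instr = None
        elif node.tag == f"{{{W}}}instrText" and instr is not None:
            instr.append(node.text or "")


# Trimmed version of the sample paragraph (assumption: nested fields are ignored).
SAMPLE = """<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> ADDIN ZOTERO_ITEM CSL_CITATION {"citationID":"AQwSemPs","properties":{"formattedCitation":"(Hawking, 2010)"},"citationItems":[{"id":46}]} </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>(Hawking, 2010)</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>"""

citations = list(zotero_citations(ET.fromstring(SAMPLE)))
print(citations[0]["properties"]["formattedCitation"])  # (Hawking, 2010)
```

The text run after `separate` is only the rendered result of the field, so the sketch skips it and relies on the JSON alone.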

jgm commented 2 years ago

So is the idea to parse this from docx into a native pandoc Cite inline? The fallback text could be the formattedCitation part of the JSON. In addition, the bibliographic information would have to be extracted, converted, and added to references in metadata. Good idea, I think, and it should be fairly straightforward.
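
A minimal Python sketch of those two pieces, assuming the field's JSON has already been decoded into a dictionary: it takes `formattedCitation` as the fallback text and pulls out the embedded `itemData` records that would feed the references. The helper name `split_citation` is hypothetical, and the conversion of the records into pandoc's metadata format is left out.

```python
def split_citation(citation):
    """Return (fallback_text, references) for one decoded CSL citation object."""
    # Text to show if the citation cannot be rendered from its data.
    fallback = citation.get("properties", {}).get("formattedCitation", "")
    # Embedded bibliographic records, one per citation item that carries itemData.
    references = [
        item["itemData"]
        for item in citation.get("citationItems", [])
        if "itemData" in item
    ]
    return fallback, references


# Trimmed copy of the JSON from the sample field in this thread.
citation = {
    "citationID": "AQwSemPs",
    "properties": {
        "formattedCitation": "(Hawking, 2010)",
        "plainCitation": "(Hawking, 2010)",
        "noteIndex": 0,
    },
    "citationItems": [
        {
            "id": 46,
            "itemData": {
                "id": 46,
                "type": "article-journal",
                "title": "Test article one",
                "author": [{"family": "Hawking", "given": "Stephen"}],
                "issued": {"date-parts": [["2010"]]},
            },
        }
    ],
}

fallback, references = split_citation(citation)
print(fallback)                # (Hawking, 2010)
print(references[0]["title"])  # Test article one
```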

jgm commented 2 years ago

Can you upload a complete document containing these, for testing?

frederik commented 2 years ago

Hi @jgm, here's a DOCX with a number of references (including prefixes etc.) for testing. I included the reference list generated by Zotero, which looks like it would not be needed if we have the original in-text citations.

zotero-citations.docx

jooyoungseo commented 2 years ago

Will there be a more detailed explanation of this change in the user guide? As far as I understand, Zotero fields are now automatically converted into @citation_key and the corresponding bibliography entries in the docx-to-md conversion. Is that correct? Please correct me if I am wrong.

jgm commented 2 years ago

Not yet. We've only implemented the skeleton for this so far.

jgm commented 2 years ago

Can someone upload a sample docx using these Zotero fields? It would be good if it demonstrated the following:

frederik commented 2 years ago

Adding an example:

(Jones, 1999; Smith, 2000) and the same book again (Jones, 1999)
(see Smith, 2000, Chapter 22 and others)

The first line creates two citations: the first has two citation items (Jones and Smith), the second has one (Jones). The item data for the re-used book is the same in both citations (the id is 273 in both cases).

The second line contains the item data for Smith again (same ID 272 as in the first citation). The chapter locator is added to the citation item.
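
Because the same item data shows up in several citations, whatever reads these fields would need to keep each reference only once. The Python sketch below shows one way to do that, keyed on the item id. The data is an illustrative reconstruction of the three citations described above, not the actual content of the attached file; the `prefix`, `locator`, `label` and `suffix` keys follow the CSL citation schema, and `collect_references` is a made-up name.

```python
def collect_references(citations):
    """Gather each distinct itemData once, keyed by its id, in first-seen order."""
    refs = {}
    for citation in citations:
        for item in citation["citationItems"]:
            data = item.get("itemData")
            if data is not None:
                # setdefault keeps the first copy and ignores later repeats.
                refs.setdefault(data["id"], data)
    return list(refs.values())


# Illustrative item data only (ids taken from the description above).
jones = {"id": 273, "author": [{"family": "Jones"}], "issued": {"date-parts": [["1999"]]}}
smith = {"id": 272, "author": [{"family": "Smith"}], "issued": {"date-parts": [["2000"]]}}

citations = [
    # (Jones, 1999; Smith, 2000)
    {"citationItems": [{"id": 273, "itemData": jones}, {"id": 272, "itemData": smith}]},
    # (Jones, 1999)
    {"citationItems": [{"id": 273, "itemData": jones}]},
    # (see Smith, 2000, Chapter 22 and others): the locator lives on the item,
    # next to itemData, not inside it.
    {"citationItems": [{"id": 272, "itemData": smith, "prefix": "see",
                        "locator": "22", "label": "chapter", "suffix": "and others"}]},
]

references = collect_references(citations)
print([ref["id"] for ref in references])  # [273, 272]
```

The locator, prefix and suffix stay with the individual citation item, so deduplicating the references does not lose them.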

zotero-citations-2.docx

If you need anything else (or would like a discussion of the JSON structures), I'd be happy to help.

jgm commented 2 years ago

Here's a formatted version of the embedded JSON in the above example:

{
  "citationID": "AQwSemPs",
  "properties": {
    "formattedCitation": "(Hawking, 2010)",
    "plainCitation": "(Hawking, 2010)",
    "noteIndex": 0
  },
  "citationItems": [
    {
      "id": 46,
      "uris": [
        "http://zotero.org/users/40613/items/EAG35HWU"
      ],
      "uri": [
        "http://zotero.org/users/40613/items/EAG35HWU"
      ],
      "itemData": {
        "id": 46,
        "type": "article-journal",
        "title": "Test article one",
        "author": [
          {
            "family": "Hawking",
            "given": "Stephen"
          }
        ],
        "issued": {
          "date-parts": [
            [
              "2010"
            ]
          ]
        }
      }
    }
  ],
  "schema": "https://github.com/citation-style-language/schema/raw/master/csl-citation.json"
}

jgm commented 2 years ago

We'll need to modify the CitationItem type in citeproc to allow this kind of embedded itemData:

itemData :: Reference a

After that, we can simply use the FromJSON instance for Citeproc.Citation to parse this. And then we'll need to (a) convert this to a Pandoc.Citation and (b) extract the embedded Reference and put it in state, so it can be added to the references in Metadata.
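
To make the shape of that plan concrete, here is a loose Python analogue (the real change would be in Haskell, in citeproc and the docx reader). It models a citation item with an optional embedded reference, builds a record whose keys mirror the field names of pandoc's `Citation` type (`citationId`, `citationPrefix`, `citationSuffix`), and stores the embedded reference in a state dictionary. Using the numeric Zotero id as the citation key and plain strings for prefix and suffix are simplifying assumptions; `CitationItem.from_json` and `to_pandoc` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CitationItem:
    item_id: str
    prefix: str = ""
    suffix: str = ""
    item_data: Optional[dict] = None  # the proposed embedded reference, if any

    @classmethod
    def from_json(cls, obj):
        """Build a CitationItem from one entry of the citationItems array."""
        return cls(
            item_id=str(obj["id"]),
            prefix=obj.get("prefix", ""),
            suffix=obj.get("suffix", ""),
            item_data=obj.get("itemData"),
        )


def to_pandoc(item, state):
    """(a) Build a pandoc-style citation record; (b) stash the reference in state."""
    if item.item_data is not None:
        state.setdefault(item.item_id, item.item_data)
    return {
        "citationId": item.item_id,
        "citationPrefix": item.prefix,
        "citationSuffix": item.suffix,
    }


state = {}  # stands in for reader state that later becomes `references` metadata
raw = {"id": 46, "itemData": {"id": 46, "type": "article-journal", "title": "Test article one"}}
cite = to_pandoc(CitationItem.from_json(raw), state)
print(cite["citationId"])    # 46
print(state["46"]["title"])  # Test article one
```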