jgm / pandoc

Universal markup converter
docx+citations from zotero: different ids used for in-text and CSL-YAML #10366

Closed (iandol closed this 2 weeks ago)

iandol commented 2 weeks ago

I have a Word docx, written by a collaborator using Zotero, that contains the following embedded example reference (obtained by toggling field codes and copy-pasting):

    "id": "Ml6QQcFl/5Udmir6l",
    "uris": [
    "itemData": {
        "id": 2102,
        "type": "article-journal",
        "abstract": "...",
        "container-title": "NeuroImage",
        "DOI": "10.1016/j.neuroimage.2012.01.078",
        "ISSN": "1095-9572",
        "issue": "2",
        "journalAbbreviation": "Neuroimage",
        "language": "eng",
        "note": "PMID: 22285220",
        "page": "1307-1315",
        "source": "PubMed",
        "title": "Abnormal cortical processing of pattern motion in amblyopia: evidence from fMRI",
        "title-short": "Abnormal cortical processing of pattern motion in amblyopia",
        "volume": "60",
        "author": [
                "family": "Thompson",
                "given": "B."
                "family": "Villeneuve",
                "given": "M. Y."
                "family": "Casanova",
                "given": "C."
                "family": "Hess",
                "given": "R. F."
        "issued": {
            "date-parts": [

The important part is that there is a top-level "id": "Ml6QQcFl/5Udmir6l" and a separate "itemData": { "id": 2102, and pandoc unfortunately uses the first one for the in-text citation:

However, with further research, fMRI studies have revealed that amblyopic patients exhibit not only functional abnormalities in V1 but also in other regions, such as V2, V3, V4, V5, and higher-order
areas like MT+ [@Ml6QQcFl/5Udmir6l].

but the YAML uses the other id:

- abstract: ...
  author:
  - family: Thompson
    given: B.
  - family: Villeneuve
    given: M. Y.
  - family: Casanova
    given: C.
  - family: Hess
    given: R. F.
  container-title: NeuroImage
  container-title-short: Neuroimage
  DOI: 10.1016/j.neuroimage.2012.01.078
  id: 2102
  ISSN: 1095-9572
  issue: 2
  issued: 2012-04-02
  language: eng
  page: 1307-1315
  PMID: 22285220
  source: PubMed
  title: "Abnormal cortical processing of pattern motion in amblyopia:
    evidence from fMRI"
  title-short: Abnormal cortical processing of pattern motion in
    amblyopia
  type: article-journal
  volume: 60

As this is a docx from a collaborator, I don't have his database and I don't know why the Zotero data is like this (most references are like this), but this occurs in the wild, and I'd hope that making pandoc select the id consistently would be easy to do?
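
To check how many citations in a document are affected, the two ids can be compared directly. The following is a minimal Python sketch, assuming a field code has been copied out as a JSON string whose `citationItems` entries each carry an `id` and an `itemData` object; `find_id_mismatches` is a hypothetical helper name, not part of pandoc or Zotero.

```python
import json


def find_id_mismatches(field_code_json):
    """Return (citation_item_id, item_data_id) pairs that disagree."""
    citation = json.loads(field_code_json)
    mismatches = []
    for item in citation.get("citationItems", []):
        outer_id = item.get("id")
        inner_id = item.get("itemData", {}).get("id")
        # Compare as strings: the outer id may be a string and the inner one a number.
        if inner_id is not None and str(outer_id) != str(inner_id):
            mismatches.append((outer_id, inner_id))
    return mismatches


# Reduced version of the reference shown above (most fields omitted).
example = (
    '{"citationItems": [{"id": "Ml6QQcFl/5Udmir6l", '
    '"itemData": {"id": 2102, "type": "article-journal"}}]}'
)
print(find_id_mismatches(example))  # [('Ml6QQcFl/5Udmir6l', 2102)]
```

An empty result would mean the in-text key and the bibliography id already agree for every item in that field code.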

iandol commented 2 weeks ago


Minimal test docx

pandoc -s --extract-media=./ -f docx+citations Test.docx -o Test.md
pandoc --version
pandoc 3.5
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /Users/ian/.local/share/pandoc
Copyright (C) 2006-2024 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

jgm commented 2 weeks ago

Here is the citation embedded in Test-2.docx. You can see that the id of the first citationItem is indeed Ml6QQcFl/5Udmir6l. There is then an itemData that embeds bibliographic information, and it uses a different id, 2102. I'm not sure how it's supposed to work in this case (if it's not just a mistake), but perhaps we're meant to use the id Ml6QQcFl/5Udmir6l in both places? @bdarcus do you know?

  "citationID": "NLAuDP0i",
  "properties": {
    "formattedCitation": "(Thompson et al., 2012)",
    "plainCitation": "(Thompson et al., 2012)",
    "noteIndex": 0
  "citationItems": [
      "id": "Ml6QQcFl/5Udmir6l",
      "uris": [
      "itemData": {
        "id": 2102,
        "type": "article-journal",
        "abstract": "Converging evidence from human psychophysics and animal neurophysiology indicates that amblyopia is associated with abnormal function of area MT, a motion sensitive region of the extrastriate visual cortex. In this context, the recent finding that amblyopic eyes mediate normal perception of dynamic plaid stimuli was surprising, as neural processing and perception of plaids has been closely linked to MT function. One intriguing potential explanation for this discrepancy is that the amblyopic eye recruits alternative visual brain areas to support plaid perception. This is the hypothesis that we tested. We used functional magnetic resonance imaging (fMRI) to measure the response of the amblyopic visual cortex and thalamus to incoherent and coherent motion of plaid stimuli that were perceived normally by the amblyopic eye. We found a different pattern of responses within the visual cortex when plaids were viewed by amblyopic as opposed to non-amblyopic eyes. The non-amblyopic eyes of amblyopes and control eyes differentially activated the hMT+ complex when viewing incoherent vs. coherent plaid motion, consistent with the notion that this region is centrally involved in plaid perception. However, for amblyopic eye viewing, hMT+ activation did not vary reliably with motion type. In a sub-set of our participants with amblyopia we were able to localize MT and MST within the larger hMT+ complex and found a lack of plaid motion selectivity in both sub-regions. The response of the pulvinar and ventral V3 to plaid stimuli also differed under amblyopic vs. non-amblyopic eye viewing conditions, however the response of these areas did vary according to motion type. These results indicate that while the perception of the plaid stimuli was constant for both amblyopic and non-amblyopic viewing, the network of neural areas that supported this perception was different.",
        "container-title": "NeuroImage",
        "DOI": "10.1016/j.neuroimage.2012.01.078",
        "ISSN": "1095-9572",
        "issue": "2",
        "journalAbbreviation": "Neuroimage",
        "language": "eng",
        "note": "PMID: 22285220",
        "page": "1307-1315",
        "source": "PubMed",
        "title": "Abnormal cortical processing of pattern motion in amblyopia: evidence from fMRI",
        "title-short": "Abnormal cortical processing of pattern motion in amblyopia",
        "volume": "60",
        "author": [
            "family": "Thompson",
            "given": "B."
            "family": "Villeneuve",
            "given": "M. Y."
            "family": "Casanova",
            "given": "C."
            "family": "Hess",
            "given": "R. F."
        "issued": {
          "date-parts": [
  "schema": "https://github.com/citation-style-language/schema/raw/master/csl-citation.json"

jgm commented 2 weeks ago

I just pushed a fix that will use the citationItem id in the bibliography, even if the itemData contains a different reference id. If that's wrong, we can change it.
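
The behavior described here can be illustrated outside pandoc. This is only a Python sketch of the idea, not pandoc's actual (Haskell) implementation: it assumes the citation has already been parsed into a dict, and `bibliography_entries` is a hypothetical name.

```python
import copy


def bibliography_entries(citation):
    """Collect itemData records, keyed by the citationItem id used in the text."""
    entries = {}
    for item in citation.get("citationItems", []):
        data = item.get("itemData")
        if data is None:
            continue
        entry = copy.deepcopy(data)  # leave the input untouched
        # Overwrite any differing itemData id with the citationItem id.
        entry["id"] = str(item["id"])
        entries[entry["id"]] = entry
    return list(entries.values())


citation = {
    "citationItems": [
        {
            "id": "Ml6QQcFl/5Udmir6l",
            "itemData": {"id": 2102, "type": "article-journal", "volume": "60"},
        }
    ]
}
entries = bibliography_entries(citation)
print(entries[0]["id"])  # Ml6QQcFl/5Udmir6l
```

With the ids unified this way, the in-text key `@Ml6QQcFl/5Udmir6l` and the bibliography entry refer to the same record.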

iandol commented 1 week ago

Thanks @jgm -- what happens when there is a citation-key, for example, this ref:

    "id": "uh2vLrAB/XwGHp8PL",
    "uris": [
    "itemData": {
        "id": 15691,
        "type": "book",
        "note": "Citation Key: dowling2017\npage: 136",
        "publisher": "International Retinal Research Foundation",
        "title": "Amblyopia: Challenges and opportunities",
        "volume": "The Lasker/IRRF Initiative for Innovation in Vision Science",
        "author": [
                "family": "Dowling",
                "given": "John E."
        "editor": [
                "family": "Dowling",
                "given": "John E."
        "issued": {
            "date-parts": [
        "citation-key": "dowling2017"

...has a top-level id, an itemData id, and an itemData citation-key. The citation-key comes from Better BibTeX and will be used by Zotero to output BibTeX, so I wonder if the order of precedence shouldn't be: itemData citation-key > itemData id > id? I normally use the Bookends reference manager, so my knowledge of Zotero is very limited...

jgm commented 1 week ago

We use id. If you wanted the other behavior you could use a filter to overwrite id with citation-key (which isn't even an official CSL JSON field, I believe).

iandol commented 1 week ago

Thank you as always!

iandol commented 1 week ago

Just FYI, I checked the schema for csl-data (which is what itemData is, IIUC) and there is a citation-key field:


  "citation-key": {
        "type": "string"

So this would be where the BibTeX key, if present, should be stored. Let me see if I can make a filter to make this replacement, as a workflow where the BibTeX key is used as an id is more flexible overall...
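
A rough Python sketch of such a replacement, working on already-converted output rather than as a real pandoc filter. It assumes the references have been loaded as a list of dicts that still carry the `citation-key` field, and that citations appear in the Markdown as plain `@id` keys; `rekey_with_citation_key` is a hypothetical name.

```python
import re

# Punctuation that pandoc allows singly inside a citation key.
_KEY_PUNCT = r":.#$%&\-+?<>~/"


def rekey_with_citation_key(references, markdown):
    """Replace each reference id with its citation-key and update @citations."""
    mapping = {}
    new_refs = []
    for ref in references:
        ref = dict(ref)  # shallow copy so the caller's data is not modified
        key = ref.get("citation-key")
        if key:
            mapping[str(ref["id"])] = key
            ref["id"] = key
        new_refs.append(ref)

    if mapping:
        # Longest ids first, so one id cannot match as a prefix of another.
        ids = sorted(mapping, key=len, reverse=True)
        alternatives = "|".join(re.escape(i) for i in ids)
        # The lookahead rejects matches that continue as a longer key.
        pattern = re.compile(
            r"@(" + alternatives + r")(?![" + _KEY_PUNCT + r"]?\w)"
        )
        markdown = pattern.sub(lambda m: "@" + mapping[m.group(1)], markdown)
    return new_refs, markdown


refs = [{"id": "uh2vLrAB/XwGHp8PL", "type": "book", "citation-key": "dowling2017"}]
text = "See [@uh2vLrAB/XwGHp8PL, p. 136]."
new_refs, new_text = rekey_with_citation_key(refs, text)
print(new_refs[0]["id"])  # dowling2017
print(new_text)           # See [@dowling2017, p. 136].
```

References without a `citation-key` keep their existing id. The sketch does not handle the `@{...}` key form or `@` signs in other contexts such as email addresses.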

jgm commented 1 week ago

It's not documented for the released version; perhaps it was added later.