jgm / pandoc

Universal markup converter
https://pandoc.org

Support Zotero citations in docx writer #9718

Open retorquere opened 4 months ago

retorquere commented 4 months ago

@jgm suggested I open the issue here (from a discussion at https://github.com/retorquere/zotero-better-bibtex/discussions/2862#discussioncomment-9297727).

proposed improvement and the problem it solves.

I currently produce Zotero-compatible citations during conversion from markdown to docx. pandoc can parse these with -f docx+citations; I'd love to be able to also produce them in a filter run while converting from markdown to docx. I currently produce them by replacing the Cite with a RawInline, but I'd much prefer to do it using AST manipulation (which would also save me the trouble of parsing locators and such).

Describe alternatives you've considered.

I've implemented it manually at https://retorque.re/zotero-better-bibtex/exporting/pandoc/index.html#from-markdown-to-zotero-live-citations

jgm commented 4 months ago

I think what you're asking for is for -t docx+citations to include Zotero citations.

There is a problem, though. -f docx+citations will currently handle citations of all sorts, including also EndNote and Mendeley. So we'd either have an odd asymmetry (if the writer just does Zotero) or we'd need to change the extensions so that there's a separate one for each type.

retorquere commented 4 months ago

I'd be OK if I can just manipulate the Cite nodes in the AST such that the result is Zotero-compatible citations. It'd be more of a filter API change than a command-line option change. As far as I can tell, generating live citations is going to require some sort of interaction with zotero/mendeley/etc, not something I see as a strong candidate for building into pandoc itself.

But maybe my current approach suffices. I'd just be happy to be rid of my own prefix/suffix parser when I know pandoc already does this. If just that could be exposed to filters, that'd simplify filters doing live citations a whole lot.

jgm commented 4 months ago

I'm not really clear on what you want. You can already manipulate Cite nodes in the AST, in Lua filters. What exactly do you think is missing in the Lua API?

generating live citations is going to require some sort of interaction with zotero/mendeley/etc

Not necessarily. Pandoc can use its own citeproc to generate the formatted citation; it would just be a matter of including the extra fields that link it to Zotero. I can't remember how all of this is done, but if you had a sample of a docx with Zotero citations, I could take a look.

retorquere commented 4 months ago

What exactly do you think is missing in the Lua API?

The only way I knew how was to replace the Cite using RawInline. I don't know of a way to modify/create AST nodes to achieve the same results. Plus, there's the parsing of the prefix/postfix/locator stuff, which in my script is currently a fragile heap of lpeg expressions. If just that could reuse the parsing that pandoc undoubtedly already has, that would be huge.

it would just be a matter of including the extra fields that link it to Zotero

That would be the item URI in the case of Zotero, but if your starting point is a Markdown document with citation keys, you won't have them available. It could be done if the user adds a references meta section I suppose. But I reckon users would prefer being able to skip that step.

(edit: the formatted citation would be nice but it isn't actually required to get live citations. My script doesn't)
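
For illustration, here is a minimal sketch of the kind of field instruction text that links a citation to Zotero. It assumes the field starts with `ADDIN ZOTERO_ITEM CSL_CITATION` followed by a JSON payload whose `citationItems` carry the item URI. That layout is an assumption to check against a real Zotero-produced docx, and the URI and the helper name `zotero_field_code` are hypothetical.

```python
import json

# Assumed prefix of the Word field instruction written for Zotero citations.
FIELD_PREFIX = "ADDIN ZOTERO_ITEM CSL_CITATION "


def zotero_field_code(citation_id, items):
    """Build field instruction text for one citation cluster.

    `items` is a list of dicts with a required "uri" key and optional
    "prefix", "suffix", "locator" and "label" keys. The key names in the
    payload are assumptions, not a verified specification.
    """
    citation_items = []
    for item in items:
        entry = {"uris": [item["uri"]]}
        for key in ("prefix", "suffix", "locator", "label"):
            if item.get(key):
                entry[key] = item[key]
        citation_items.append(entry)
    payload = {"citationID": citation_id, "citationItems": citation_items}
    return FIELD_PREFIX + json.dumps(payload)


# Hypothetical item URI; a real one would come from the Zotero library.
example = zotero_field_code(
    "cite-1",
    [{"uri": "http://zotero.org/users/0/items/EXAMPLE1",
      "prefix": "see", "locator": "33-35", "label": "page"}],
)
```

No formatted citation is included here, in line with the point that it is not required to get a live citation.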

jgm commented 4 months ago

The only way I knew how was to replace the Cite using RawInline. I don't know of a way to modify/create AST nodes to achieve the same results.

You can modify the AST node perfectly well. https://pandoc.org/lua-filters.html#type-citation It sounds like you're trying to produce a docx that includes embedded Zotero fields? If so, then this is about docx writer support as I suggested, not about the AST.

That would be the item URI in the case of Zotero, but if your starting point is a Markdown document with citation keys, you won't have them available.

That's true. It would require that the citation-key is defined in the database. So maybe this wouldn't be too useful in practice.

retorquere commented 4 months ago

You can modify the AST node perfectly well. https://pandoc.org/lua-filters.html#type-citation It sounds like you're trying to produce a docx that includes embedded Zotero fields? If so, then this is about docx writer support as I suggested, not about the AST.

I'm trying to describe the situation as best I can, but I'm very likely misunderstanding the technology and/or using the wrong terminology.

I am indeed trying to produce a docx that includes embedded Zotero fields. If the Lua filter can play a role in the docx writer (I'm not even sure this sentence is meaningful) and in that way avoid duplicating functionality already present in pandoc, that'd be great. If that doesn't fit your strategy for pandoc, my script works. It's just not very elegant.

That's true. It would require that the citation-key is defined in the database. So maybe this wouldn't be too useful in practice.

But it can be made easier to implement. A lot of the code I wrote in my script is going to be generic between implementations for mendeley, zotero, etc.

paul-kelleher commented 2 months ago

I've implemented it manually at https://retorque.re/zotero-better-bibtex/exporting/pandoc/index.html#from-markdown-to-zotero-live-citations

I am co-authoring with someone who wants to use Word, while I want to write in markdown and convert with Pandoc. We both want to manage citations with Zotero. I've been looking for a way to create a docx with pandoc that produces live Zotero citations, and you have produced a lua filter that does this exactly (so far as I can tell). Thank you!!!

iandol commented 2 months ago

Plus, there's the parsing of the prefix/postfix/locator stuff, which in my script is currently a fragile heap of lpeg expressions. If just that could reuse the parsing that pandoc undoubtedly already has, that would be huge.

@retorquere -- I admit I am also a bit confused by what you want. Currently citations from Pandoc are already parsed and available:

pandoc -t native

Blah blah [see @doe99, pp. 33-35 and *passim*; @smith04, chap. 1].

[ Para
    [ Str "Blah"
    , Space
    , Str "blah"
    , Space
    , Cite
        [ Citation
            { citationId = "doe99"
            , citationPrefix = [ Str "see" ]
            , citationSuffix =
                [ Str ","
                , Space
                , Str "pp.\160\&33-35"
                , Space
                , Str "and"
                , Space
                , Emph [ Str "passim" ]
                ]
            , citationMode = NormalCitation
            , citationNoteNum = 1
            , citationHash = 0
            }
        , Citation
            { citationId = "smith04"
            , citationPrefix = []
            , citationSuffix = [ Str "," , Space , Str "chap.\160\&1" ]
            , citationMode = NormalCitation
            , citationNoteNum = 1
            , citationHash = 0
            }
        ]
        [ Str "[see"
        , Space
        , Str "@doe99,"
        , Space
        , Str "pp."
        , Space
        , Str "33-35"
        , Space
        , Str "and"
        , Space
        , Str "*passim*;"
        , Space
        , Str "@smith04,"
        , Space
        , Str "chap."
        , Space
        , Str "1]"
        ]
    , Str "."
    ]
]

The prefix, suffix, and id are all available in the AST. The AST is just that, abstract; it is not responsible for document-specific syntax, which is the job of the writer. What I think you are saying is that you want some way to wrap the AST nodes so that they become the correct XML fields for ODT/DOCX for Zotero. Perhaps you could show what the markdown is and what the desired final XML output is. It should be possible for the ODT/DOCX writer to generate this XML, but as @jgm mentioned, won't this be different for different bibliography tools? This is the annoying cost of "live" citations: there is no standard, no consistency, and so each tool has one or more ways of doing it that are specific to each document format. The ideal would be for Bookends/Zotero/Mendeley/Papers to standardise on a single representation for ODT / DOCX...
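
The same fields can be read from the JSON form of the AST (`pandoc -t json`). Here is a minimal sketch, assuming the usual JSON shape in which a `Cite` node holds a list of citation objects followed by the fallback inlines. The sample document is hand-written and abbreviated, and the helper names are only for illustration.

```python
def plain_text(inlines):
    """Flatten a few common inline node types to plain text."""
    parts = []
    for node in inlines:
        kind = node["t"]
        if kind == "Str":
            parts.append(node["c"])
        elif kind in ("Space", "SoftBreak"):
            parts.append(" ")
        elif kind in ("Emph", "Strong"):
            parts.append(plain_text(node["c"]))
    return "".join(parts)


def iter_citations(node):
    """Yield every citation object found under `node`."""
    if isinstance(node, dict):
        if node.get("t") == "Cite":
            citations, _fallback = node["c"]
            yield from citations
        else:
            for value in node.values():
                yield from iter_citations(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_citations(item)


# Hand-written, abbreviated stand-in for `pandoc -t json` output.
doc = {"meta": {}, "blocks": [{"t": "Para", "c": [
    {"t": "Str", "c": "Blah"}, {"t": "Space"},
    {"t": "Cite", "c": [
        [{"citationId": "doe99",
          "citationPrefix": [{"t": "Str", "c": "see"}],
          "citationSuffix": [{"t": "Str", "c": ","}, {"t": "Space"},
                             {"t": "Str", "c": "pp.\u00a033-35"}, {"t": "Space"},
                             {"t": "Str", "c": "and"}, {"t": "Space"},
                             {"t": "Emph", "c": [{"t": "Str", "c": "passim"}]}],
          "citationMode": {"t": "NormalCitation"},
          "citationNoteNum": 1, "citationHash": 0},
         {"citationId": "smith04",
          "citationPrefix": [],
          "citationSuffix": [{"t": "Str", "c": ","}, {"t": "Space"},
                             {"t": "Str", "c": "chap.\u00a01"}],
          "citationMode": {"t": "NormalCitation"},
          "citationNoteNum": 1, "citationHash": 0}],
        [{"t": "Str", "c": "[fallback text]"}]]},
    {"t": "Str", "c": "."}]}]}

# One (id, prefix, suffix) tuple per citation, in document order.
rows = [(c["citationId"],
         plain_text(c["citationPrefix"]),
         plain_text(c["citationSuffix"]))
        for c in iter_citations(doc)]
```

Note that the suffix still arrives as one run of inlines; the locator is not separated out at this level.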

retorquere commented 2 months ago

To be clear, I'm not saying there's anything you must do, but in this example, citationSuffix can be broken down further in the locator and its value, and since pandoc uses a CSL processor, I'm assuming it is in fact doing this breakdown somewhere; it would be useful to me if that functionality were available in lua filters, since I'm currently doing it myself with a bunch of lpeg expressions that I would love to do without. This isn't document-specific, I think, it is CSL-specific. But the lpeg expressions work, so there isn't anything I am unable to do right now. I just don't much like lpeg, it was a bear to get them to work, and I dread the day that I would have to diagnose a bug that touches these.
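
To make the breakdown concrete, here is a rough sketch of splitting a flattened suffix into label, locator, and remaining suffix. It is a simple regex heuristic, not pandoc's actual algorithm: the label table is a small hypothetical subset of the CSL locator terms, and only plain numbers and numeric ranges are recognised as locators.

```python
import re

# Hypothetical, deliberately tiny label table (abbreviation -> CSL label).
LABELS = {
    "p.": "page", "pp.": "page", "page": "page", "pages": "page",
    "chap.": "chapter", "chapter": "chapter",
    "sec.": "section", "section": "section",
    "vol.": "volume", "volume": "volume",
}

LOCATOR_RE = re.compile(
    r"^[,\s]*"                              # leading comma and whitespace
    r"(?:(?P<label>[A-Za-z]+\.?)\s+)?"      # optional label, e.g. "pp."
    r"(?P<locator>\d+(?:\s*[-\u2013]\s*\d+)?)"  # a number or a numeric range
    r"(?P<rest>.*)$",                       # whatever is left is the suffix
    re.DOTALL,
)


def split_suffix(suffix):
    """Split a suffix string into label, locator and remaining suffix.

    Returns the suffix unchanged (with label and locator set to None)
    when no locator is recognised.
    """
    match = LOCATOR_RE.match(suffix)
    if match:
        raw_label = match.group("label")
        if raw_label is None:
            label = "page"  # assumption: a bare number is a page locator
        else:
            label = LABELS.get(raw_label.lower())
        if label is not None:
            return {"label": label,
                    "locator": match.group("locator"),
                    "suffix": match.group("rest")}
    return {"label": None, "locator": None, "suffix": suffix}


first = split_suffix(", pp.\u00a033-35 and passim")
second = split_suffix(", chap.\u00a01")
third = split_suffix(" and elsewhere")
```

Python's `\s` matches the non-breaking space that pandoc inserts after "pp." and "chap.", so the sample suffixes from the native output above can be passed in as-is once flattened to text.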

The earlier discussion was indeed about a way to wrap AST nodes so that they would result in correct XML for ODT/DOCX for Zotero, but that is easily achieved with raw XML output; it was more curiosity than anything else. It's not in any way a priority to me.

jgm commented 2 months ago

We could consider exposing the code that separates the locator, locator label, and remaining suffix, but that's getting pretty specialized.

You could always use citeproc in pandoc.utils to render the citations using a specially crafted CSL file that just gives you the label, the locator, and the suffix. This should already work.

retorquere commented 2 months ago

That's interesting. I could use that in the Lua filter to break out these parts?

paul-kelleher commented 2 months ago

@retorquere I am not savvy enough to be able to follow the technical discussion here. Since your lua filter seems to work great for my use case, I was wondering if there was a nontechnical statement of what other functionality you were wanting but not able to get out of the lua filter itself?

iandol commented 2 months ago

That's interesting. I could use that in the Lua filter to break out these parts?

https://pandoc.org/lua-filters.html#pandoc.utils.citeproc -- this returns the whole processed document which would need some parsing, but may be easier than the current lpeg filters?

jgm commented 2 months ago

Well, my thought was that you could construct a document consisting of just the one citation you're interested in, and run citeproc on it using a special CSL style that just prints the label, locator, and suffix, separated by newlines. Then you can extract this information from the result. Probably not a very good way to do it, but possible.

retorquere commented 2 months ago

https://pandoc.org/lua-filters.html#pandoc.utils.citeproc -- this returns the whole processed document which would need some parsing, but may be easier than the current lpeg filters?

This returns the rendered doc in the output format the filter is currently running under I suppose? Would that be docx (a zipped xml file) in my case?

retorquere commented 2 months ago

Well, my thought was that you could construct a document consisting of just the one citation you're interested in, and run citeproc on it using a special CSL style that just prints the label, locator, and suffix, separated by newlines. Then you can extract this information from the result. Probably not a very good way to do it, but possible.

Interesting enough to try. Can I force the output format of the csl call to markdown? That'd be easier to parse.

retorquere commented 2 months ago

@retorquere I am not savvy enough to be able to follow the technical discussion here. Since your lua filter seems to work great for my use case, I was wondering if there was a nontechnical statement of what other functionality you were wanting but not able to get out of the lua filter itself?

No extra functionality, it'd just replace part of the filter with something (hopefully) less complex, technically. Outwardly you wouldn't be able to tell the difference.

jgm commented 2 months ago

Interesting enough to try. Can I force the output format of the csl call to markdown?

Calling citeproc will just give you a new Pandoc structure. You can render it to markdown if you like, but it is already parsed, in effect...

retorquere commented 2 months ago

Ah cool!