jgm / pandoc

Universal markup converter
https://pandoc.org

Some write formats render citations, some do not. #4834

Closed dhimmel closed 6 years ago

dhimmel commented 6 years ago

I've been working on https://github.com/greenelab/manubot/issues/48, where we are using Pandoc to create a reference list from CSL JSON Data and a CSL XML Style. We are passing stdin to Pandoc like the following:

---
{
  "nocite": "@*",
  "csl": "https://github.com/greenelab/manubot-rootstock/raw/master/build/assets/style.csl",
  "references": [
    {
      "archive_location": "1806.05726v1",
      "version": "1",
      "URL": "https://arxiv.org/abs/1806.05726v1",
      "title": "Generalization of the Fermi Pseudopotential",
      "issued": {
        "date-parts": [
          [
            2018,
            6,
            14
          ]
        ]
      },
      "author": [
        {
          "literal": "Trang T. Le"
        },
        {
          "literal": "Zach Osman"
        },
        {
          "literal": "D. K. Watson"
        },
        {
          "literal": "Martin Dunn"
        },
        {
          "literal": "B. A. McKinney"
        }
      ],
      "container-title": "arXiv",
      "type": "report",
      "id": "10GH4uYUR"
    }
  ]
}
...

Locally, I've saved this text in a file named header.yaml. When I run pandoc with different --to options, sometimes the citations are rendered, sometimes they are not. For example, the following commands produce a blank output:

cat header.yaml | pandoc --filter pandoc-citeproc --to=markdown
cat header.yaml | pandoc --filter pandoc-citeproc --to=jats

However, other options render the citations. For example, here is the output with --to=plain

1. GENERALIZATION OF THE FERMI PSEUDOPOTENTIAL
Trang T. Le, Zach Osman, D. K. Watson, Martin Dunn, B. A. McKinney
_arXiv_ (2018-06-14) https://arxiv.org/abs/1806.05726v1

Output with --to=markdown_strict

1. **Generalization of the Fermi Pseudopotential**  
Trang T. Le, Zach Osman, D. K. Watson, Martin Dunn, B. A. McKinney  
*arXiv* (2018-06-14) <https://arxiv.org/abs/1806.05726v1>

I'm using pandoc 2.2.1. I didn't see anything in the documentation about which output formats render citations. Is there an a priori way to know whether a format supports citation rendering? Is there a way to force a format to render the citations, even if it doesn't by default?

jgm commented 6 years ago

All output formats work with citations.

You're not seeing anything for -t markdown, because you aren't using -s (standalone). If you use -s, you'll get the metadata rendered, and you'll see this:

---
csl: 'https://github.com/greenelab/manubot-rootstock/raw/master/build/assets/style.csl'
nocite: '@*'
references:
- URL: 'https://arxiv.org/abs/1806.05726v1'
  archive_location: '1806.05726v1'
  author:
  - literal: 'Trang T. Le'
  - literal: Zach Osman
  - literal: 'D. K. Watson'
  - literal: Martin Dunn
  - literal: 'B. A. McKinney'
  container-title: arXiv
  id: 10GH4uYUR
  issued:
    date-parts:
    - - 2018
      - 6
      - 14
  title: Generalization of the Fermi Pseudopotential
  type: report
  version: 1
---

This is a direct markdown translation of your input. If instead you'd like rendered markdown citations, you need to tell pandoc to disable the citations extension for the output markdown, by using -t markdown-citations. With +citations (the default), pandoc assumes you're targeting a markdown format that supports pandoc-style citations, and it will just use those in the output instead of rendering them. (This also explains why markdown_strict gives you rendered citations: markdown_strict doesn't enable the citations extension.)

With jats, the issue can also be resolved by adding -s. For jats, we can't just tag the bibliography onto the body, as we do in other formats, since it has to be outside the <body> element in a special <back> element. So we pass the bibliography to the template. If you don't use -s, the template is bypassed.

Hope that explains what you're seeing. Closing, because not a bug.
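
For anyone wrapping pandoc in a script, the two rules above (disable the citations extension for markdown output, add -s for jats) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `citation_rendering_args` is hypothetical, and only the formats discussed in this thread are special-cased.

```python
def citation_rendering_args(output_format):
    """Build pandoc arguments intended to yield a rendered reference list.

    Hypothetical helper: it only encodes the two cases explained above.
    """
    args = ["--filter", "pandoc-citeproc"]
    if output_format == "markdown":
        # Disable the citations extension so references are rendered
        # instead of being kept as pandoc-style citations.
        args.append("--to=markdown-citations")
    else:
        args.append(f"--to={output_format}")
    if output_format == "jats":
        # For JATS the bibliography is passed to the template, and the
        # template is only used in standalone mode.
        args.append("--standalone")
    return args
```

The returned list can be appended to `["pandoc"]` and passed to `subprocess.run`, with the YAML header supplied on stdin.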

dhimmel commented 6 years ago

Thanks. So if I understand correctly, appending -citations to any format name would help me here. It would either have no effect (in case -citations was the default for that format), or would cause the citations to be rendered. In some cases, an additional configuration such as --standalone for JATS is required.

Is it possible to specify --to=-citations? Basically, say I want the output format to be inferred from the filename extension passed to --output, but for citations to be disabled as per -citations?

I guess what I'm deciding between is whether to give users of our command line utility access to all Pandoc output formats as options, or whether the steps to get citations to render are finicky enough that I should just hardcode the Pandoc options for a few output formats such as docx, markdown, jats, and plain text.

One final question is with regards to "--to=plain". It seems to have converted the title to ALL CAPS, which is perhaps its attempt at bold? Italics were retained in markdown format, like _arXiv_. Is there an option to create plain text that would be as if you copied rich text and pasted it into a plain text field? For example:

1. Generalization of the Fermi Pseudopotential
Trang T. Le, Zach Osman, D. K. Watson, Martin Dunn, B. A. McKinney  
arXiv (2018-06-14) https://arxiv.org/abs/1806.05726v1

jgm commented 6 years ago

Daniel Himmelstein notifications@github.com writes:

> Thanks. So if I understand correctly, appending -citations to any format name would help me here. It would either have no effect (in case -citations was the default for that format), or would cause the citations to be rendered. In some cases, an additional configuration such as --standalone for JATS is required.
>
> Is it possible to specify --to=-citations? Basically, say I want the output format to be inferred from the filename extension passed to --output, but for citations to be disabled as per -citations?

No, not possible.

> I guess what I'm deciding between is whether to give users of our command line utility access to all Pandoc output formats as options, or whether the steps to get citations to render are finicky enough that I should just hardcode the Pandoc options for a few output formats such as docx, markdown, jats, and plain text.

You don't need to use --output at all; you can just have your utility (and pandoc) output to stdout.

> One final question is with regards to "--to=plain". It seems to have converted the title to ALL CAPS, which is perhaps its attempt at bold? Italics were retained in markdown format, like _arXiv_. Is there an option to create plain text that would be as if you copied rich text and pasted it into a plain text field? For example:
>
> 1. Generalization of the Fermi Pseudopotential
> Trang T. Le, Zach Osman, D. K. Watson, Martin Dunn, B. A. McKinney
> arXiv (2018-06-14) https://arxiv.org/abs/1806.05726v1

Yes, plain (which mimics Project Gutenberg conventions) does use ALLCAPS for emphasis. If you don't want that, then one option is to use a pandoc filter to change all the Emph and Strong elements into plain text. This Lua filter would do it:

-- Replace each emphasized or strong span with its inline contents.
function Emph(el)
  return el.content
end

function Strong(el)
  return el.content
end
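
For those who post-process pandoc's JSON output from Python rather than writing Lua, the same unwrapping idea can be sketched as below. It assumes the usual JSON AST shape in which Emph and Strong nodes look like `{"t": "Emph", "c": [...inlines...]}`. The function name `strip_emphasis` is hypothetical, and the sketch only handles a list of inline nodes; it does not descend into other containers such as links.

```python
def strip_emphasis(inlines):
    """Replace Emph and Strong nodes with their contents, recursively.

    `inlines` is a list of inline nodes in pandoc's JSON AST form.
    """
    result = []
    for node in inlines:
        if node.get("t") in ("Emph", "Strong"):
            # Splice the children in place of the formatting node.
            result.extend(strip_emphasis(node["c"]))
        else:
            result.append(node)
    return result
```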

dhimmel commented 6 years ago

Thanks @jgm. https://github.com/greenelab/manubot/pull/51 is now merged! We ended up hardcoding the output options because each option required a slightly different combination of arguments and minimum pandoc version to succeed.

For example, when --to=plain, we use the suggested --lua-filter but only if pandoc version >= 2.

For some things, we did not know exactly what the minimum version required was. For example, in older versions of pandoc/pandoc-citeproc:

  1. specifying --csl=URL seemed to fail
  2. HTML files were blank besides a single <div>: https://github.com/greenelab/manubot/pull/51#pullrequestreview-147428216

Ideally, we want to provide an error message to the user if they attempt to use a feature that is not supported by their version of pandoc. Not sure if there's an easy way to find when a feature changed or was added. Combing through the changelog proved to be a bit challenging (as it contains so many changes!).
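
One possible way to produce such an error message is to parse the first line of `pandoc --version` (which looks like `pandoc 2.2.1`) and compare it against a table of minimum versions. The sketch below is an assumption-laden illustration, not what manubot actually does: `MINIMUM_VERSIONS`, `parse_pandoc_version`, and `require_feature` are hypothetical names, and the only table entry is the "pandoc version >= 2" requirement for --lua-filter mentioned above.

```python
import re

# Hypothetical table: the single entry reflects the ">= 2" requirement
# for --lua-filter stated in this thread. Other minimums are unknown.
MINIMUM_VERSIONS = {"lua-filter": (2,)}


def parse_pandoc_version(version_output):
    """Extract a version tuple from the output of `pandoc --version`."""
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    match = re.match(r"pandoc(?:\.exe)?\s+(\d+(?:\.\d+)*)", first_line)
    if match is None:
        raise ValueError(f"unrecognized version line: {first_line!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def require_feature(feature, version):
    """Raise a readable error if `version` is too old for `feature`."""
    minimum = MINIMUM_VERSIONS[feature]
    if version < minimum:
        wanted = ".".join(str(part) for part in minimum)
        found = ".".join(str(part) for part in version)
        raise RuntimeError(
            f"{feature} requires pandoc >= {wanted}, but found {found}"
        )
```

In practice the version text could come from `subprocess.run(["pandoc", "--version"], capture_output=True, text=True).stdout`.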