jgm / pandoc

Universal markup converter
https://pandoc.org

Pandoc 2.11 adds HTML to rendered references when --to=markdown_strict #6921

Closed: dhimmel closed this issue 3 years ago

dhimmel commented 3 years ago

With pandoc 2.11.2 and the following pandoc command (run via bash):

pandoc --citeproc --to=markdown_strict --wrap=none <<< "
---
nocite: '@*'
csl: https://github.com/manubot/rootstock/raw/97b294802ffcd39071b6e5b8ab59f60faf4be118/build/assets/style.csl
references:
- id: f51SCNU1
  type: webpage
  title: test
...
"

outputs:

<span class="csl-left-margin">1. </span><span class="csl-right-inline">**test**</span>

Formerly, with pandoc 2.9.2.1 and --filter=pandoc-citeproc rather than --citeproc, the output was:

1\. **test**

We've been using markdown_strict for manubot cite markdown output because it did not include the HTML snippets. Is this regression intentional? Is there any way to specify markdown output without these HTML fragments added to the bibliography?

dhimmel commented 3 years ago

As for the numbered list change: 1. is preferable to 1\., so some parts of the new behavior are nice.

jgm commented 3 years ago

Raw inline HTML is part of original (strict) markdown. If you don't want it, though, you can specify -t markdown_strict-raw_html
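
For anyone calling pandoc from a script, the suggestion above could be wired up roughly as follows. This is a minimal sketch, assuming pandoc 2.11 or later is on the PATH; the helper names pandoc_args and render are hypothetical, and only flags quoted in this thread are used.

```python
import subprocess


def pandoc_args(strip_raw_html: bool = True) -> list:
    """Build the pandoc command line discussed in this thread (hypothetical helper)."""
    # Disabling the raw_html extension stops the writer from emitting <span> tags.
    target = "markdown_strict-raw_html" if strip_raw_html else "markdown_strict"
    return ["pandoc", "--citeproc", f"--to={target}", "--wrap=none"]


def render(markdown: str, strip_raw_html: bool = True) -> str:
    """Pipe markdown (with YAML metadata) through pandoc and return its output."""
    # Assumes a pandoc binary that supports --citeproc; raises if pandoc fails.
    result = subprocess.run(
        pandoc_args(strip_raw_html),
        input=markdown,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling render(source) with the YAML document from the original report should then produce the bibliography without the span wrappers, at the cost of the CSL display distinctions those spans carry.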

jgm commented 3 years ago

By the way, the reason those spans are there is to get proper CSL block-level formatting. If you strip it out or ignore it, you'll lose some distinctions that the style requires.

If you strip it out, you will once again get the escape in 1\., because otherwise this would be interpreted as a markdown ordered list item.

dhimmel commented 3 years ago

> Raw inline HTML is part of original (strict) markdown. If you don't want it, though, you can specify -t markdown_strict-raw_html

That removed the unwanted <span> elements. Thanks!

> By the way, the reason those spans are there is to get proper CSL block-level formatting

Ah yes. That is something else I've noticed with the citeproc migration. We no longer have line breaks between CSL blocks with our existing style. For example, the plain text output looks like:

2. Honey bee sting pain index by body locationMichael L SmithPeerJ (2014-04-03) https://doi.org/gfrfbmDOI: 10.7717/peerj.338 · PMID: 24765572 · PMCID: PMC3994616

Rather than:

2. Honey bee sting pain index by body location
Michael L Smith
PeerJ (2014-04-03) https://doi.org/gfrfbm
DOI: 10.7717/peerj.338 · PMID: 24765572 · PMCID: PMC3994616

Is this something where we need to update our CSL style? Let me know if I should open another issue describing this more clearly.

jgm commented 3 years ago

Yes, take a look at the CSS in the current pandoc default template.

dhimmel commented 3 years ago

> take a look at the CSS in the current pandoc default template

I found the following, which styles some of the CSL spans but not <csl-block>:

https://github.com/jgm/pandoc/blob/7199d68ba078148ff76a38f2c483da73edd62747/data/templates/styles.html#L156-L178

For HTML output, we could update our CSS to place csl-blocks on their own lines. But for --to=plain, --to=markdown_strict-raw_html, --to=docx, etc., editing the CSS won't have an effect, right? So does that mean it's no longer possible to have newlines between components of a reference in a way that applies to all output formats?

jgm commented 3 years ago

Note: in the AST, we represent the display styles using Spans, since the type is [Inline]. But the HTML writer will render these as divs, hence the rules for divs in the css.

No special style was added for 'block' because the default rendering of a div is fine for that.

But it looks as if for some reason we're not rendering the Span with class csl-block as a div in the HTML. I need to look into this. Example:

<div class="csl-left-margin">6. </div><div class="csl-right-inline"><strong>A6</strong> <span class="csl-block">John Doe</span> <em>Cambridge University Press</em> (2010) <a href="https://127.0.0.1/documents/Watson--paper.pdf">https://127.0.0.1/documents/Watson--paper.pdf</a></div>

To get the plain markdown output you want, you could use a filter that adds soft breaks before each Span with class csl-block -- or something like that. I might want to experiment with adding these soft breaks automatically for all formats, since this will produce nicer output outside of HTML/LaTeX.
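
The filter idea above could look something like the following. This is a rough sketch of a pandoc JSON filter in Python using only the standard library; the function names and the file name csl_softbreak.py are hypothetical, and it assumes the usual pandoc JSON encoding, in which a Span is {"t": "Span", "c": [[id, [classes], [key-value pairs]], [inlines]]}.

```python
import json
import sys


def is_csl_block(node) -> bool:
    """True for a Span element whose class list contains 'csl-block'."""
    if not (isinstance(node, dict) and node.get("t") == "Span"):
        return False
    attr = node["c"][0]  # Attr is [identifier, [classes], [[key, value], ...]]
    return "csl-block" in attr[1]


def add_soft_breaks(node):
    """Rebuild the AST, inserting a SoftBreak before each csl-block Span."""
    if isinstance(node, list):
        out = []
        for child in node:
            if is_csl_block(child):
                out.append({"t": "SoftBreak"})
            out.append(add_soft_breaks(child))
        return out
    if isinstance(node, dict):
        return {key: add_soft_breaks(value) for key, value in node.items()}
    return node  # strings, numbers, and other leaves pass through unchanged


if __name__ == "__main__" and not sys.stdin.isatty():
    raw = sys.stdin.read()
    if raw.strip():  # do nothing when no document is piped in
        json.dump(add_soft_breaks(json.loads(raw)), sys.stdout)
```

It would be run with something like pandoc --citeproc --filter=csl_softbreak.py ..., with --citeproc placed before --filter so that the bibliography already exists when the filter runs. Whether a soft break ends up as a newline in the output depends on the --wrap setting, so --wrap=preserve is probably needed in place of --wrap=none.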

jgm commented 3 years ago

OK, I see the bug in cslEntryToHTML (also ToLaTeX, ToDocx): it doesn't properly handle nested Spans with csl display attributes.

jgm commented 3 years ago

I've made some fixes to both HTML and LaTeX output; maybe you could try.

Btw, I'm not sure the way you're using the "block" display style is right; I think that after using the "block" for the author, you should include another block for the rest; otherwise the HTML doesn't look right. Maybe there's a way to fix this by changing CSS, I'm not sure.

jgm commented 3 years ago

I've added some newlines in the markdown output which should improve things for you.

dhimmel commented 3 years ago

> Btw, I'm not sure the way you're using the "block" display style is right; I think that after using the "block" for the author, you should include another block for the rest

Yes, we alternated block display for every other line since otherwise references were double spaced. See https://github.com/manubot/rootstock/pull/346#issuecomment-640865080 and https://github.com/manubot/rootstock/pull/134. But we should revisit our style for the new citeproc.

One thing we did is create a document with all combinations of CSL JSON fields in https://github.com/manubot/manubot/pull/110. Then we could render it for a given CSL style and check the formatting was as expected.

Will hopefully do this soon for an updated style and report back.
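
The test-document approach described above could be approximated like this. It is only a sketch: the field subset, placeholder values, and the function name reference_combinations are illustrative assumptions, not the contents of the manubot pull request.

```python
import itertools
import json

# Illustrative subset of CSL JSON fields with placeholder values (assumption).
OPTIONAL_FIELDS = {
    "author": [{"family": "Doe", "given": "Jane"}],
    "container-title": "Example Journal",
    "issued": {"date-parts": [[2020, 1, 1]]},
    "DOI": "10.1234/example",
}


def reference_combinations(fields=OPTIONAL_FIELDS):
    """Return one CSL JSON item for every subset of the optional fields."""
    items = []
    names = sorted(fields)
    for size in range(len(names) + 1):
        for combo in itertools.combinations(names, size):
            item = {
                "id": f"ref{len(items)}",
                "type": "article-journal",
                # The title records which fields are present, to ease visual checks.
                "title": " + ".join(combo) or "title only",
            }
            item.update({name: fields[name] for name in combo})
            items.append(item)
    return items


if __name__ == "__main__":
    print(json.dumps(reference_combinations(), indent=2))
```

The printed JSON could be saved and supplied as the references for a document with nocite: '@*', then rendered with a given CSL style to check each combination by eye.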