The Pandoc manual is sparse and even plain wrong about `header-includes`

stroobandt commented 4 years ago

pandoc 2.5
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.7.7

Recently, I found the Pandoc manual to be severely lacking, and especially confusing, on the subject of header-includes.

First of all, I was surprised to see that LaTeX macros defined in header-includes: (without any further {=latex} specification) also affect MathJax in HTML output. Reading the manual, one would think that LaTeX header-includes only affect LaTeX output. This is a useful feature, but it is not documented as such.

Inserting header-includes: inside a YAML metadata block inside the input document is easy enough. However, in my eternal quest to separate format from content, I wanted to achieve exactly the same using a makefile and an external file. I was hoping -H FILE would do that, as is suggested in the manual. That did not work. The Pandoc manual happens to be plain wrong about this!

After spending more time than intended trying out many more things, I was lucky to eventually run into Boilerplating Pandoc for Academic Writing. This article explains how easy it is to load header-includes from an external file by letting it precede the input file.
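A minimal sketch of that approach, assuming a hypothetical file named `header.yaml` and a hypothetical macro `\vect` (neither comes from the article). Pandoc concatenates its input files before parsing, so the metadata block below is read as part of the Markdown input:

```yaml
---
# header.yaml (hypothetical filename); pass it before the real input, e.g.
#   pandoc header.yaml paper.md -o paper.pdf
# The surrounding `---` lines make this a Markdown YAML metadata block.
header-includes: |
  \newcommand{\vect}[1]{\boldsymbol{#1}}
---
```

Because the file is parsed as Markdown rather than included literally, its content is subject to the same rules as a metadata block written inside the document itself.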

I also wrote up my findings in this [TeX StackExchange answer](https://tex.stackexchange.com/a/566707/26348).

the-solipsist commented 4 years ago

> That did not work. The Pandoc manual happens to be plain wrong about this!

Why? What went wrong? -H has previously worked for me (though I no longer use it).

Trial and error led me to figure out that a latex-headers.yaml file could be used as an input markdown file, in which case it would be parsed as markdown (not literal) text, but its content could be marked as LaTeX ({=latex}) using the raw_attribute extension. I describe this more fully in these two posts on pandoc-discuss:

https://groups.google.com/g/pandoc-discuss/c/8yHmWry4vv8/m/ImOTF2KPAwAJ (newer, more detailed)
https://groups.google.com/g/pandoc-discuss/c/qsPby_AnO1U/m/ti-ZA6ucCAAJ (older, less detailed, but provides context in which the problem arose for me)

P.S., since pandoc 2.8 (2019-11-22), you can use --defaults to help with the separation of format from content.
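A sketch of what such a defaults file could look like. The filenames are hypothetical, and the sketch assumes the `input-files`, `output-file`, and `include-in-header` keys, which mirror the corresponding command-line arguments and options:

```yaml
# defaults.yaml (hypothetical filename), used as:
#   pandoc --defaults defaults.yaml
input-files:
  - paper.md
output-file: paper.pdf
include-in-header:
  - preamble.tex   # literal LaTeX, inserted verbatim like -H
```

This keeps all formatting decisions out of paper.md, so the makefile only needs to reference the defaults file.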

mb21 commented 4 years ago

> When reading the manual, one would think LaTeX header-includes would only affect latex output.

Yeah, I can see how you might be led to believe that... maybe we should add a section where generic variables are listed, in addition to the "Variables for HTML", "Variables for LaTeX", etc. which we already have...

So yeah, if you could explain what was actually "plain wrong" in the manual, we should quickly be able to fix it. Pull requests are always welcome as well.

tarleb commented 4 years ago

This goes back to a discussion on a Lua filter PR. While I agree that the manual could be more explicit on this, I strongly disagree with the "plain wrong"; I thought I had explained the underlying issues in the linked discussion.

jgm commented 4 years ago

Maybe you could explain in what way -H didn't work? It should do exactly what is advertised, namely include literal content from the file in the document header. If you were expecting the content to be parsed as Markdown, then it would not work, but that is not what the documentation suggests, is it?

the-solipsist commented 4 years ago

A few notes on things a newbie (such as myself) might find less than clear in the manual:

"parsed as literal string text" vs. "parsed as raw content" vs. "parsed as markdown": For someone who doesn't already know the significance of these differences, just stating how something will be parsed doesn't mean much. (Most people will understand that mark-up like *text* won't work when something is "parsed as a literal string", but not necessarily what that means for LaTeX/HTML code, especially when used in pandoc markdown, which seems to understand LaTeX/HTML.) The same goes for "string scalars in the YAML file will always be parsed as Markdown". There is another possible source of confusion: "literal" in "The pipe character (|) can be used to begin an indented block that will be interpreted literally" seems to mean something slightly different from "literal" in "metadata values specified here are parsed as literal string text, not markdown".
"Raw content": In one part of the manual, it says: "Raw content to include in the document’s header may be specified using header-includes; however, it is important to mark up this content as raw code for a particular output format, using the raw_attribute extension, or it will be interpreted as markdown." There, the following example is provided:
```
header-includes:
- |
  ```{=latex}
```
However, in another part of the manual, the following example is provided:
```
header-includes: |
  \RedeclareSectionCommand[
```
There are two differences: (1) the use of the raw_attribute extension {=latex}, and (2) header-includes: | vs. header-includes:\n- |. It isn't clear whether these two differences are significant, and if so in what way, particularly given the impression provided by the manual that "it is important to mark up this content as raw code".
In one format or in all formats? Initially, I believed that if I included some HTML mark-up in a markdown file, I could convert that into whichever format I wanted (such as PDF). But the manual clarifies that raw HTML only gets converted into a few formats, not all. Similarly, for raw LaTeX, the manual states: "Inline LaTeX is ignored in output formats other than Markdown, LaTeX, Emacs Org mode, and ConTeXt." However, one then has to learn separately that raw LaTeX included in headers is included in all formats. (Note: This is clear to those who understand that "inline LaTeX" is different from LaTeX in headers. But people like me might not understand that unless it is pointed out in the section on headers.)

I hope that helps.

jgm commented 4 years ago

I think the underlying issue is that people don't really understand the conceptual difference between variables and metadata fields. This is a natural confusion, because variables get set automatically from metadata fields. header-includes as a metadata field gets parsed as Markdown (before the like-named variable gets set), while header-includes as a variable just gets passed through without any modification at all.

Another issue that trips people up is that metadata fields set in documents behave differently from metadata fields set on the command line or via defaults files. In the former case, they are parsed as Markdown (or whatever the document's format is); in the latter case, they are interpreted as plain text -- which is not the same as simply passing them through verbatim to the output, since the text may need escaping appropriate to the output format.

The model is thus somewhat complicated, and it's easy to get pretty far into using pandoc without understanding it.
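One way to picture the three routes, as a sketch based on the description above. The title text and the invocations in the comments are made up for illustration:

```yaml
---
# 1. Metadata field set in the document: parsed as Markdown, so the
#    asterisks below produce emphasis in the output.
title: "A *very* short title"
# 2. Metadata field set on the command line, e.g.
#      pandoc -s -M title='A *very* short title' paper.md
#    is treated as plain text: the asterisks stay literal and are
#    escaped as needed for the output format.
# 3. Variable set on the command line, e.g.
#      pandoc -s -V header-includes='\usepackage{xcolor}' paper.md
#    is passed through to the template without modification.
---
```

The same field name, header-includes, can therefore arrive by route 1 (as Markdown), route 2 (as plain text), or route 3 (verbatim), which is where much of the confusion in this thread comes from.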

stroobandt commented 4 years ago

First of all, my sincere apologies for the commotion caused. This was not my intention.

Here is a reconstruction of what happened:

@tarleb taught me how LaTeX macros in header-includes in a YAML block at the beginning of a document also have an effect on MathJax formulas in HTML output. That already came as a surprise to me.

In my eternal quest to separate format from content, I wanted to achieve the same by referencing an external .yaml file in my makefile. Hence, I consulted the manual.

Since there is no header-includes entry in the table of contents, I performed a Ctrl+F search on this term. This yielded four hits:

The first hit under "Variables for LaTeX" is a handy LaTeX example in a YAML block at the beginning of a document.
The last two hits under "Metadata blocks" constitute the actual definition of header-includes:

> Raw content to include in the document’s header may be specified using header-includes; however, it is important to mark up this content as raw code for a particular output format, using the raw_attribute extension, or it will be interpreted as markdown. For example: […]

It was the second hit, under "Variables set automatically", that profoundly confused me:

> Variables set automatically
>
> Pandoc sets these variables automatically in response to options or document contents; users can also modify them. These vary depending on the output format, and include the following: […] header-includes contents specified by -H/--include-in-header (may have multiple values)

Trying to include the LaTeX macros using -H is what did not work for me in the same way as it did with the YAML at the beginning of the Markdown document.

Now, I have to admit I was very tired when I first read that second hit. Reading it now, with hindsight and the knowledge of the comments posted above, I can see that either I made a logical error (which I probably did), or there is semantic overloading of the term header-includes, or both.

The variable header-includes being set automatically by -H and being user modifiable is definitely not the same as setting header-includes through -H. Nonetheless, the latter is what my tired mind was thinking at that moment.

Luckily, there was the external web article Boilerplating Pandoc for Academic Writing, which helped me on my way where the Pandoc manual did not.

I sincerely hope that this account of my erratic train of thought will contribute towards improving the manual on the subject of header-includes. After all, this is why I opened this issue.

mb21 commented 4 years ago

> I think the underlying issue is that people don't really understand the conceptual difference between variables and metadata fields.

Agreed, I think the only way to communicate that somewhat understandably is with a table...

stroobandt commented 4 years ago

@mb21 Your proposal goes a long way in clarifying things and highlighting the differences between the concepts of --variable and --metadata. The filter section also does a better job of promoting Lua filters, something I previously complained about in Lua filter issue #121.

A couple of remarks, though:

This may sound inessential, but in our Western left-right reading paradigm, I think your table would work even better if it were transposed.
pandoc -t native gives the impression of being some help function, similar to pandoc -h or pandoc -D FORMAT. pandoc -t native FILE would work better here.

However, there still remain a number of treacherous mind traps in the Pandoc manual:

subsubsection "Metadata variables"
subsubsection "Metadata blocks", which actually contains the definition of header-includes at the very end

In view of what has been discussed here, the titles of these two subsubsections in the manual are really unfortunate and add to the confusion. I would suggest renaming them "Bibliographic variables" and "Bibliographic blocks".

Furthermore, why does the actual definition of header-includes sit at the end of "Metadata blocks", when everything preceding it deals exclusively with bibliographic variables? This invites confusion.

Finally, a lesser issue is that the only information about --metadata remains somewhat hidden in the Reader options subsection, whereas an entire subsection is devoted to Variables. Perhaps --metadata also deserves a brief subsection, to put it on par with variables. That would help in underlining the differences between the two concepts.

jgm / pandoc

The Pandoc manual is sparse and even plain wrong about `header-includes` #6757
