jgm / pandoc

Universal markup converter
https://pandoc.org

Specify command-line options using YAML metadata #4627

Closed mb21 closed 3 years ago

mb21 commented 6 years ago

I'm creating this issue to close the more specific ones that fall into the category of "can I specify command-line option X using YAML metadata?"

You could write a bash script, or use one of the following third-party tools that build on top of pandoc (and already implement the approach described in the comments below):

Update

Pandoc now supports a --defaults option, which can be used as follows to specify command-line options from a file. If e.g. input.md contains:

---
standalone: true
...

# my title

rest of my document

You can call pandoc as follows:

pandoc --defaults input.md input.md

Yes, currently you have to specify the file twice: once for the --defaults option to read out the YAML, and once as the markdown input file as usual. There's a follow-up issue for this.
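
To avoid typing the file name twice, the invocation above can be wrapped in a small helper. This is only a sketch: the function names `defaults_argv` and `run` are made up for illustration, and the only pandoc feature assumed is the `--defaults` option described above.

```python
import subprocess


def defaults_argv(path, extra_args=()):
    """Build a pandoc command that reads options and content from one file.

    The same path appears twice on purpose: once as the argument of
    --defaults and once as the ordinary input file.
    """
    return ["pandoc", "--defaults", path, *extra_args, path]


def run(path, extra_args=()):
    # Requires pandoc on PATH; raises CalledProcessError if pandoc fails.
    subprocess.run(defaults_argv(path, extra_args), check=True)
```

Calling `run("input.md")` would then be equivalent to typing `pandoc --defaults input.md input.md`.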

mb21 commented 6 years ago

We could discuss whether it's worth implementing this in pandoc itself. Possibly with a syntax like the following:

---
options_:
  - reference-doc: mydoc.docx
  - template: |
      `mytemplate.tex`{=latex}
      `mytemplate.html`{=html}
---

The syntax has to be valid YAML (therefore we need the | or potentially quotes around some values), and pandoc interprets the values as markdown (therefore we sometimes might have to wrap them in backticked code-spans to prevent nasty surprises).

The question is whether this approach is not more trouble than it's worth.

jtkiley commented 6 years ago

+1

I use pandoc for a lot of things, but one set is producing brief write ups, letters, and envelopes. These are quick one-off documents, but it's nice to have them formatted consistently and nicely. Currently, I specify what I can in YAML and use a custom template and engine on the command line. That's inconvenient for one-off documents, compared to an academic paper or something with an ongoing set of revisions (and, presumably, a Makefile).

For my use, the ideal scenario would be this:

  1. Set everything (including output type) in YAML. This would basically be part of a template.
  2. Run a command as simple as pandoc document.markdown on the command line (or, better, using script in Atom).
  3. Done.

A bonus would be a way of specifying that the output filename should be the same with a different extension (e.g., document.markdown makes document.pdf without needing to specify the literal name document in YAML in each file).

iandol commented 6 years ago

pandocomatic and panzer already handle this, I think with much more flexibility (within-document settings that combine with or override a cross-document YAML set of defaults), but I imagine that for simple uses this would certainly be used by some users who do not want to install any additional tools...

jgm commented 6 years ago

Jason Kiley notifications@github.com writes:

  1. Set everything (including output type) in YAML. This would basically be part of a template.

If I understand the suggestion correctly, it's that the template syntax be expanded so that default values for command-line options can be specified in the template itself. Then you could do, e.g.

pandoc --template letter document.md

and it would use the appropriate template and settings, perhaps also producing an appropriately named output file as specified in the amplified template.

That's an interesting idea, and it makes much more sense to me than putting these option settings in the document's metadata. (The whole point of pandoc is that you can convert a single document to different formats, in different ways -- so including instructions about this in the document itself seems wrong.)

Alternatively, we could introduce the idea of an "options" file (which could also be put in your user data directory, as templates can). For example, the contents of ~/.pandoc/options/letter might be:

--template=letter --to=pdf --pdf-engine=xelatex --output="%.pdf"

Then you could do

pandoc --options letter myletter.md

and it would produce myletter.pdf using the template letter.latex. This gives a cleaner separation of concerns than putting this in the template, and the same ease of use.

(For a long time I've used a small shell script that does essentially the above, and this is still what I'd recommend for uses like yours, but creating this --options option would make things easier and not require users to create shell scripts.)
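
The options-file idea above could be prototyped outside pandoc with a few lines of Python. This is a sketch under stated assumptions: pandoc has no `--options` flag and no `%` placeholder, so both are part of the proposal only, and the helper names `expand_options` and `build_command` are hypothetical. The sketch assumes `%` stands for the input file name without its extension.

```python
import shlex
from pathlib import Path


def expand_options(options_text, input_path):
    """Split an options line into words and fill in the '%' placeholder.

    Assumption: '%' stands for the input file name without its extension,
    as in the --output="%.pdf" example from the proposal.
    """
    stem = Path(input_path).stem
    return [word.replace("%", stem) for word in shlex.split(options_text)]


def build_command(options_text, input_path):
    # The expanded options go first, followed by the input file.
    return ["pandoc", *expand_options(options_text, input_path), input_path]
```

With the contents of the `letter` options file shown above and `myletter.md` as input, the `--output` word becomes `--output=myletter.pdf`.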

jtkiley commented 6 years ago

@iandol: Thanks! I'll give those a look.

@jgm: I suppose I was thinking of templates in two different senses. One is a markdown file that I would copy and use to create a new document. That's what I meant above. I use that pattern for things like envelopes where my envelope.tex template (the second kind) is expecting certain variable names to come in from the markdown file (which, incidentally, only has YAML content). For my use, it would be practically hard to eliminate the markdown template, as I'd have to memorize all of my variable names. Similarly, my letter and paper markdown templates include a number of YAML variables (controlling things like signature images). With that in mind, I was thinking of specifying command line options in that markdown template.

Your response also brings up an interesting difference between the use you design for and how I actually use it. You're suggesting a one-to-many relationship where a given document will routinely be converted to different output formats. My use is almost entirely one-to-one in that a given document is destined to end up in only one output format (though what format that is differs by document). That's why the command line options in the markdown file make more sense to me (but perhaps not to you): I almost never do anything other than a single output format for any given document.

So, perhaps the root of my suggestion is making the one-to-one workflow a first class use case.

I should note that the options idea would seemingly work pretty well on one computer, but I think it would be more complex for multiple computers (e.g., dotfiles repository, symlinks, Dropbox, or some combination), and it would make sharing the input document harder than just sending over a markdown file and a .tex template for someone else to edit/run. But, then again, I'm assuming my one-to-one workflow.

Sorry for the wall of text, and thanks again for thinking about this. It would be a big improvement for what I do.

jgm commented 6 years ago

Jason Kiley notifications@github.com writes:

@jgm: I suppose I was thinking of templates in two different senses. One is a markdown file that I would copy and use to create a new document. That's what I meant above. I use that pattern for things like envelopes where my envelope.tex template (the second kind) is expecting certain variable names to come in from the markdown file (which, incidentally, only has YAML content). For my use, it would be practically hard to eliminate the markdown template, as I'd have to memorize all of my variable names.

Since you can set variables from the command line, these could be set with the 'options' files I was proposing.

Similarly, my letter and paper markdown templates include a number of YAML variables (controlling things like signature images). With that in mind, I was thinking of specifying command line options in that markdown template.

Yes, all of this can be set on the command line as well, and hence could go into an 'options' default.

I should note that the options idea would seemingly work pretty well on one computer, but I think it would be more complex for multiple computers (e.g., dotfiles repository, symlinks, Dropbox, or some combination), and it would make sharing the input document harder than just sending over a markdown file and a .tex template for someone else to edit/run. But, then again, I'm assuming my one-to-one workflow.

There's a basic conceptual problem with putting all of this option stuff in the md file itself: we need at least some options to be settled before we even know how to read the file. We have to know that the input format is markdown and that yaml_metadata is an enabled extension.

That's why it makes more sense to me to have a separate options file.

mb21 commented 6 years ago

If I understand the use-case of @jtkiley correctly, it's exactly about bundling everything (including the options) in one single, portable, file. Which is exactly what panzer does:

panzer adds styles to pandoc. Styles provide a way to set all options for a pandoc document with one line (‘I want this document be an article/CV/notes/letter’).

You can think of styles as a level up in abstraction from a pandoc template. Styles are combinations of templates, metadata settings, pandoc command line options, and instructions to run filters, scripts and postprocessors. These settings can be customised on a per writer and per document basis. Styles can be combined and can bear inheritance relations to each other. panzer exposes a large amount of structured information to the external processes called by styles, allowing those processes to be both more powerful and themselves controllable via metadata (and hence also by styles). Styles simplify makefiles, bundling everything related to the look of the document in one place.

[...]

Styles are defined in a yaml file (example). The style definition file, plus associated executables, are placed in the .panzer directory in the user’s home folder (example).

A style can also be defined inside the document’s metadata block:

I'm guessing some people use make-files for this. But if you're coming from the world of GUIs and word processors, it would sound simpler to bundle up everything in one file and then run the export-to-PDF and export-to-HTML commands in your editor (say, Atom), and it would read all the options from the file metadata.

jtkiley commented 6 years ago

@mb21: Yeah, the ease and portability are a big part of it. That said, I'm going to rework some of my stuff using panzer to try it out. It looks like it would cover a lot of my individual friction points.

I do use Makefiles for my heavily-edited, version-controlled documents (usually academic papers), but I have plenty of things that are either one-off or at least more casual. It would be nice for those things (all of the templating included) to sync around to different computers easily and be easy to distribute to others. I can personally manage the complexity, but it does make collaboration harder, especially with people who typically use GUIs/Word (to be fair, nearly everyone else in my field). There's a payoff in automating low value-added work like citations or document-level presentation, but there's a complexity cost in installing, setting up, and using a workflow like this, and it would be nice (from my perspective) to put a dent in those costs.

It'll probably be a few days, but I'll circle back here once I try panzer. I know it's an n of 1, but do let me know if some specific examples would help. I can share some when I have a chance to dig in with panzer.

jgm commented 6 years ago

Mauro Bieg notifications@github.com writes:

If I understand the use-case of @jtkiley correctly, it's exactly about bundling everything (including the options) in one single, portable, file. Which is exactly what panzer does:

Yes, I understand that. I'm not too keen on building that into pandoc, for reasons given. But the 'options' idea I floated above still seems worth while to me. You could define packages of options for common uses, e.g. letters, and use those easily for one-off documents. True, you'd have to remember more than just pandoc -- you'd have to tell it to process the thing as a letter or whatever -- but that seems ok to me. I'd be curious whether anyone thinks this would be a useful feature.

jtkiley commented 6 years ago

I tried panzer, and it's not really helpful for my case. First, of the three options I'd most like to specify inside the file (i.e. pdf-engine, template, and output), it only supports pdf-engine. So, I wouldn't really be saving much on the command line, and it wouldn't help with the friction with one-off documents, as I can't see a good way of automating running that command.

The options route would be a start, but it doesn't seem to help with automation. The really awesome outcome for me would be setting up a grammar for Atom using script. Then it's just a keyboard shortcut to produce a PDF, regardless of type.

I know that you have reasons for not wanting it in pandoc (though I do still hope to persuade you otherwise), but I really wish there were a way to streamline these kinds of uses. For things like envelopes, 90 percent of the work is making the PDF, not entering the address. It seems like that shouldn't be the case, whether it's supported within pandoc itself or something external.

jgm commented 6 years ago

Jason Kiley notifications@github.com writes:

The options route would be a start, but it doesn't seem to help with automation. The really awesome outcome for me would be setting up a grammar for Atom using script. Then it's just a keyboard shortcut to produce a PDF, regardless of type.

If you're happy writing a script, then there's no problem as things stand. You can write a script to call pandoc with exactly the options you want. Why not just write this atom script and be happy with an awesome outcome?

mb21 commented 5 years ago

I’ve found myself coming back to this issue.

There's a basic conceptual problem with putting all of this option stuff in the md file itself: we need at least some options to be settled before we even know how to read the file. We have to know that the input format is markdown and that yaml_metadata is an enabled extension.

I can see how it would be weird for pandoc to first naively parse the YAML metadata of the input markdown file without parsing the values as markdown, read out the options, and then re-parse the whole file using the specified options. It could be done, but architecturally it would be a weird thing to do for pandoc. But it would be useful.

So I wrote a simple script (~100 lines) that does exactly that: panrun.

The motivation is really that for one-off documents, I want to save the necessary pandoc options right in the file. (Just like rmarkdown users can simply open the file and hit that 'convert' button.) I don’t want to remember which document-class/style/theme I had decided to convert this document with. I don’t want to litter my filesystem with runpandoc.sh or template.html files for each one-off document. Finally, I didn’t want to “parse” YAML with sed, or use a complex tool that only works for certain options.

Anyway, I’ll see whether panrun serves me well. Let me know how it works for you: panrun/issues ;-)
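
The first step of the approach described in this comment (reading the leading YAML block as raw text, without interpreting anything as markdown) might look roughly like the following. This is not panrun's actual code, just a minimal sketch; the function name `leading_yaml_block` is hypothetical, and the sketch assumes the block starts on the very first line of the file.

```python
def leading_yaml_block(text):
    """Return the raw text of a leading YAML block, or None if absent.

    The block must open with '---' on the first line and close at the
    next line that reads '---' or '...'. Nothing inside is parsed.
    """
    lines = text.splitlines()
    if not lines or lines[0].rstrip() != "---":
        return None
    for i, line in enumerate(lines[1:], start=1):
        if line.rstrip() in ("---", "..."):
            return "\n".join(lines[1:i])
    return None  # opening marker without a closing one
```

The returned text could then be handed to a YAML parser to read out the options before pandoc is called on the whole file.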

SylvainGuieu commented 5 years ago

My option, for the template only, was to use a pre-extension on the file name, so a filename.letter.md tells my Makefile to look for a letter.tex or letter.html template file to run pandoc. This works well for me because it lets me see the main kinds of md files I have in a directory: *.tech note.md for technical notes, *.meeting.md for meeting minutes, *.letter.md for letters, etc. Each produces standardised documents by type. For html, a css can also be included with the template in the same way.

The target assignment on my make file looks like:

$(OUTPUT)/%.letter.pdf : $(SOURCE)/%.letter.md
    $(PANDOC) $(PANDOC_OPTIONS) --template /path/to/templates/letter.tex $(PANDOC_PDF_OPTIONS) -o $@ $(PANDOC_HEADERS) $< $(PANDOC_FOOTERS)

This is easy to script also in a bash file.
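
For anyone who prefers a script to a Makefile, the pre-extension lookup described above can be sketched in Python. The function name `template_for` is hypothetical, and the sketch assumes templates are named after the pre-extension, as in the letter.tex example.

```python
from pathlib import Path


def template_for(source, output_ext):
    """Pick a template name from the 'pre-extension' of a source file.

    'filename.letter.md' with output_ext 'tex' gives 'letter.tex'.
    Returns None when the file has no pre-extension (e.g. 'notes.md').
    """
    suffixes = Path(source).suffixes  # e.g. ['.letter', '.md']
    if len(suffixes) < 2:
        return None
    kind = suffixes[-2].lstrip(".")
    return f"{kind}.{output_ext}"
```

The result can be joined to the templates directory and passed to pandoc's `--template` option.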

jtkiley commented 5 years ago

Thanks all for the ideas. I adapted some of the ideas here into a form that accomplishes most of what I want, and I've been successfully using it for a couple of months.

I created a directory hierarchy where each template type has a directory with a Makefile that uses wildcards to process a markdown file with the appropriate LaTeX template to produce the requested target. So, for a new letter, I copy a markdown template, edit, and then make 20190609_example.pdf to get the typeset version. Then, once I'm done (e.g., printing, uploading, emailing), I move the markdown and pdf to a _completed subfolder.

It works well for one-off letters (usually recommendations) and envelopes. My main projects already have Makefiles, so this wasn't an issue for those. It's a little less convenient for things in the middle of one-off and projects, like a document that should be grouped with other files but isn't something that I would version control. Those are rarer for me, so they have less friction than the one-off documents, though. I do not yet have a good way of automating pandoc in a text editor, but perhaps that is a future project.

I do still hope this is eventually implemented, but I appreciate the help here in helping me think through a good way to address most of the friction.

mb21 commented 5 years ago

I do not yet have a good way of automating pandoc in a text editor

My PanWriter supports pandoc export, options are read from the document's YAML.

bpj commented 4 years ago

My rather primitive take on default options is a Perl script which looks for a file ~/.runpandocrc, ./.runpandocrc or ./runpandocrc, slurps it and splits it into a list of "words" with Text::ParseWords (using the regex (?:\s+|\#.*) as delimiter so as to allow line comments) and then invokes pandoc with this list prepended to the commandline. It has some options of its own to read in options from additional/alternative files and intercepts the --from --read -f -r and --to --write -t -w options and the -M and -V options in order to allow setting and unsetting extensions separately from formats on the command line or in the file, and to allow unsetting Metadata and variables from the file via a home-cooked syntax with --rx +=EXTENSION and the like, but mostly it just passes the command line on to pandoc. This at least has the advantage that it doesn't really require a new syntax.
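
The core of this idea (split an rc file into words, drop `#` comments, prepend the words to the command line) is small enough to sketch in Python. This is not the Perl script described above and it leaves out the interception of `--from`, `--to`, `-M` and `-V`; the function names are hypothetical, and `shlex` stands in for Text::ParseWords.

```python
import shlex


def rc_words(rc_text):
    """Split rc-file text into shell-like words, dropping '#' comments."""
    return shlex.split(rc_text, comments=True)


def build_argv(rc_text, cli_args):
    # Words from the rc file come first; the arguments typed on the
    # command line follow them in the final invocation.
    return ["pandoc", *rc_words(rc_text), *cli_args]
```

An rc file containing `--standalone  # always wanted` on one line and `--pdf-engine=xelatex` on the next would contribute exactly those two options.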

mb21 commented 4 years ago

Interestingly, with the new --defaults option (currently in the nightly builds, to be released with pandoc 2.8), we almost sort of got this. I was expecting that with this in foo.md:

---
standalone: true
---

# test file

you could run:

pandoc --defaults foo.md foo.md

But currently this fails with unexpected multiple YAML documents, probably because of the Y.decode1 in the source code. Maybe this could be changed to Y.decode and simply take the first one?

jgm commented 4 years ago

@mb21 that's a nice trick, but I think it's going to cause too many problems if we allow that.

  1. It would only work if the first YAML block in the markdown file contains only fields --defaults knows. Otherwise an error would be raised.
  2. All of these fields would go into the document's metadata, and might come out e.g. in meta tags, but they're not metadata.
  3. The fields would be parsed as markdown (perhaps harmless).

One idea I've toyed with is allowing something like:

---
defaults_:
  standalone: true
  columns: 78
# now comes the real metadata
title: Foo
...

If we taught --defaults to check the YAML for a defaults_ field and use it if present, this might work. Note that, as documented, YAML metadata fields ending in _ aren't included in metadata or parsed as markdown.
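
Until something like this exists in pandoc, the defaults_ section could be pulled out by an external script and written to a separate defaults file. The following is a sketch only: pandoc does not read a defaults_ field, the function name `defaults_section` is hypothetical, and the code is deliberately line-based rather than a real YAML parser, so it handles only the simple layout shown above.

```python
import textwrap


def defaults_section(yaml_text, key="defaults_"):
    """Return the dedented lines nested under a top-level `key:` line.

    Handles only an unindented 'defaults_:' line followed by indented
    lines; returns None when the key is not present.
    """
    lines = yaml_text.splitlines()
    try:
        start = lines.index(f"{key}:")
    except ValueError:
        return None
    nested = []
    for line in lines[start + 1:]:
        if line.strip() and not line.startswith((" ", "\t")):
            break  # next top-level key or comment ends the section
        nested.append(line)
    return textwrap.dedent("\n".join(nested)).strip("\n")
```

The returned text is itself a flat list of option settings, so it could be saved to a temporary file and passed to `--defaults`.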

jgm commented 4 years ago

Or maybe we could tell pandoc not to parse a YAML metadata section with anchor defaults:

---
&defaults
standalone: true
columns: 78
...

That looks clean.

jgm commented 4 years ago

See https://github.com/haskell-hvr/HsYAML/issues/39 for a blocking issue (though we could manually crop the input if necessary).

mb21 commented 4 years ago

Yes, my attempt was definitely a hack.

I don't think a lot of people know about YAML anchors and it will unnecessarily confuse them. But I like having the options as subfields (e.g. under defaults_:). Maybe defaults_ is not the most descriptive name though, what about something like options_ or output_?

mitinarseny commented 4 years ago

I don't know if this is relevant to this discussion or has been discussed before (there is a lot of text here), but it would be really convenient and meaningful if the YAML block in a .md document could contain a variable that specifies the extensions and other options that should be used by default to process the current document. For example, here are the contents of example.md.

---
title: Document with latex macros
_defaults:
  extensions:
    - +latex_macros
  output:
    html:
      katex: true
---

\providecommand{\mathFunc}[4]{#1\left#2\, #3 \,\right#4}
\providecommand{\mathbbFunc}[4]{\mathFunc{\mathbb{#1}}{#2}{#3}{#4}}
\providecommand{\mathrmFunc}[4]{\mathFunc{\mathrm{#1}}{#2}{#3}{#4}}
\providecommand{\Prob}[1]{\mathbbFunc{P}{(}{#1}{)}}
\providecommand{\Expect}[1]{\mathbbFunc{E}{[}{#1}{]}}
\providecommand{\Var}[1]{\mathrmFunc{Var}{[}{#1}{]}}

# Normal Distribution
Here is the definition of Normal Distribution
$$\begin{gathered}
    \left\{ \eta \sim N(\mu, \sigma^2) \right\}\\
    \Updownarrow\\
    \left\{\begin{gathered}
        F_\eta(x) = \Prob{\eta < x} = \int_{-\infty}^{x} f_\eta(x)dx,\\
        \text{where} f_\eta(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
    \end{gathered}\right\}
\end{gathered}$$

## Expected Value

$$\boxed{
    \Expect{\eta} = \mu
}$$

## Variance

$$\boxed{
    \Var{\eta} = \sigma^2
}$$

Here I define some commands with \providecommand. It makes sense to list extensions within the document, as it USES them, and it would translate to inappropriate output if latex_macros were not enabled. katex: true means that the --katex option should be enabled by default when exporting to the html document type. When I write this example.md I test it with KaTeX, and in most cases I will use it for future exporting. So, instead of writing

pandoc example.md --from=markdown+latex_macros --katex -o example.html

I would simply write:

pandoc example.md -o example.html

And get the following output in the browser:

[screenshot of example.html]

While writing this comment, I found out that latex_macros is enabled by default (see pandoc --list-extensions). But the same thing could be applied to hard_line_breaks.

Another solution would be to move some of the extensions (latex_macros, hard_line_breaks) that are associated with the way of writing (not translating) a .md document to variables, so that they can be set from the YAML block within the .md file. I find this more logical, but I am not sure I fully understand the reasons why they are extensions and not variables.
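
The translation from the _defaults block to command-line flags suggested in this comment can be sketched as follows. Everything here is an assumption: pandoc does not read a _defaults field, the function name `pandoc_argv` is hypothetical, the input format is assumed to be markdown, and the mapping is assumed to be already parsed into a Python dict.

```python
def pandoc_argv(source, output, defaults):
    """Translate the hypothetical _defaults mapping into pandoc arguments.

    `defaults` is the already-parsed mapping, e.g.
    {"extensions": ["+latex_macros"], "output": {"html": {"katex": True}}}.
    """
    argv = ["pandoc", source]
    extensions = "".join(defaults.get("extensions", []))
    if extensions:
        argv.append(f"--from=markdown{extensions}")
    # Pick the per-format options by the output file's extension.
    fmt = output.rsplit(".", 1)[-1]
    for name, value in defaults.get("output", {}).get(fmt, {}).items():
        if value is True:
            argv.append(f"--{name}")  # boolean switch such as --katex
        elif value is not False:
            argv.append(f"--{name}={value}")
    argv += ["-o", output]
    return argv
```

For the example.md metadata above, this yields the long command line from the comment, so that only `example.md` and `example.html` have to be supplied by hand.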

P.S. I'd like to thank so much everybody who contribute to Pandoc! I recently discovered it and now I am happily using it for my academic papers in uni and try to launch blog based on Pandoc and GitHub Pages.

mb21 commented 4 years ago

@mitinarseny yes, this is exactly what this issue is about :) (see the first post)

narg95 commented 4 years ago

+1

mitinarseny commented 4 years ago

It would be very useful if the YAML metadata block could also contain filters: [filter1, filter2] that need to be applied by default to this document, in the corresponding order. A --filter filter3 CLI option should append the filter to the filters declared in YAML. A --no-yaml-filters option that cancels the use of these filters would be useful, too.
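
The merge rule proposed here is simple enough to state in code. Note that neither a filters metadata field read by pandoc nor a --no-yaml-filters option exists; both are part of the proposal, and the function name `effective_filters` is hypothetical.

```python
def effective_filters(yaml_filters, cli_filters, no_yaml_filters=False):
    """Combine filters from the YAML block with those from the command line.

    YAML filters run first, in their declared order; filters given on the
    command line are appended. With no_yaml_filters set, only the
    command-line filters remain.
    """
    base = [] if no_yaml_filters else list(yaml_filters)
    return base + list(cli_filters)
```

Each name in the result would then be passed to pandoc as a separate `--filter` argument.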

mb21 commented 4 years ago

@mitinarseny see also #5870

kysko commented 4 years ago

I hope this is the right place for these two comments:

Order of options on command line

How does the -M or --metadata option play into this? It seems to depend on the order on the command line.

Say I have the following markdown (a.md), default yaml (d0.yaml) and command line:

a.md:

# Foo

d0.yaml:

standalone: true
setext-headers: false
metadata:
  author: me

Command line:

pandoc a.md -M test1 -d d0 -M test2 -o a_result.md

Then we have the result a_result.md:

---
author: me
test2: true
---

# Foo

So if an -M is placed before -d d0, it is ignored if there's a metadata option in the default, even if the latter doesn't have that particular metadata key. When the metadata lines are removed in d0.yaml, both tests come out. If this is the expected result, perhaps a few words in the manual would be good.

However, when putting test: true in a standalone metadata file, the result is as expected (not ignored) whether it is put before or after -d d0.

atx/setext options

Since --defaults was described as a way to "specify a package of options", I began by inserting atx-headers: true in the above default yaml, but got an error. Checking the example, I saw it should be setext-headers: false instead. Yet, I see no --setext-headers option for command line in the Manual. Not a problem, I just wonder why it doesn't reflect the existing --atx-headers option, for consistency.

jgm commented 4 years ago

@kysko these useful comments should go in a separate issue, as they don't concern the feature under discussion here, but rather the behavior of the --defaults option.

kysko commented 4 years ago

Done. Sorry, I thought this was the issue of origin leading to --defaults

ghost commented 4 years ago

+1. I would really like this to be implemented. Panzer is no longer being developed because most of its functionality is now integrated into Pandoc itself. Even though Panrun and Pandocomatic are still active, removing the dependency on external tools would be nice.

bpj commented 4 years ago

https://pandoc.org/MANUAL.html#default-files

OK not in the document metadata but good enough for me!

lyndondrake commented 4 years ago

Now that we have --defaults, is there any chance that there might be a default --defaults file? Where this would help is other tools which invoke Pandoc (e.g. Hugo) but where it's impossible to change the command line passed to Pandoc.

jgm commented 4 years ago

I'd worry about the security implications of a default defaults file. But this shouldn't be discussed here -- use pandoc-discuss.

jgm commented 4 years ago

@lyndondrake for your use case why not create a shell script that passes on arguments to pandoc and includes some new ones? Name it pandoc and put it in your path before real pandoc, so Hugo will use it.

alerque commented 4 years ago

@jgm I can think of several reasons that is a bad solution. It's a hack that could work, but not a solution. First, it would not be project specific and would break other projects unless you did some very creative hacking with env and path variables. And even if you did, catching things in the PATH before system paths is a bad idea for many reasons and strongly discouraged by most sysadmins. Whatever you did to hack that in inevitably wouldn't be portable and wouldn't map well to use in CI runners, etc.

lyndondrake commented 4 years ago

I'd worry about the security implications of a default defaults file. But this shouldn't be discussed here -- use pandoc-discuss.

Apols - I'll take it across there.

brainchild0 commented 4 years ago

Is this issue fully superseded by #5790 and #5870?

tarleb commented 3 years ago

I think @brainchild0 is right, and remaining issues should be discussed in #5870.

hoclun-rigsep commented 2 years ago

The pandoc -d doc.md doc.md approach described in the top post here fails for me on a recent version with "Multiple YAML documents encountered."

jgm commented 2 years ago

@hoclun-rigsep I believe this is due to our switch from HsYaml to yaml for YAML parsing. I may be able to add some code to restore the former behavior.

jgm commented 2 years ago

OK, this should work again after 0d1ba3dce33b8d5d30d7cf8febfa8ea3060b5dfd

mboyea commented 2 months ago

The pandoc -d doc.md doc.md approach described in the top post here fails for me on a recent version with "Multiple YAML documents encountered."

This is failing for me today in Pandoc 3.2 as well.