jgm / pandoc

Universal markup converter
https://pandoc.org

Add a manual (or manual section): Customizing Pandoc #3288

Open jgm opened 7 years ago

jgm commented 7 years ago

Topics:

crmackay commented 7 years ago
jgm commented 7 years ago

Thanks for the suggestion, I've added it to the list.

ickc commented 7 years ago

Suggestion: subtopics for filters: AST. Also see #3262.

Example:

I recently learned about one of the undocumented features of the AST regarding RawInline/RawBlock. I tried to make some sense of it by running the following tests:

# Raw LaTeX
$ printf "%s\n\n" '\LaTeX' | pandoc -f markdown -t native
[Para [RawInline (Format "tex") "\\LaTeX"]]
$ printf "%s" '[Para [RawInline (Format "tex") "\\LaTeX"]]' | pandoc -t markdown -f native
\LaTeX
$ printf "%s" '[Para [RawInline (Format "latex") "\\LaTeX"]]' | pandoc -t markdown -f native
\LaTeX
$ printf "%s" '[Para [RawInline (Format "beamer") "\\LaTeX"]]' | pandoc -t markdown -f native  

$ printf "%s" '[Para [RawInline (Format "beamer") "\\LaTeX"]]' | pandoc -t beamer -f native
\begin{frame}

\end{frame}
# Raw HTML
$ printf "%s" '[RawBlock (Format "html") "<html>"]' | pandoc -t markdown -f native
<html>

$ printf "%s" '[RawBlock (Format "html5") "<html>"]' | pandoc -t markdown -f native

$ printf "%s" '[RawBlock (Format "html5") "<html>"]' | pandoc -t html5 -f native

$ printf "%s" '[RawBlock (Format "markdown") "\\LaTeX"]' | pandoc -t markdown -f native
\LaTeX
$ printf "%s" '[RawBlock (Format "markdown") "\\LaTeX"]' | pandoc -t latex -f native

What we see here:
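One way to tabulate the transcript above is sketched below in Python. It records only what those particular runs printed, so it should be read as a summary of the observations and not as a statement of pandoc's general rules; the names `OBSERVED` and `formats_kept_by` are made up for this illustration.

```python
# (raw format, writer) -> was the raw text emitted in the runs shown above?
# Copied from the transcript; behaviour may differ in other pandoc versions.
OBSERVED = {
    ("tex", "markdown"): True,
    ("latex", "markdown"): True,
    ("beamer", "markdown"): False,
    ("beamer", "beamer"): False,    # only an empty frame was printed
    ("html", "markdown"): True,
    ("html5", "markdown"): False,
    ("html5", "html5"): False,
    ("markdown", "markdown"): True,
    ("markdown", "latex"): False,
}


def formats_kept_by(writer):
    """Return the raw formats whose content survived for the given writer."""
    return sorted(
        fmt for (fmt, w), kept in OBSERVED.items() if w == writer and kept
    )


print(formats_kept_by("markdown"))
```

Read this way, the markdown writer kept raw content tagged `tex`, `latex`, `html` and `markdown`, and dropped content tagged `beamer` or `html5`.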

tajmone commented 5 years ago

Definitely. Documentation on pandoc's AST is something I've wanted for a long time.

Not for using filters, but to be able to create an application that generates a valid pandoc document in JSON AST format.

For example, I've created a feature request for an Asciidoctor to pandoc AST backend:

But I haven't managed to find a document that lists all the elements supported by pandoc's AST.
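As a rough illustration of what "generating a valid pandoc document in JSON AST format" involves, here is a minimal Python sketch. It assumes the JSON layout that pandoc prints with `-t json` (a top-level object with `pandoc-api-version`, `meta` and `blocks`, and elements tagged with `t` and `c`). The helper names are hypothetical, and the API version `1.23.1` is an assumption that has to be compatible with the installed pandoc.

```python
import json


def text_to_inlines(text):
    """Turn a plain string into Str nodes separated by Space nodes."""
    inlines = []
    for i, word in enumerate(text.split()):
        if i > 0:
            inlines.append({"t": "Space"})
        inlines.append({"t": "Str", "c": word})
    return inlines


def make_document(paragraphs, api_version=(1, 23, 1)):
    """Build a document with empty metadata and one Para per input string.

    api_version is an assumption; adjust it to the pandoc you have installed.
    """
    return {
        "pandoc-api-version": list(api_version),
        "meta": {},
        "blocks": [{"t": "Para", "c": text_to_inlines(p)} for p in paragraphs],
    }


doc = make_document(["Hello pandoc AST"])
print(json.dumps(doc))
```

The printed JSON can be piped into `pandoc -f json -t markdown` to check that pandoc accepts it. Only `Para`, `Str` and `Space` are covered here; every other element needs its own encoding.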

mb21 commented 5 years ago

@tajmone per definition (pun intended), see https://github.com/jgm/pandoc-types/blob/master/Text/Pandoc/Definition.hs#L95

jgm commented 5 years ago

Re-opening until we resolve some of the TODOs, but we now have a start on this in doc/customizing-pandoc.md.

MattDodsonEnglish commented 1 year ago

> But haven't managed to find a document that lists all of pandoc's AST supported elements.

When I was looking at the filter docs (related: https://github.com/jgm/pandoc/issues/8750), I started wondering where I'd find the AST elements, so I went looking for a reference. That led me to this issue.

I think I'll make it a hobby project to put together some AST docs. I can start by looking at the code in src/Text/Pandoc/Definition.hs.

jgm commented 1 year ago

Isn't Text.Pandoc.Definition's haddock page a good canonical reference for the AST elements? I'm curious what more you think would be needed in the way of documentation.

tarleb commented 1 year ago

One of the downsides of the Haddock page is that it contains a lot of info and can be overwhelming. E.g., I'd guess that it's not immediately obvious to a user unfamiliar with Haskell that the list of instances can be skipped at first reading, but that the constructors are important: instances take up half my screen when I load that page, while constructors are just two lines.

tajmone commented 1 year ago

> I'm curious what more you think would be needed in the way of documentation.

I totally agree with @tarleb on the current docs being overwhelming to non-Haskell users.

AST Spec as JSON/YAML Document within Pandoc Binary

Ideally, I'd love for pandoc to include a JSON (or YAML) file with the full AST specification (node name, type, attributes, etc.). If pandoc could auto-generate this JSON/YAML file (either within the source repository, or directly from the pandoc executable binary) and then provide a CLI command to emit it (e.g. --print-ast-spec), it would make life very easy, since the spec would be available without having to surf the web for that info.

The reason I think a JSON or YAML file would be better (i.e. rather than a markdown doc, etc.) is that while both formats are human-friendly enough to be consulted as they are, they can also be easily parsed and rendered in whatever format one prefers (e.g. via Mustache templates) to create ad hoc documents. And, with the JSON/YAML spec included within the executable, it would be very simple to set up any pandoc-related project to update its AST reference documentation with each new pandoc release, by parsing the spec and re-generating the document via automated scripts.

As an example of how this might work, the PML (Practical Markup Language) tool does this by exporting its document tags as a JSON file via the export_meta_data CLI option. The generated JSON file looks like this:

PML JSON AST

```json
{
  "pml_meta": {
    "pml_version": "4.0.0",
    "pml_release_date": "2023-02-23",
    "nodes": [
      {
        "id": "admon",
        "type": "block",
        "title": "Admonition",
        "description": "A labeled piece of advice, such as a note, tip, warning, etc.",
        "examples": "[admon\n [alabel Tip]\n Later you'll see some [i striking] examples.\n]",
        "attributes": [
          {
            "id": "id",
            "type": "id or null",
            "required": false,
            "default_value": "null",
            "position": null,
            "title": "Node Identifier",
            "description": "A unique identifier for the node.\n\nAn id can be used to:\n- identify a node so that an internal link can be done with an 'xref' (cross reference) node.\n- identify a node so that it can be styled individually with CSS\n- create an HTML anchor so that it can be accessed with the # (hash) sign (e.g. writing id=foo will enable you to have an HTML link ending with #foo.\n\nAn identifier must start with a letter or an underscore (_), and can be followed by any number of letters, digits, underscores (_), dots (.), and hyphens (-).. Note for programmers: The regex of an identifier is: [a-zA-Z_][a-zA-Z0-9_\\.-]*. Identifiers are case-sensitive. The following identifiers are all different: name, Name, and NAME.\n",
            "examples": "id = basic_concept"
          }
        ],
        "HTML_attributes_allowed": true,
        "is_inline_type": false,
        "is_raw_text_block": false,
        "child_nodes_allowed": true,
        "opening_tag": "[admon",
        "latest_doc_url": "https:\/\/www.pml-lang.dev\/docs\/reference_manual\/index.html#node_admon"
      },
```

Here's an example project where I use Mustache templates to generate different AST spec documents (markdown, AsciiDoc and plain text) from the JSON files, filtering specific keys and values to tailor each document:

https://github.com/tajmone/pml-playground/tree/main/mustache

This lets me quickly regenerate up-to-date spec docs on PML nodes/AST whenever PML is updated, in an automated way.

So, something along those lines would work for the pandoc AST too (IMO), allowing end users to represent the final spec document whichever way they prefer, thanks to the JSON/YAML spec being always available (as a single document) via the pandoc binary itself.
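To make the idea concrete, here is a small Python sketch of rendering such a spec into a markdown reference. The `SPEC` dictionary is hypothetical: its keys are borrowed from the PML excerpt, and pandoc does not currently emit any such file. Plain string formatting stands in for the Mustache templates mentioned earlier, to keep the example free of dependencies.

```python
# Hypothetical spec; the keys mirror the PML excerpt, not anything pandoc emits.
SPEC = {
    "nodes": [
        {
            "id": "admon",
            "type": "block",
            "title": "Admonition",
            "description": "A labeled piece of advice, such as a note, tip, warning, etc.",
            "attributes": [
                {"id": "id", "type": "id or null", "required": False},
            ],
        },
    ]
}


def render_markdown(spec):
    """Render one markdown section per node, with a bullet per attribute."""
    lines = []
    for node in spec["nodes"]:
        lines.append(f"## {node['title']} (`{node['id']}`, {node['type']})")
        lines.append("")
        lines.append(node["description"])
        lines.append("")
        for attr in node.get("attributes", []):
            need = "required" if attr["required"] else "optional"
            lines.append(f"- `{attr['id']}`: {attr['type']}, {need}")
    return "\n".join(lines)


print(render_markdown(SPEC))
```

Swapping `render_markdown` for an AsciiDoc or plain-text renderer would reuse the same spec unchanged, which is the main appeal of shipping the data rather than a finished document.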

jgm commented 1 year ago

> If pandoc could auto-generate this JSON/YAML file (either within the source repository, or directly from the pandoc executable binary) and then provide a CLI command to emit it (e.g. --print-ast-spec) it would make life very easy, since it would be available without having to surf the web for that info.

+1 on autogenerating. What I want to avoid is a manually produced document that could get out of sync. [I guess it should be possible, because everything is a Generic and Typeable instance.]

MattDodsonEnglish commented 1 year ago

Maybe for an "evergreen" strategy, it'd be possible to have:


Note: I typed this before I saw the previous two responses in this thread, but @tarleb and @tajmone covered a lot of what I was thinking too. Here's the experience of someone with only rudimentary programming knowledge and zero Haskell.

> Isn't Text.Pandoc.Definition's haddock page a good canonical reference for the AST elements? I'm curious what more you think would be needed in the way of documentation.

Oh, interesting. I actually had scanned that but somehow didn't associate what I was looking at with a document structure. I might have understood if I'd opened the Block element, but the first element was Pandoc, which was too abstruse for me. Maybe I expected a tree/JSON representation, maybe with a little diagram of nodes (like how MDN represents the DOM).

So, besides that it may start at too sharp a grade, I can't comment on that doc's usefulness since I haven't used it. :-) I'm going to experiment with using that as a reference to make some Lua filters and see how I do.

I can speak of my experience as a Pandoc user at the lower end of technical proficiency, though. Basically, I'm only interested in transformation at the highest level: I just want a nice set of examples, a reference that I could look at, and maybe some example "lorem ipsum" style docs that I could inspect with pandoc -t native.

I don't know Haskell at all, so something like this walk :: (Block -> Block) -> TableFoot -> TableFoot wasn't understandable to me. I suppose the reference examples assume Haskell knowledge, but considering that Pandoc filtering is polyglot, should that be necessary? Maybe a complementary, language-agnostic doc would make it easier for a wider audience to understand how to manipulate document structures.
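For readers without Haskell, the idea behind `walk` can be approximated in Python over the JSON form of the AST: visit every element in a structure and replace it with the result of a function. The sketch below is a simplification and not pandoc's implementation. In particular, the Haskell version is typed (the signature above only touches Block values inside a TableFoot), whereas this untyped version visits every element that carries a `t` tag. The `shout` function is a made-up example.

```python
def walk(node, fn):
    """Apply fn to every AST element (a dict with a "t" key), children first."""
    if isinstance(node, list):
        return [walk(item, fn) for item in node]
    if isinstance(node, dict):
        rebuilt = {key: walk(value, fn) for key, value in node.items()}
        return fn(rebuilt) if "t" in rebuilt else rebuilt
    return node  # strings, numbers, booleans and None are left untouched


def shout(element):
    """Upper-case the text of Str elements and leave everything else alone."""
    if element["t"] == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return element


para = {
    "t": "Para",
    "c": [{"t": "Str", "c": "hello"}, {"t": "Space"}, {"t": "Str", "c": "world"}],
}
print(walk(para, shout))
```

The input is not modified; `walk` returns a rebuilt copy, which loosely matches how the Haskell function returns a new value of the same type.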

But, again, once I dig into those docs a bit, the reference will probably make more sense. I think Pandoc casts a pretty wide net (the Getting Started docs explain what pwd is), so I just posted this to document the experience of a reader with "fresh eyes."