jgm / pandoc

Universal markup converter
https://pandoc.org

Markdown Writer RFE: add CL option to control format of code blocks #5280

Open rose00 opened 5 years ago

rose00 commented 5 years ago

In our workflows on the Java spec we prefer ATX-style headers and backtick-fenced code blocks. The default setting of the writer produces setext headers (for levels 1 and 2) and indented code blocks (when there are no block attributes).

The option --atx-headers lets us get the headers we want but there's no way to ask the writer to prefer backtick blocks. A command line option --unindented-code-blocks for the writer would do the trick for us.

See also https://github.com/jgm/pandoc/issues/2120 which documents the decision not to express this as a subtracted feature like -tmarkdown-indented_code_blocks. That's fine, so I'm hoping a parallel feature to --atx-headers will get more love.

rose00 commented 5 years ago

Workaround: Run pandoc twice as a MD-to-MD filter. The first pass converts fenced blocks to indented blocks. The second pass, with --indented-code-classes=foo, forces conversion of indented blocks back to fenced. Then post-process with sed ' s/``` {\.foo}/```/' to remove the temporary attributes.
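For anyone who would rather script this than chain shell commands, the two passes plus the cleanup can be sketched in Python. This is only a sketch under assumptions: pandoc is on the PATH, foo is the throwaway class from above, and the helper names (strip_temp_class, run_pandoc, roundtrip) are made up for illustration. The regular expression accepts the opening fence written with either {.foo} or a bare foo after the backticks, on the assumption that the spelling can differ between pandoc versions.

```python
import re
import subprocess

TEMP_CLASS = "foo"  # throwaway class name; any otherwise-unused name works

# Opening fence that carries only the temporary class, written either
# as {.foo} or as a bare foo after the backticks. Only spaces and tabs
# are allowed around the class, so the match never spans two lines.
_FENCE_RE = re.compile(
    r"^(?P<fence>`{3,})[ \t]*(?:\{\."
    + re.escape(TEMP_CLASS)
    + r"\}|"
    + re.escape(TEMP_CLASS)
    + r")[ \t]*$",
    re.MULTILINE,
)


def strip_temp_class(markdown: str) -> str:
    """Remove the temporary class from opening fences, leaving bare fences."""
    return _FENCE_RE.sub(lambda m: m.group("fence"), markdown)


def run_pandoc(text: str, *extra_args: str) -> str:
    """Run pandoc as a Markdown-to-Markdown filter (assumes pandoc is installed)."""
    result = subprocess.run(
        ["pandoc", "-f", "markdown", "-t", "markdown", *extra_args],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def roundtrip(markdown: str) -> str:
    """Pass 1 yields indented blocks, pass 2 tags them, then the tag is stripped."""
    indented = run_pandoc(markdown)
    tagged = run_pandoc(indented, f"--indented-code-classes={TEMP_CLASS}")
    return strip_temp_class(tagged)
```

One caveat: a block that was deliberately given the class foo in the input would lose it as well, so the throwaway name should be one that never appears in real documents.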

rose00 commented 5 years ago

I realize this feature is on a slippery slope toward controlling more and more details of non-semantic variation in the markdown writer.

One of the oddities of markdown is that major semantic entities like headers, code blocks, lists, and tables have several surface notations. They are there because different users like different notations. But the markdown Writer must be opinionated about output notations, since the AST (by design) does not include such non-semantic distinctions. This in turn causes a user like me to reach for a way to tweak the writer to change its opinion about surface syntax. Subtracting standard markdown features like setext headers and indented code blocks seems like a natural way to do this. The slippery slope has additional steps for other non-semantic variations, such as the character and spacing that introduces lists, and table styles.

Here's a suggestion for going down a different slippery slope, that puts the burden on script writers: Have a markdown Reader configuration option which records non-semantic input information in extra attributes. Make the attributes really obvious, and then let scripts filter them out or change them:

For example, a backtick-fenced code block containing the single line "some code" would parse to [CodeBlock ("",[],[("markdown-nonsemantic-block-type","backtick")]) "some code"]

This is half-baked, though, since it would immediately call for Writer logic to reverse the process. I guess the --indented-code-classes approach is better, because it gives control to the advanced user.
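To make the "let scripts filter them out" half concrete, here is a rough Python sketch that works on pandoc's JSON AST (as produced by -t json). The attribute name markdown-nonsemantic-block-type is the hypothetical one from this proposal; pandoc does not emit it today, and the function names are invented for illustration. It only looks at CodeBlock elements.

```python
import json

# Hypothetical attribute prefix from the proposal above; not emitted by pandoc.
NONSEMANTIC_PREFIX = "markdown-nonsemantic-"


def strip_nonsemantic(node):
    """Return a copy of the AST with prefixed key-value attributes
    removed from every CodeBlock; everything else is copied unchanged."""
    if isinstance(node, list):
        return [strip_nonsemantic(child) for child in node]
    if isinstance(node, dict):
        if node.get("t") == "CodeBlock":
            # A CodeBlock is serialized as [[id, classes, keyvals], text].
            (ident, classes, keyvals), text = node["c"]
            kept = [kv for kv in keyvals if not kv[0].startswith(NONSEMANTIC_PREFIX)]
            return {"t": "CodeBlock", "c": [[ident, classes, kept], text]}
        return {key: strip_nonsemantic(value) for key, value in node.items()}
    return node


def strip_nonsemantic_json(ast_json: str) -> str:
    """Apply strip_nonsemantic to a JSON-encoded pandoc document."""
    return json.dumps(strip_nonsemantic(json.loads(ast_json)))
```

A script that wanted to change the notation instead of dropping it would rewrite the value in the same place, which is exactly the point where matching Writer logic would be needed.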

So maybe the real RFE here is --backtick-code-classes and --fenced-code-classes options, or just --code-block-classes (for all three) to gain a little more control over simple unattributed code blocks of all sorts.

AllenDowney commented 3 years ago

This feature would also help me. For my use case, I need to parse the output from pandoc, and it would be easier to work with backtick code blocks.

jgm commented 3 years ago

@AllenDowney, instead of parsing the output from pandoc, consider using a filter. Then you can interact with data that is already parsed and structured by pandoc.
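As a rough illustration of working with the parsed structure rather than the Markdown text, the sketch below walks pandoc's JSON output (pandoc -t json) with nothing but the Python standard library and collects every code block. Lua filters or a filter library would be the more usual route; the function names here are made up for the example.

```python
import json


def iter_code_blocks(node):
    """Yield (classes, text) for every CodeBlock nested anywhere in node."""
    if isinstance(node, list):
        for child in node:
            yield from iter_code_blocks(child)
    elif isinstance(node, dict):
        if node.get("t") == "CodeBlock":
            # A CodeBlock is serialized as [[id, classes, keyvals], text].
            (_ident, classes, _keyvals), text = node["c"]
            yield classes, text
        else:
            for value in node.values():
                yield from iter_code_blocks(value)


def code_blocks_from_json(ast_json: str):
    """Parse a JSON-encoded pandoc document and list its code blocks."""
    document = json.loads(ast_json)
    return list(iter_code_blocks(document["blocks"]))
```

Because the blocks arrive already parsed, it no longer matters whether the writer would have printed them indented or fenced.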

rose00 commented 3 years ago

Here's an example script.

It is only a partial workaround, because it requires a sed script to post-process the adjusted Pandoc output. Ideally, Pandoc would have an option like --fenced-code-blocks which would request that the writer avoid the hardwired indented code block syntax, just like --atx-headers avoids the hardwired setext headers. The claimed problem with those hardwired output notations is that they are harder for simple scripts to parse than ## and ``` notations, because they rely on column counting across multiple lines (setext headers) and on relative indentation (indented code blocks). It is acknowledged that Markdown is designed more for human eyes than for simple parsers, and that trying to adjust Markdown output to be "simple to parse" is a slippery slope.

pandoc-codeblock.lua.txt
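For readers who cannot open the attachment, the general shape of such a filter can be sketched in Python against pandoc's JSON AST. This is not the attached Lua script, just an illustration of the same idea under assumptions: every code block with no attributes gets a placeholder class (foo here, a hypothetical name) so that the writer emits a fence, and the placeholder still has to be removed from the Markdown afterwards.

```python
import json

PLACEHOLDER = "foo"  # hypothetical throwaway class, stripped again afterwards


def tag_plain_code_blocks(node, placeholder=PLACEHOLDER):
    """Return a copy of the AST in which every attribute-less CodeBlock
    carries the placeholder class; attributed blocks are left alone."""
    if isinstance(node, list):
        return [tag_plain_code_blocks(child, placeholder) for child in node]
    if isinstance(node, dict):
        if node.get("t") == "CodeBlock":
            # A CodeBlock is serialized as [[id, classes, keyvals], text].
            (ident, classes, keyvals), text = node["c"]
            if not ident and not classes and not keyvals:
                classes = [placeholder]
            return {"t": "CodeBlock", "c": [[ident, classes, keyvals], text]}
        return {
            key: tag_plain_code_blocks(value, placeholder)
            for key, value in node.items()
        }
    return node


def tag_json(ast_json: str, placeholder: str = PLACEHOLDER) -> str:
    """Apply tag_plain_code_blocks to a JSON-encoded pandoc document."""
    return json.dumps(tag_plain_code_blocks(json.loads(ast_json), placeholder))
```

The intended use would be to feed it the output of pandoc -t json and pass the result to pandoc -f json -t markdown. The cleanup step remains, which is why this stays a partial workaround.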

jez commented 1 year ago

I'm currently working around this by invoking pandoc with --indented-code-classes=plain. Pandoc treats plain as an ordinary class name, and other markdown processors that use fenced code blocks tend to render it as plain, uncolored text.