Allow empty/ default attributes in markdown codeblocks

dmadisetti commented 3 months ago

Describe your proposed improvement and the problem it solves.

Currently ```{attribute=} is fine, so ```{attribute} should be fine. I think this notation is a tiny bit cleaner; and is consistent with HTML. This is currently breaking behavior that causes the code block to not be recognized at all, but this is also fine on looser matches like github:

```md {example}
# check raw


This also solves https://github.com/jgm/pandoc/issues/8645 since quarto can just use the empty attribute (they currently have an additional filter to get around this behavior).

**Implementation**
This might be a bit far-reaching (since `attribute` is used in multiple places), but here's a simple fix:

```haskell {emptyAttr}
attribute = identifierAttr <|> classAttr <|> keyValAttr <|> specialAttr <|> emptyAttr
-- ...
emptyAttr :: PandocMonad m => MarkdownParser m (Attr -> Attr)
emptyAttr = try $ do
  key <- identifier
  return $ \(id',cs,kvs) ->(id',cs,kvs ++ [(key,"")])

But could also scope this to just codeblockFenced

Describe alternatives you've considered.

Just be more verbose and use the {attribute=} notation, but this "feature" borders on bug for me.

Let me know if this is a won't fix- otherwise I'll submit a PR

dmadisetti commented 2 months ago

@jgm, I'm about to start working on this unless you disagree (it's a slow Friday night)

jgm commented 2 months ago

I just haven't decided on this issue yet.

The suggestion has come up before -- perhaps in these issues, perhaps on pandoc-discuss, perhaps in commonmark issue tracker or forum, I can't recall. Some have advocated reserving the bare names for another purpose. It would be good to locate the prior discussion before proceeding further.

dmadisetti commented 2 months ago

I didn't see anything in discussions, but here's a quick scan of 100+ open issues with the very loose search "markdown attribute" (search quey: https://github.com/search?q=repo%3Ajgm%2Fpandoc+markdown+attribute++&type=issues&state=open&p=12):

quarto complaining that 2 -> 3 broken directives (already mentioned in issue #8645)
5992 for general expansion of cell block attributes
7622 and other MYST issues with their weird syntax. I don't think think that this will help them, as there are more weird things in their syntax.

I don't see anyone claiming this exact solution, you hint that to solve the issue quarto is facing, attribute behavior would have to be addressed in https://github.com/jgm/pandoc/issues/8645#issuecomment-1440505856

The only case I can really think that would be affected is the quarto implementation. I had a quick look, and their work around is in this filter: https://github.com/quarto-dev/quarto-cli/blob/e156acb080bdf3a51ed51a179129fcab06122c4a/src/resources/filters/quarto-pre/engine-escape.lua

regex here:

https://github.com/quarto-dev/quarto-cli/blob/e156acb080bdf3a51ed51a179129fcab06122c4a/src/resources/filters/modules/patterns.lua#L29

The proposed change would just add a superfluous attribute to quarto documents since they check directly against the text, but it wouldn't break their pipeline. I did a quick test:

Here's test.md

```{python}
import numpy as np

import numpy as np


Here's current behavior for `pandoc test.md -o test.html`

```html
<p><code>{python} import numpy as np</code></p>
<p><code>python {foo} import numpy as np</code></p>

Here's the result of the proposed behavior:

<pre data-python=""><code>import numpy as np</code></pre>
<div class="sourceCode" id="cb2" data-foo=""><pre
class="sourceCode python"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> numpy <span class="im">as</span> np</span></code></pre></div>

jgm commented 2 months ago

Here is one of the issues I may have been thinking of (in djot, but djot uses basically the same attribute syntax as pandoc's markdown): https://github.com/jgm/djot/issues/257

The worry expressed there is that if bare words are allowed, there's more chance of "false positives" for attributes: {This is a sentence in braces} would be attributes, for example.

With current attribute syntax this kind of "false positive" would be very unlikely.

You might think that this isn't a big problem in pandoc, since attributes are only allowed in certain special contexts. But we might want to expand this to allow attributes on every element.

I'm not sure how much of a concern this is, but it's the sort of issue that needs some thought before there is any expansion of current attribute syntax.

cscheid commented 2 months ago

This also solves https://github.com/jgm/pandoc/issues/8645 since quarto can just use the empty attribute (they currently have an additional filter to get around this behavior).

Thanks for connecting the two issues, but (as a Quarto dev) let me just say that Quarto already has enough going on and workarounds for the current behavior. I don't think it's necessary to try and address #8645 together with this.

dmadisetti commented 2 months ago

Hmm, the conversation in djot does add some more context. Quick search for /\{(\w+\s+))+\w\}/ on github https://github.com/search?q=%2F%5C%7B%28%5Cw%2B%5Cs%2B%29%2B%5Cw%5C%7D%2F+language%3AMarkdown&type=code&p=2 reveals matches mainly used in tex settings or within code blocks for typescript. I found one instance of notes without tex on set theory, but even that had commas in the braces that would mark it as invalid.

Single token {words} turns up a few more cases. People are already doing this for code blocks, and I found one person doing it to denote return type (e.g. {integer}), which is an interesting stylistic choice.

Obviously, GitHub does not cover all markdown text, but I don't think this is an observed pattern.

jgm / pandoc

Allow empty/ default attributes in markdown codeblocks #9791

Let me know if this is a won't fix- otherwise I'll submit a PR

5992 for general expansion of cell block attributes

7622 and other MYST issues with their weird syntax. I don't think think that this will help them, as there are more weird things in their syntax.