jgm / djot

A light markup language
https://djot.net
MIT License

Possible parsing ambiguity: attribute key starting with punctuation #242

Open faelys opened 10 months ago

faelys commented 10 months ago

Hello,

I'm still discovering the syntax and trying to understand the existing parser, so please let me know if I missed something.

As far as I understand, attribute keys and bare values allow _, :, and - anywhere, including as the first character of a key and the last character of a value.

Therefore:

I find it unsatisfying that = can have a potentially very long range, and that the overall (admittedly contrived) construct is hard to disambiguate mentally.

The same thing happens with - instead of _, replacing em with del.

Wouldn't it be simpler for both humans and parsers to forbid punctuation at the beginning of an attribute key? (Or would it break too much existing text?)

jgm commented 10 months ago

This is the grammar in the comments in attributes.ts:

 * syntax:
 *
 * attributes <- '{' whitespace* attribute (whitespace attribute)* whitespace* '}'
 * attribute <- identifier | class | keyval
 * identifier <- '#' name
 * class <- '.' name
 * name <- (nonspace, nonpunctuation other than ':', '_', '-')+
 * keyval <- key '=' val
 * key <- (ASCII_ALPHANUM | ':' | '_' | '-')+
 * val <- bareval | quotedval
 * bareval <- (ASCII_ALPHANUM | ':' | '_' | '-')+
 * quotedval <- '"' ([^"] | '\"') '"'
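
To make the point about leading punctuation concrete, the key and bareval productions above can be rendered as regular expressions. This is only a rough sketch: the names KEY, BAREVAL, and isBareKeyval are hypothetical and are not taken from attributes.ts, and quoted values are left out.

```typescript
// Hypothetical regex rendering of the `key` and `bareval` productions quoted above.
// Not the parser's actual code, only a sketch of what the grammar as written accepts.
const KEY = /^[A-Za-z0-9:_-]+$/;
const BAREVAL = /^[A-Za-z0-9:_-]+$/;

// Returns true when `text` has the shape key '=' bareval.
// Quoted values are deliberately ignored in this sketch.
function isBareKeyval(text: string): boolean {
  const eq = text.indexOf("=");
  if (eq < 0) return false;
  return KEY.test(text.slice(0, eq)) && BAREVAL.test(text.slice(eq + 1));
}

console.log(isBareKeyval("_key=value_")); // true
console.log(isBareKeyval("-key=value-")); // true
console.log(isBareKeyval(".key=value"));  // false
```

Under these rules, the text between the braces in {_key=value_} is a well-formed keyval, which is why it can collide with the {_ ... _} emphasis syntax.
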

jgm commented 10 months ago

I see the issue with allowing _ and - at the beginning of a key name, given the syntactic roles of {_ and {-. I'm open to tightening up the syntax here. @matklad any thoughts?

matklad commented 9 months ago

Thoughts:

I am torn about what's the best solution here. Given that we already assign special meaning to .ident and #ident, it seems safest to require keys and values to start with ASCII_ALPHANUM (and then also allow . in the middle)
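
A minimal sketch of what that stricter rule might look like as a regular expression. STRICT_KEY and isStrictKey are hypothetical names, and the sketch assumes that "in the middle" means "anywhere after the first character"; the thread does not settle whether a trailing . would be allowed.

```typescript
// Hypothetical stricter key rule: the first character must be ASCII alphanumeric,
// later characters may also be ':', '_', '.', or '-'.
// Assumption: '.' is allowed in any position after the first, including the last.
const STRICT_KEY = /^[A-Za-z0-9][A-Za-z0-9:_.-]*$/;

function isStrictKey(key: string): boolean {
  return STRICT_KEY.test(key);
}

console.log(isStrictKey("key"));      // true
console.log(isStrictKey("data.key")); // true
console.log(isStrictKey("_key"));     // false
console.log(isStrictKey("-key"));     // false
console.log(isStrictKey(".key"));     // false
```

With a rule like this, {_key=value_} and {-key=value-} could no longer be read as attributes at all, so only the em and del interpretations would remain.
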

matklad commented 9 months ago

Actually, did something change? I can no longer reproduce the original example on the playground.

Here's what I get

word{_key=value_}
<p>word<em>key=value</em></p>

word{_key=value}
<p>word{_key=value}</p>

So it seems that we never parse {_ as an attribute, and { _ is required to disambiguate.
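
The observed precedence can be modeled as a one-character lookahead after the opening brace. This is a sketch of the behavior seen on the playground, not of how the parser is actually written; BraceRole and classifyBrace are hypothetical names, and only the {_ and {- cases discussed in this thread are covered.

```typescript
// Hypothetical model of the precedence observed above: a brace immediately
// followed by '_' or '-' is taken as an inline-markup opener, never as the
// start of an attribute block.
type BraceRole = "emphasis-opener" | "delete-opener" | "attribute-candidate";

function classifyBrace(text: string): BraceRole | null {
  if (!text.startsWith("{")) return null;
  const next = text.charAt(1);
  if (next === "_") return "emphasis-opener";
  if (next === "-") return "delete-opener";
  // Anything else, including whitespace, may still begin an attribute block.
  return "attribute-candidate";
}

console.log(classifyBrace("{_key=value}"));  // "emphasis-opener"
console.log(classifyBrace("{ _key=value}")); // "attribute-candidate"
```
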

jgm commented 9 months ago

I never did check the actual behavior. Nonetheless, this is a parsing ambiguity. We should at the very least document that the emphasis interpretation takes precedence, and maybe go further and disallow _ at the beginning of keys.

As for . inside keys, I'm open to that.

bpj commented 9 months ago

I think it's safer to disallow underscores at the start of keys, as it removes any ambiguity. I have a feeling that this will be a recurrent issue otherwise.

Does HTML allow underscores at the start of attribute names? Not that djot should be as bound to HTML as Pandoc's element types still largely are.

andersk commented 6 months ago

It does; HTML allows attribute names to consist of any characters other than controls, space, ", ', >, /, =, and noncharacters.