jgm / djot

A light markup language
https://djot.net
MIT License

Simpler inline container parsing precedence #124

Open hellux opened 1 year ago

hellux commented 1 year ago

The current syntax description states:

The basic principle governing “precedence” for inline containers is that the first opener that gets closed takes precedence. Containers can’t overlap, so once an opener gets closed, any potential openers between the opener and the closer get marked as regular text and can no longer open inline syntax.

So if we encounter e.g.

*_x

we cannot immediately know what to emit. We may need to read on until the end of the current block to know whether we should emit *_x, <strong>_x or <strong><em>x.
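To make the ambiguity concrete, here is a minimal Python sketch of the "first opener that gets closed wins" rule as quoted above. It is an illustration only, not the reference implementation: the function name `render_first_closed` and the `TAGS` table are made up for this example, only `*` and `_` are handled, and djot's other rules (whitespace around delimiters, escapes, HTML escaping) are ignored. Note that an opener's output slot has to stay provisional until the block ends.

```python
TAGS = {"*": "strong", "_": "em"}  # hypothetical subset: only * and _


def render_first_closed(text: str) -> str:
    """Simplified model of the current rule: the first opener closed wins."""
    out = []      # output pieces; an opener's slot may be rewritten later
    openers = []  # (delimiter, index into out) for unresolved openers
    for ch in text:
        if ch not in TAGS:
            out.append(ch)
            continue
        # Look for the nearest unresolved opener using the same delimiter.
        match = next(
            (i for i in range(len(openers) - 1, -1, -1) if openers[i][0] == ch),
            None,
        )
        if match is None:
            # Provisional: stays literal text unless a closer shows up later.
            openers.append((ch, len(out)))
            out.append(ch)
        else:
            _, slot = openers[match]
            out[slot] = f"<{TAGS[ch]}>"   # the earlier literal becomes a tag
            out.append(f"</{TAGS[ch]}>")
            # Openers between the pair can no longer open anything.
            del openers[match:]
    return "".join(out)


print(render_first_closed("*_x"))    # *_x
print(render_first_closed("*_x*"))   # <strong>_x</strong>
print(render_first_closed("*_x_*"))  # <strong><em>x</em></strong>
```

The three calls share the prefix `*_x` yet render it three different ways, so nothing can be emitted for that prefix until the rest of the block has been seen.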

An alternative would be to use the same approach as we do for block containers: always open on openers and close all inner containers whenever a parent container is closed. One would also close all inline containers when the block is exited, similar to how inline code spans are now implicitly closed in order to avoid backtracking.

This would allow us to immediately emit <strong><em>x when the above is encountered, without having to look further ahead. If a * is then encountered before a closing _, we close both the _ and the * containers.
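A rough Python sketch of this proposal, under the same simplifying assumptions (only `*` and `_`, no whitespace or escape rules, no HTML escaping; `render_eager` and `TAGS` are hypothetical names chosen for illustration):

```python
TAGS = {"*": "strong", "_": "em"}  # hypothetical subset: only * and _


def render_eager(text: str) -> str:
    """Simplified model of the proposal: delimiters act immediately."""
    out = []    # emitted output; never revised once appended
    stack = []  # currently open delimiters, outermost first
    for ch in text:
        if ch not in TAGS:
            out.append(ch)
        elif ch in stack:
            # Closing a container also closes everything opened inside it.
            while True:
                inner = stack.pop()
                out.append(f"</{TAGS[inner]}>")
                if inner == ch:
                    break
        else:
            stack.append(ch)
            out.append(f"<{TAGS[ch]}>")
    # Leaving the block implicitly closes whatever is still open.
    while stack:
        out.append(f"</{TAGS[stack.pop()]}>")
    return "".join(out)


print(render_eager("*_x"))    # <strong><em>x</em></strong>
print(render_eager("*_x*"))   # <strong><em>x</em></strong>
print(render_eager("a _ b"))  # a <em> b</em>
```

Here `out` is append-only, which is the point of the proposal. The last call also shows the trade-off: in this simplified model a lone `_` opens emphasis that runs to the end of the block.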

Some examples:

I think this would simplify the parsing and support goal 1: no backtracking. The precedence primarily affects the output of what I would consider "erroneous" syntax. To me, in terms of output, it does not seem very important which precedence is used as it is clearly an error anyway.

However, one case where it could be considered not an error is if an opener is unmatched, for example a single _. Currently it becomes simply regular text, but this change would make it an opener. It might be useful to treat it as regular text, but on the other hand it is also confusing if emphasis is suddenly added when another single _ happens to appear somewhere else in the block.

One thing that makes this proposed change less effective at simplifying the parsing is links and spans: both use a container that starts with [ and can contain arbitrary inline content. Not until after the closing ] does one know whether to emit an <a> or a <span>. One also needs to read the following [], () or {} container to know what attributes the opener should have. Avoiding this would require major changes to the syntax that may affect readability. Attributes could be placed before words/spans just like for blocks, e.g. {lang=en}word. For links, the URL/label could also be placed before the text, e.g. <url>[text], <~label>[text]. Not sure about the second one, but the first one seems quite intuitive given the autolink syntax.

The proposed change would help to avoid things like #125 where in {--- the {- gets grouped even when unmatched and the - cannot be used with the following hyphens to form an em-dash. The proposed change would simply make it an opener, unless escaped. If the { is escaped one can simply treat the { and - as separate characters and continue parsing to form the em-dash.

Containers that cannot contain inline content can be parsed as they are today, e.g. verbatim, emoji, footnote reference and attribute containers.

jgm commented 1 year ago

I think the unexpected effects of this would be pretty bad: unpaired * and _ characters aren't that uncommon. Certainly not worth it just to avoid #125, and we already have a non-backtracking algorithm for parsing emphasis.