jgm / djot

A light markup language
https://djot.net
MIT License

Escapes should either be fully preserved in or fully removed from AST #189

Open matklad opened 1 year ago

matklad commented 1 year ago

The following djot

key \= value

Gives this AST

str text="key "
str text="= value"

I think it should either be

str text="key = value"

or something like

str text="key "
esc text="="
str text=" value"

The first option (completely erasing escapes) seems more natural, but I think I have an argument for the second option.

As djot is extensible, certain filters might overlay extra semantics on top of djot syntax. For example, if I really don't want to add an extra line between the term and the definition in a : list, I might write a custom filter which splits the term on =. There would be a corner case: what if the term itself contains an =? A natural solution would be to escape it:

: E \= mc^2^ = one famous equation

But for that to work, the \ needs to be preserved in the AST.
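A minimal sketch of such a filter, assuming the second AST shape proposed above (a separate esc node between str nodes). The Node tuple and the split_term function are hypothetical names used only for illustration; they are not part of any djot implementation, and the superscript markup is left unparsed for brevity.

```python
from typing import List, NamedTuple, Optional, Tuple


class Node(NamedTuple):
    tag: str   # "str" or "esc" (hypothetical flat inline representation)
    text: str


def split_term(nodes: List[Node]) -> Tuple[str, Optional[str]]:
    """Split inline nodes at the first '=' found in a plain str node.

    An '=' carried by an esc node is treated as literal text and never splits.
    Returns (key, value); value is None when there is no unescaped '='.
    """
    key_parts: List[str] = []
    for i, node in enumerate(nodes):
        if node.tag == "str" and "=" in node.text:
            before, _, after = node.text.partition("=")
            key_parts.append(before)
            rest = after + "".join(n.text for n in nodes[i + 1:])
            return "".join(key_parts).strip(), rest.strip()
        key_parts.append(node.text)
    return "".join(key_parts).strip(), None


# The term "E \= mc^2^ = one famous equation" (superscript left unparsed).
term = [
    Node("str", "E "),
    Node("esc", "="),
    Node("str", " mc^2^ = one famous equation"),
]
key, value = split_term(term)
print(key)    # E = mc^2^
print(value)  # one famous equation
```

With the escape erased into a single str node, the same filter would have no way to tell the two = characters apart.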

jgm commented 1 year ago

True, that might be a reason. On the other hand, the notion of an esc is really specific to the source representation, and doesn't make much sense in an AST.

Re your idea for metadata: why not use symbols?

- :author: John M
- :title: The Book

Then you don't have to worry about splitting strings at all.

matklad commented 1 year ago

Yeah, that would work. Though, one problem with symbols is that they restrict the names of the keys. For example, if I want to do hierarchical keys, I can do

: author.first_name = Alex

but I can't

- :author.first_name: Alex

Not sure how big of a problem it is. On the one hand, it seems mostly like a non-issue to me. On the other hand, formats like TOML usually specifically add facilities to write keys-which-are-not-valid-idents.

Another case would be :автор:, which isn't a symbol, but maybe wants to be a metadata key?

matklad commented 1 year ago

and doesn't make much sense in an AST.

I am in a state of mind where I can see it both ways. On the one hand, yes, the very purpose of escape is to not be in the AST.

On the other hand, djot is extensible, and we already have some embedding of languages into djot: verbatim blocks embed programming languages, math embeds LaTeX. With filters and a formalized AST, we can actually generalize this idea and have some fragments which are djot, but also have some extra meaning. E.g., a filter that adds a new pair of emphasis characters or something. For these use cases, preserving escapes feels useful.

Although I am not sure we actually should support such non-verbatim embeddings -- the whole idea behind djot is that you don't need to invent custom syntaxes, because spans with attributes should be enough for anything...

jgm commented 1 year ago

I'm not sure how important it is to support this kind of kludge. Still, I see the value of the esc idea, and I'm on the fence.

matklad commented 1 year ago

I think if we add escapes to the AST, the principled generalization of that would be to require that the AST be lossless (i.e., require it to be a concrete syntax tree).

And, if I view it that way, it seems better to keep AST abstract and rely on matches for concrete stuff.

andersk commented 1 year ago

Zulip’s stream/topic link and user/group mention syntaxes are examples of custom markup features that would ideally be implemented as AST postprocessors respecting this str/esc distinction. (We might change the exact syntaxes a bit if we migrate to Djot, but I’m not sure if we’d want to change it drastically enough to comport with the span attribute syntax?)

The “completely erasing escapes” option would complicate what I imagine will be a common use case for the Djot AST. If you’re building a Djot editor with source and preview panes, you want the AST augmented with locations that help you map mouse clicks in the preview pane back to positions in the source pane, and that mapping needs an adjustment for every skipped backslash.

jgm commented 1 year ago

Well, actually, you could probably infer the presence of a backslash from the source locations we already have in the AST!

% djot -t astpretty -p
escaped\"quote
doc
  para (1:1:0-2:0:14)
    str (1:1:0-1:7:6) text="escaped"
    str (1:9:8-1:14:13) text="\"quote"

Note the gap at 1:8, which could only be caused by an escape.
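A rough sketch of that inference, assuming only that each node exposes the start and end offsets shown in the astpretty output (the third field of each position, end inclusive). Span and one_char_gaps are hypothetical names, not part of djot's API, and a one-character gap is treated here as a hint rather than proof of an escape.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int  # offset of the first character (third field of 1:9:8)
    end: int    # offset of the last character, inclusive
    text: str


def one_char_gaps(siblings: List[Span]) -> List[int]:
    """Return the offsets skipped between adjacent siblings when exactly one character is missing."""
    gaps = []
    for prev, cur in zip(siblings, siblings[1:]):
        if cur.start - prev.end == 2:
            gaps.append(prev.end + 1)
    return gaps


# The two str nodes from the astpretty output above.
nodes = [Span(0, 6, "escaped"), Span(8, 13, '"quote')]
print(one_char_gaps(nodes))  # [7]
```

Offset 7 corresponds to column 8 on line 1, which is where the backslash sits in the source.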

matklad commented 1 year ago

We might change the exact syntaxes a bit if we migrate to Djot

Wait, my favorite chat software is considering migrating to my favorite light markup language? Lovely! :-)

andersk commented 1 year ago

There are some edge cases where you can’t tell at present.

$ ./djot -t astpretty -p
x{}\@
doc
  para (1:1:0-2:0:5)
    str (1:1:0-1:1:0) text="x"
    str (1:5:4-1:5:4) text="@"
$ ./djot -t astpretty -p
x{ }@
doc
  para (1:1:0-2:0:5)
    str (1:1:0-1:1:0) text="x"
    str (1:5:4-1:5:4) text="@"
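To make the ambiguity concrete: both inputs above produce identical spans, so no function of the positions alone can tell them apart. The sketch below additionally consults the source text, which is one possible way around it. The escaped_starts helper is hypothetical, and it is only a heuristic: it has not been checked against every construct that can leave a gap ending in a backslash.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start offset, end offset), both inclusive


def escaped_starts(source: str, spans: List[Span]) -> List[int]:
    """Return start offsets of spans directly preceded by a backslash that lies in a gap."""
    hits = []
    prev_end = -1
    for start, end in spans:
        # The character before this span is not covered by the previous span...
        in_gap = start - 1 > prev_end
        # ...and it is a backslash in the original source.
        if in_gap and source[start - 1] == "\\":
            hits.append(start)
        prev_end = end
    return hits


# Both inputs yield the same spans in the output above: "x" at 0-0, "@" at 4-4.
spans = [(0, 0), (4, 4)]
print(escaped_starts("x{}\\@", spans))  # [4]
print(escaped_starts("x{ }@", spans))   # []
```

This only works for clients that still have the source at hand, which is part of the argument for recording the escape in the AST itself.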

vassudanagunta commented 1 year ago

Since source location and escape info are both meta-source info, maybe they belong together, along with any other such info that could be added down the road. For example, the * below explicitly indicates the presence of an escape char just prior to that range:

doc
  para (1:1:0-2:0:14)
    str (1:1:0-1:7:6) text="escaped"
    str (*1:9:8-1:14:13) text="\"quote"

The downside may also be an upside: disabling the emission of source locations would also disable the emission of escape info. The upside argument would be that a client either is source dependent or it is not.

And a variation of @matklad's idea, which keeps the esc info separate:

str (source loca) text="key "
esc (source loca) 
str (source loca) text="= value"

andersk commented 1 year ago

Wait, my favorite chat software is considering migrating to my favorite light markup language? Lovely! :-)

Yeah, we’ve been seriously looking at it. The main issues that came up in our evaluation are:

jgm commented 1 year ago

I'm definitely tempted to include an esc element in the AST, even though it's somewhat against the spirit of an AST. Arguably, though, so is distinguishing between spaces and softbreak.