jgm / djot

A light markup language
https://djot.net
MIT License

Consider revealing `:smile:` -> 😀 in the AST #62

Closed matklad closed 1 year ago

matklad commented 1 year ago

The following djot document:

Hello -- :smile:

produces the following ast:

{
  "footnotes": [],
  "references": [],
  "type": "doc",
  "children": [
    {
      "type": "para",
      "children": [
        {
          "type": "str",
          "text": "Hello "
        },
        {
          "type": "en_dash",
          "text": "--"
        },
        {
          "type": "str",
          "text": " "
        },
        {
          "type": "emoji",
          "text": ":smile:"
        }
      ]
    }
  ]
}

The problem here is that the mapping from smile to 😄 is implicit: a consumer of such an AST would have to replicate djot's emoji table. It would help to add "rendered" emojis to the output, even if that info is in some sense redundant.

Thinking more about this, maybe we don't even need dedicated AST nodes like emoji or en_dash? We can say that they are in fact str nodes, just with a raw attribute:

{
  "footnotes": [],
  "references": [],
  "type": "doc",
  "children": [
    {
      "type": "para",
      "children": [
        {
          "type": "str",
          "text": "Hello "
        },
        {
          "type": "str",
          "text": "–",
          "raw": "--"
        },
        {
          "type": "str",
          "text": " "
        },
        {
          "type": "str",
          "text": "😄",
          "raw": ":smile:"
        }
      ]
    }
  ]
}
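
The rewrite from the first tree to the second can be sketched in a few lines of Python. This is only an illustration: `SUBSTITUTIONS` and `normalize` are hypothetical names, not part of djot, and the table is deliberately tiny compared to a real emoji table.

```python
# Hypothetical, deliberately tiny substitution table (not djot's real one).
# Keys are (node type, source text); values are the rendered characters.
SUBSTITUTIONS = {
    ("en_dash", "--"): "\u2013",         # en dash
    ("emoji", ":smile:"): "\U0001F604",  # smiling face with smiling eyes
}

def normalize(node):
    """Return a new tree in which known emoji/en_dash nodes become
    str nodes carrying both the rendered `text` and the original `raw`."""
    key = (node.get("type"), node.get("text"))
    if key in SUBSTITUTIONS:
        return {"type": "str", "text": SUBSTITUTIONS[key], "raw": node["text"]}
    out = dict(node)  # shallow copy; the input tree is left untouched
    if "children" in out:
        out["children"] = [normalize(child) for child in out["children"]]
    return out

para = {"type": "para", "children": [
    {"type": "str", "text": "Hello "},
    {"type": "en_dash", "text": "--"},
    {"type": "str", "text": " "},
    {"type": "emoji", "text": ":smile:"},
]}
result = normalize(para)
```

Nodes whose text is not in the table (an unknown alias, for instance) pass through unchanged, so a consumer would still see the original `emoji` node in that case.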

There might be some terminological confusion here. In the literal syntax tree, we certainly have the type: "emoji" syntax node. But what we want from -a -j is probably not so much an AST as an abstract document model. So, syntactically :smile: is an emoji, but semantically it wants to be very close to 😄 (e.g., substituting :emoji: syntax with its Unicode equivalents shouldn't change the meaning of a djot document).

jgm commented 1 year ago

The reason I did it this way is that renderers that consume the AST might make different decisions about how to render the emoji. For example, a djot renderer might want :smile:, while in HTML you'd probably want 😄.

Another reason is to avoid requiring the parser to have the big emoji table. So, for example, :oeu: gets parsed as an emoji with alias oeu; the renderer will look this up, not find it, and have to figure out what to do.
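
A minimal sketch of that renderer-side lookup, in Python. The `EMOJI` table and `render_emoji_html` are hypothetical names, and falling back to the literal `:alias:` text is just one possible policy for unknown aliases, not necessarily what any djot renderer does.

```python
import html

# Hypothetical renderer-side table; a real renderer would ship a full one.
EMOJI = {"smile": "\U0001F604"}

def render_emoji_html(alias):
    """Render an emoji alias for HTML output.

    Known aliases become the Unicode character. Unknown aliases (such as
    "oeu") fall back to the escaped source form `:alias:`; this fallback
    is an assumption made for the example.
    """
    if alias in EMOJI:
        return EMOJI[alias]
    return html.escape(f":{alias}:")

known = render_emoji_html("smile")
unknown = render_emoji_html("oeu")
```

With this arrangement the parser only needs to recognise the `:alias:` syntax; the table lives entirely in the renderer.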

jgm commented 1 year ago

I can see why it would be convenient to have the entity resolved to unicode in the AST. (That would free the consumer of the need to do the substitution, and why not? since we have the entity table in djot.)

matklad commented 1 year ago

> might make different decisions about how to render the emoji.

This I think is addressed by having both text and raw.
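
As a rough Python illustration of how a renderer could pick between the two fields: `render_inline` and the target names "djot" and "html" are hypothetical, and a real djot writer would also need to escape djot's special characters, which this sketch skips.

```python
import html

def render_inline(node, target):
    """Render one str node for the given output target.

    For "djot" output, prefer the original source (`raw`) when present;
    for anything else, emit the HTML-escaped rendered `text`.
    """
    if target == "djot":
        return node.get("raw", node["text"])
    return html.escape(node["text"])

nodes = [
    {"type": "str", "text": "Hello "},
    {"type": "str", "text": "\u2013", "raw": "--"},
    {"type": "str", "text": " "},
    {"type": "str", "text": "\U0001F604", "raw": ":smile:"},
]
as_djot = "".join(render_inline(n, "djot") for n in nodes)
as_html = "".join(render_inline(n, "html") for n in nodes)
```

Here `as_djot` round-trips to the original source line, while `as_html` contains the substituted characters.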

> Another reason is to avoid requiring the parser to have the big emoji table

This sounds very convincing to me. And, if we don't substitute emojis, substituting em and en dashes, quotes, and ellipses doesn't make sense either (they are a drop in the bucket).

I guess what would help here is to have a canonical table with substitution as a part of the spec, so that:

> and why not? since we have the entity table in djot.)

I think a good argument against doing this is that it makes sure consumers can actually work with implementations that don't bundle an emoji table. That is, if the reference implementation does this because "why not?", then every other implementation would be pushed to do it as well.

jgm commented 1 year ago

I just pushed a change that does the resolution in the parser. But this can always be reverted. I see both sides of it. Note that we could still say that parser implementations aren't required to implement emoji lookup -- the renderer could still do its own lookup on the alias provided.
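
One way a consumer could cope with both kinds of parser is sketched below in Python. The node shape is an assumption made for illustration (an `alias` field plus an optional `resolved` field that only resolving parsers fill in); it is not taken from the actual change, and `RENDERER_EMOJI` and `emoji_text` are hypothetical names.

```python
# Hypothetical renderer-side table, used only when the parser did not resolve.
RENDERER_EMOJI = {"smile": "\U0001F604"}

def emoji_text(node, table=RENDERER_EMOJI):
    """Pick the text for an emoji node, in order of preference:

    1. the parser's own resolution, if the (assumed) `resolved` field is set;
    2. the renderer's lookup on the alias;
    3. the literal `:alias:` source form.
    """
    resolved = node.get("resolved")
    if resolved is not None:
        return resolved
    alias = node["alias"]
    return table.get(alias, f":{alias}:")

from_resolving_parser = emoji_text(
    {"type": "emoji", "alias": "smile", "resolved": "\U0001F604"}, table={})
from_plain_parser = emoji_text({"type": "emoji", "alias": "smile"})
unknown_alias = emoji_text({"type": "emoji", "alias": "oeu"})
```

The first call works even with an empty renderer table, which is the convenience argued for above; the second shows that a parser without the table remains usable.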