jgm / djot

A light markup language
https://djot.net
MIT License

Treat autolinks as leaves in the AST #105

Closed matklad closed 1 year ago

matklad commented 1 year ago

For <http://example.com/a_b> we produce an AST like this:

        {
          "tag": "url",
          "destination": "http://example.com/a_b",
          "children": [
            {
              "tag": "str",
              "text": "http://example.com/a_b"
            }
          ]
        }

and we emit original matches as:

https://github.com/jgm/djot/blob/2c0646f42e47c43c4ddaa28b0ad63a9d7da51107/djot/inline.lua#L232-L234

The url>str nesting seems superfluous; just a flat

self:add_match(starturl, endurl, "url") 

seems like it should be sufficient?

jgm commented 1 year ago

I don't remember whether I had a good reason for doing it this way. Perhaps I thought we could allow these to split over multiple lines, as we currently allow for URLs in regular links:

[My link text](http://example.com?product_number=234234234234
234234234234)

This would call for start and end, plus str and softbreak inside. Any thoughts on that?

Anyway, we could still make url a leaf in the AST, independent of the structure of the matches, and I think that's a good idea in either case.
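To illustrate making url a leaf regardless of how the matches are emitted, here is a minimal sketch in Python (not djot's actual code). `collapse_autolink` is a hypothetical helper, the node shape is taken from the JSON earlier in this thread, and dropping softbreaks when rejoining a split URL is an assumption.

```python
def collapse_autolink(node):
    """Return a leaf version of a nested url/email node (hypothetical helper)."""
    if node.get("tag") not in ("url", "email"):
        return node
    # Concatenate the str children. Softbreaks contribute nothing, which is
    # an assumption about how a URL split over several lines would be rejoined.
    text = "".join(
        child.get("text", "")
        for child in node.get("children", [])
        if child.get("tag") == "str"
    )
    # Prefer an existing destination; fall back to the joined text.
    return {"tag": node["tag"], "destination": node.get("destination", text)}


nested = {
    "tag": "url",
    "destination": "http://example.com/a_b",
    "children": [{"tag": "str", "text": "http://example.com/a_b"}],
}
leaf = collapse_autolink(nested)
```

The result keeps the tag and destination and has no children, so the AST shape no longer depends on whether the parser emitted one match or several.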

jgm commented 1 year ago

On second thought, there is a drawback I hadn't considered to collapsing things in the AST using get_string_content (which we use, e.g., to extract a string destination for a link from what may be several str and softbreak elements).

The drawback is that we use fine-grained source position information. The same is true of collapsing the contents of code blocks into a single string (which we also do now). Consider

> ```
> my code
>  is here
> ```

If we just say that the code block starts at line 1, column 3 and continues through line 4, column 5, we aren't recording the fact that not all of the characters between those two positions are part of the code block.

On balance it's probably better for the AST not to worry about this; one could extract it from the match objects if one wanted this kind of fine-grained positional information.
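The gap between a single start/end range and the real content can be seen in a small Python sketch. It is an illustration only: the variable names are made up, and stripping exactly two characters of blockquote prefix per line is an assumption that happens to fit this example.

```python
source = "> ```\n> my code\n>  is here\n> ```\n"
lines = source.splitlines()

# Everything between the fences, taken as one contiguous range of the source.
raw_span = "\n".join(lines[1:3])

# What the code block actually contains: the same lines with the "> "
# blockquote prefix removed (two characters, assumed for this example).
content = "\n".join(line[2:] for line in lines[1:3])
```

Here `raw_span` still carries the `>` markers while `content` does not, so a single range over the source cannot describe the code block's text exactly.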

jgm commented 1 year ago

One advantage of the current setup is that in the renderer we can just do

Renderer.url = Renderer.link

Renderer.email = Renderer.link

because the structures are identical.

matklad commented 1 year ago

Hm, but the structures would be identical either way? They'll both be nested or flat.

Thinking more about this, do we actually need to distinguish between url & email in the AST?

<http://example.com>
<aleksey.kladov@example.com>

I think these can be represented as

        {
          "tag": "url",
          "destination": "http://example.com"
        },
        {
          "tag": "softbreak"
        },
        {
          "tag": "url",
          "destination": "mailto:aleksey.kladov@example.com"
        }

The mailto: prefix in the destination seems sufficient to distinguish the two cases?
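As a rough sketch of that idea in Python (the function name `autolink_kind` is hypothetical, and the node shape is the one proposed just above), a consumer could recover the distinction from the destination alone:

```python
def autolink_kind(node):
    """Classify a flat autolink node by its destination (hypothetical helper)."""
    dest = node["destination"]
    # An email autolink is assumed to be stored with a "mailto:" prefix.
    return "email" if dest.startswith("mailto:") else "url"


nodes = [
    {"tag": "url", "destination": "http://example.com"},
    {"tag": "url", "destination": "mailto:aleksey.kladov@example.com"},
]
kinds = [autolink_kind(n) for n in nodes]
```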

jgm commented 1 year ago

Regular link nodes need a nested structure, because the link descriptions can contain formatting.

jgm commented 1 year ago

We could indeed just use link for all three cases, but some people have asked to retain the distinction between e.g. <me@example.com> and [me@example.com](mailto:me@example.com). I'm not sure.

matklad commented 1 year ago

> Regular link nodes need a nested structure, because the link descriptions can contain formatting.

Ah, sorry, I misunderstood you. Yeah, link I think should be different from autolink, but for autolinks, distinguishing between email and http url doesn't seem that useful.

jgm commented 1 year ago

The current AST has email and url as leaf nodes, so closing.

{
  "tag": "doc",
  "references": {},
  "footnotes": {},
  "children": [
    {
      "tag": "para",
      "children": [
        {
          "tag": "email",
          "text": "me@example.com"
        }
      ]
    },
    {
      "tag": "para",
      "children": [
        {
          "tag": "url",
          "text": "http://example.com"
        }
      ]
    }
  ]
}
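For readers consuming this AST, a minimal Python sketch of rendering the two leaf nodes might look like the following. `render_autolink` is a hypothetical function, not part of djot, and adding the `mailto:` prefix at render time is an assumption about how an email leaf would be turned into a link.

```python
from html import escape


def render_autolink(node):
    """Render an email or url leaf node as an HTML anchor (hypothetical helper)."""
    text = node["text"]
    # Assumption: email leaves store the bare address, so the scheme is added here.
    href = "mailto:" + text if node["tag"] == "email" else text
    return '<a href="{}">{}</a>'.format(escape(href, quote=True), escape(text))


email_html = render_autolink({"tag": "email", "text": "me@example.com"})
url_html = render_autolink({"tag": "url", "text": "http://example.com"})
```

Because both nodes are leaves with a single `text` field, one function covers both tags without walking any children.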