jgm / djot

A light markup language
https://djot.net
MIT License

Alternative implementations #60

Closed matklad closed 1 year ago

matklad commented 1 year ago

I've spent some time looking at various hypothetical alternative implementations. I didn't do anything practical, but I've learned a bunch, so hopefully this might be useful for someone.

The backstory here is that I'd love to access Djot from Deno, as that seems like the perfect runtime for rendering an extensible lite markup. I've used JS template literals for this in the past, and that's quite neat, and Deno's security model is also very appropriate for these kinds of converters.

Here's some options:

jgm commented 1 year ago

Lua regexes (actually "patterns"; they're not really regexes): we rely on them here because they're fast and included. I think that for a C implementation, one approach is to use re2c to generate C functions. That's what I do in cmark. Many of them could just be hand-coded.

Lua is small and easy to embed. So if you just want something in C, it's pretty trivial to use the existing djot code. In fact, if you do make djotbin it will compile a static executable via C. I don't know what more would be required to get wasm.

How stable? Not very; it's still fairly experimental.

I think that having a documented AST format is a good idea. In fact, we might want to change the way the AST is represented in the current code. Currently it's in the form of a Lua table that can't be directly translated to JSON, because Lua tables can mix numeric indices and string indices. Actually, there are a bunch of places in the code where I take advantage of this, and all of these make converting to JavaScript or other languages harder, so ideally they could be changed (though there might be some performance implications).
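
A rough JavaScript-side analogue of that problem, offered as a sketch rather than as djot's actual code: a JS array can carry string-keyed properties, much as a Lua table can mix numeric and string indices, and `JSON.stringify` silently drops them. The node shape and the `tag` property name below are hypothetical.

```typescript
// Hypothetical node: positional children plus a string-keyed property,
// loosely mimicking a Lua table that mixes numeric and string indices.
const mixed = Object.assign(["hello", "world"], { tag: "para" });

// A JSON array has no named fields, so only the indexed entries survive.
const encoded: string = JSON.stringify(mixed);

// Spelling the structure out as an object keeps both parts.
const explicit: string = JSON.stringify({
  tag: mixed.tag,
  children: [...mixed],
});

console.log(encoded);
console.log(explicit);
```

The second form is the kind of shape a JSON-friendly AST would need: every piece of information sits under an explicit key or at an explicit position, never both on the same container.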

For a JSON AST, I don't know if it's better to do something like

["para", {"id": "foo"}, [["str", "hello"]]]

or rather

[{"type": "para", "id": "foo", "children": [{"type": "str", "text": "hello"}]}]

matklad commented 1 year ago

For the JSON AST, I’d say the latter (with the type key) is significantly better:

It seems that the overriding principle here should be “how convenient is the AST to use directly”, rather than, say, compactness of encoding. And, when consuming from JavaScript, node.type is nicer than node[0].

Additionally, there are quite a few languages which don’t have ML-style records with anonymous fields (or don’t consider them idiomatic), and such languages would have to invent names for type and children anyway, so we might as well standardize them.

One problem with the latter representation, though, is that the user could use “type” or “children” as attribute names, which would be an annoying corner case. We could nest attrs under an .attr key, but node.id does look quite a bit nicer than node.attr.id.
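
To make the convenience argument concrete, here is a minimal sketch of the same text-collecting function written against both shapes. The type and function names (`TupleNode`, `ObjectNode`, `tupleText`, `objectText`) are made up for illustration, and the positional layout assumes every node has the form [type, attributes, children-or-text], which is slightly more regular than the example above.

```typescript
// Positional encoding: [type, attributes, children-or-text].
type TupleNode = [string, Record<string, string>, TupleNode[] | string];

// Keyed encoding: every field is named.
interface ObjectNode {
  type: string;
  text?: string;
  children?: ObjectNode[];
}

function tupleText(node: TupleNode): string {
  const payload = node[2]; // the reader has to remember what slot 2 holds
  return typeof payload === "string"
    ? payload
    : payload.map((child) => tupleText(child)).join("");
}

function objectText(node: ObjectNode): string {
  if (node.type === "str") return node.text ?? "";
  return (node.children ?? []).map((child) => objectText(child)).join("");
}

const asTuple: TupleNode = ["para", { id: "foo" }, [["str", {}, "hello"]]];
const asObject: ObjectNode = {
  type: "para",
  children: [{ type: "str", text: "hello" }],
};

console.log(tupleText(asTuple), objectText(asObject));
```

Both functions do the same work; the difference is that the keyed version documents itself, while the positional one relies on the consumer knowing the slot order.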

jgm commented 1 year ago

Yes: I would distinguish between things like level, which qualifies heading, and generic attributes, which any node could have and which will all be packed into attr. So, yes, we'd have attr = { id: "foo" }.
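
A small sketch of the collision and of the nested layout that avoids it. The `Heading` interface is an assumption about what such a node could look like, not the format djot emits; only `level` and `attr` come from the discussion above.

```typescript
// Attributes written by the user; nothing stops them from choosing "children".
const userAttrs: Record<string, string> = { id: "foo", children: "oops" };

// Flattened: attributes share a namespace with structural fields,
// so the user attribute replaces the real child list.
const flattened: Record<string, unknown> = {
  type: "para",
  children: [{ type: "str", text: "hello" }],
  ...userAttrs,
};

// Nested: structural fields and user attributes live in separate objects.
interface Heading {
  type: "heading";
  level: number; // qualifies the heading itself
  attr?: Record<string, string>; // generic, user-supplied attributes
  children: unknown[];
}

const nested: Heading = {
  type: "heading",
  level: 2,
  attr: userAttrs,
  children: [{ type: "str", text: "hello" }],
};

console.log(flattened.children); // the string "oops", not an array
console.log(nested.children.length, nested.attr?.children);
```

The cost is the extra hop (`node.attr.id` instead of `node.id`) mentioned earlier; the benefit is that no attribute name needs to be reserved.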

jgm commented 1 year ago

PS. Another option for the Lua patterns, I suppose, would be to write a C implementation of the pattern matching, or borrow it from Lua's code. But at this point, you start to think, wouldn't I also like some of the other things Lua provides, like tables? Wouldn't that be nicer than implementing linked lists and hash maps in C? And now the idea of just compiling Lua and including the current code as a bytestring looks pretty appealing. (I speak from experience, having written cmark in zero-dependency C.)

jgm commented 1 year ago

I've pushed support for a -j flag that can be combined with -a or -m to produce JSON output.

matklad commented 1 year ago

Looks great! I notice there are _keys entries. Are they intentional? Looks like perhaps not?

        {
          "t": "emph",
          "c": [
            {
              "t": "str",
              "s": "Hi"
            }
          ],
          "attr": {
            "key": "val",
            "_keys": [     <- this thing looks suspect
              "class",
              "id",
              "key"
            ],
            "class": "foo bar",
            "id": "me"
          }
        },

jgm commented 1 year ago

_keys is to keep track of the order of attributes, so that it can be rendered in a deterministic way. I could omit it from the AST, but it's needed for deterministic test output.

matklad commented 1 year ago

That’s an interesting semantic question: are attributes ordered? A natural answer is no, attributes are unordered, in which case _keys indeed doesn’t belong in the AST.
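
If attributes are treated as unordered, a renderer can still get deterministic output by imposing its own order at render time. A minimal sketch, assuming a plain string-to-string attribute map; `renderAttrs` and `escapeAttr` are hypothetical helpers, not part of djot, and the escaping covers only what a double-quoted HTML attribute value needs.

```typescript
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;");
}

function renderAttrs(attr: Record<string, string>): string {
  return Object.entries(attr)
    // Sort by key so the output does not depend on how the map was built.
    .sort(([left], [right]) => (left < right ? -1 : left > right ? 1 : 0))
    .map(([key, value]) => ` ${key}="${escapeAttr(value)}"`)
    .join("");
}

const renderedA = renderAttrs({ key: "val", class: "foo bar", id: "me" });
const renderedB = renderAttrs({ id: "me", class: "foo bar", key: "val" });

console.log(`<em${renderedA}>Hi</em>`);
console.log(renderedA === renderedB);
```

Sorting trades away source order: the HTML no longer reflects the order in which the author wrote the attributes, which may or may not matter for a given use.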

jgm commented 1 year ago

Yeah, I lean towards removing it.

matklad commented 1 year ago

After writing a simple consumer for the AST, one thing that struck me as a bit odd is the somewhat inconsistent abbreviations. s, t, c are single-letter, but level, lang and others are words. s and t were especially confusing for me, as it’s not immediately obvious which one is “text”.

I am mildly confident that using short words instead of single-letters would be better.

jgm commented 1 year ago

On _keys: I generate HTML from the AST and test against HTML, so losing deterministic output is a bit of a problem.

On the single-letter abbreviations: I started out with children and type and text, but found there was a significant performance degradation from the old AST. Going to the single-letter keys for these extremely common cases sped things up considerably and also removed a lot of the clutter in the AST so the content stood out more.

matklad commented 1 year ago

Hm, it seems like maybe the problem is that the current impl keeps the internal AST and the JSON AST in sync? What about keeping the old, maximally performant AST internally, and implementing a more indirect mapping to JSON, so that they can independently optimize for different things: the AST for speed, and the JSON for convenience of consumers?

jgm commented 1 year ago

Yes, I could change the names in producing JSON. (And in fact I've just added some code that omits keys beginning with _, to get rid of _keys from the JSON AST while still being able to use it in generating HTML.)
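
The "omit keys beginning with _" idea is easy to picture in JavaScript terms. This is only an analogue of the Lua change described above, not the djot code: `dropPrivate` is a hypothetical replacer function passed to the standard `JSON.stringify`.

```typescript
// Replacer for JSON.stringify: returning undefined omits the property.
function dropPrivate(key: string, value: unknown): unknown {
  return key.startsWith("_") ? undefined : value;
}

// Internal node, with a private bookkeeping field next to the real attributes.
const internalNode = {
  t: "emph",
  c: [{ t: "str", s: "Hi" }],
  attr: { key: "val", _keys: ["key"] },
};

const publicJson: string = JSON.stringify(internalNode, dropPrivate);

console.log(publicJson);
console.log(internalNode.attr._keys); // still available to the HTML renderer
```

The in-memory structure is untouched, so code that renders HTML can keep using the private field while JSON consumers never see it.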

I'd have to be persuaded it's a good idea, though. As I mentioned, I found the AST easier to survey with the short letters, because the content:form ratio improves.

jgm commented 1 year ago

I pushed a change that uses longer names. Not sure about type, though; that might cause problems in some languages.

matklad commented 1 year ago

Yeah, I look at json via | jq | less, so I don’t perceive verbosity difference. My gut feeling is that more time will be spent looking at the code, manipulating the ast, rather than on JSON itself.

Another argument here would be that single-letter field names go against most style guides, which would force some consumers to rename on parse.

> Not sure about type

Uhu… what about tag? Shorter and in some sense more precise?

jgm commented 1 year ago

tag is good, I've changed that.
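
For illustration, this is roughly the rename-on-parse shim a consumer of the earlier single-letter keys would have had to write. `expandKeys` and `LONG_NAMES` are hypothetical names; the mapping uses the long names settled on in this thread (tag, text, children) and assumes user attributes live under attr.

```typescript
const LONG_NAMES: Record<string, string> = {
  t: "tag",
  s: "text",
  c: "children",
};

function expandKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => expandKeys(item));
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      // "attr" holds user-chosen names, so its contents are left untouched.
      out[LONG_NAMES[key] ?? key] = key === "attr" ? inner : expandKeys(inner);
    }
    return out;
  }
  return value;
}

const expanded: string = JSON.stringify(
  expandKeys({ t: "emph", c: [{ t: "str", s: "Hi" }], attr: { t: "x" } }),
);

console.log(expanded);
```

With the long names emitted directly, this whole layer disappears from every consumer.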

matklad commented 1 year ago

Started rust impl at https://github.com/matklad/djot-rs

This is not serious yet, might run out of steam before it actually works!

jgm commented 1 year ago

Proof of concept of a C library embedding the Lua code is in the clib directory.

jgm commented 1 year ago

OK, now I even have a demo of the code running in a browser (emscripten).

jgm commented 1 year ago

The playground at https://djot.net/playground/ is now running my wasm-compiled code.

jgm commented 1 year ago

I've used Lua metatables to make the JSON output deterministic and allow users to use text, tag, and children transparently when working with the AST.
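
The Lua code is not shown in this thread, so the following is only an analogy: JavaScript's `Proxy` plays a role similar to a metatable's `__index`, letting long names resolve to compactly stored fields. The alias table and the `withLongNames` helper are assumptions made for this sketch.

```typescript
// Long, user-facing names mapped to the short keys used for storage.
const ALIASES: Record<string, string> = {
  tag: "t",
  text: "s",
  children: "c",
};

function withLongNames<T extends object>(node: T): T & Record<string, unknown> {
  return new Proxy(node, {
    get(target, prop, receiver) {
      // Redirect known long names; pass everything else through unchanged.
      const key = typeof prop === "string" ? ALIASES[prop] ?? prop : prop;
      return Reflect.get(target, key, receiver);
    },
  }) as T & Record<string, unknown>;
}

const compact = withLongNames({ t: "str", s: "Hi" });

console.log(compact.text); // read through the alias
console.log(compact.s); // the short key still works
```

Storage stays compact while calling code reads `node.text`, which is the kind of transparency described above.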

Note to self: use this approach for attributes to avoid the need for _keys.

matklad commented 1 year ago

Another nascent Rust impl is here: https://git.sr.ht/~kmaasrud/djr

jgm commented 1 year ago

I'm going to close this issue. Anyone who is writing an alternative implementation should comment over in Discussions!

jgm commented 1 year ago

I've made a TypeScript implementation: https://github.com/jgm/djot.js