Alternative implementations

matklad commented 1 year ago

I've spent some time looking at various hypothetical alternative implementations. I didn't do anything practical, but I've learned a bunch, so hopefully this might be useful for someone.

The backstory here is that I'd love to access Djot from Deno, as that seems like the perfect runtime for rendering an extensible lite markup. I've used JS template literals for this in the past, and that's quite neat, and Deno's security model is also very appropriate for these kinds of converters.

Here's some options:

- The main benefit of an alternative impl would be that you'd be able to massage the output programmatically in the language of your choice (if you need just .html, it's always possible to shell out). To achieve that, we need to actually define the AST model, so that alternative impls don't just export whatever internal representation they have, but share the general shape of the API. I think a good AST model is already present in the Lua impl; it needs to be documented in an abstract form in the spec (and we can add a canonical JSON encoding for it, see https://github.com/jgm/djot/issues/58).
- We could (and, long term, absolutely should) provide a native implementation in something like C, Rust or Zig. Given that djot is a small, nicely organized code-base, this shouldn't be much trouble. I see only two potential snags:
  - How stable is djot? It would be no fun to chase upstream from Rust. It seems like it should be pretty stable.
  - The Lua implementation relies on Lua's regexes, and bringing in a full regex engine for a native impl seems like overkill. So, either some amount of manual code-uglification is required, or some compilering to implement a proc-macro or some such to transform find!("^[*+-] %[[Xx ]%]%s") into an inline automaton at compile time. Or maybe just bring in regex for the first version and leave a todo :)
- If we had a, say, C impl, compiling that to Wasm and exposing it to node & deno would be trivial.
- Can we just derive a bunch of implementations from a unified grammar? From what I understand of how those things actually work in practice, no, not really.
- Lua and JavaScript seem sufficiently close (e.g., both have regexes built in) that manual "transpiling" of .lua to .js might make sense? Perhaps long-term Wasm would be strictly better than .js, but in today's world .js can be operationally easier, so why not?
- Lua is implemented in C, so we can compile Lua itself to Wasm, and then interpret djot in Wasm. That seems like the most horrible, but also the easiest way to get going without rewriting everything. And Lua-to-Wasm is how the playground works.

Sadly, there are a couple of problems on that path. The fundamental one is that neither browsers nor Deno support just importing a Wasm module. Instead, you need to do a dance of getting a Uint8Array from somewhere and then manually instantiating it. The way this typically works is that the wasm bytes are fetched from some server, but then it's very much not a self-contained library. This fetching is what wasmoon, the library used by the playground, is doing.

An alternative approach, friendlier to consumers, is to embed the .wasm as a base64 string directly into the source code. I think this is what should be done for this approach, but, as far as I can tell, no one has actually done this for luajit so far? This approach is also not great in the sense that the loading would block the JS event loop.
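To make the base64 idea concrete, here is a minimal sketch in TypeScript. It does not embed Lua or djot: the base64 string below is a tiny hand-assembled module that only exports an `add` function, standing in for whatever Emscripten would produce. The names `WASM_BASE64`, `decodeBase64` and `instantiateEmbedded` are made up for illustration, and it assumes a runtime with the `atob` and `WebAssembly` globals (Deno, browsers, recent Node).

```typescript
// Stand-in payload: a 41-byte module exporting `add(a: i32, b: i32) -> i32`.
// A real build would paste the base64 of the compiled Lua + djot .wasm here.
const WASM_BASE64 = "AGFzbQEAAAABBwFgAn9/AX8DAgEABwcBA2FkZAAACgkBBwAgACABags=";

// Decode base64 into raw bytes using the standard `atob` global.
function decodeBase64(b64: string): Uint8Array {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Synchronous compile + instantiate: no fetch is needed, but the
// event loop is blocked while the module compiles.
function instantiateEmbedded(b64: string): WebAssembly.Instance {
  const wasmModule = new WebAssembly.Module(decodeBase64(b64));
  return new WebAssembly.Instance(wasmModule, {});
}

const instance = instantiateEmbedded(WASM_BASE64);
const add = instance.exports.add as (a: number, b: number) => number;
console.log(add(2, 3));
```

Swapping the synchronous constructors for `WebAssembly.instantiate(bytes)` would avoid blocking, at the cost of making the library's entry point async.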

So yeah, the next step for this approach would be to re-create what wasmoon did with compiling Lua with emcc (Emscripten), embed the result (together with the .lua files for djot) into a .js file, and write the required glue code to specialize the wasm runtime to a Lua interpreter and djot parser!

jgm commented 1 year ago

Lua regexes (actually "patterns"; they're not really regexes): we rely on them here because they're fast and included. I think that for a C implementation, one approach is to use re2c to generate C functions. That's what I do in cmark. Many of them could just be hand-coded.
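As a rough illustration of what "hand-coded" could look like, here is a sketch in TypeScript (rather than C) of a matcher for the pattern quoted earlier in the thread, "^[*+-] %[[Xx ]%]%s". The function name `isTaskListMarker` is invented for this example, and the whitespace set assumes Lua's `%s` class in the default C locale.

```typescript
// Hand-coded equivalent of the Lua pattern "^[*+-] %[[Xx ]%]%s":
// a bullet, a space, "[", one of "X", "x" or " ", "]", then one whitespace character.
function isTaskListMarker(line: string): boolean {
  if (line.length < 6) return false;
  const bullet = line[0];
  if (bullet !== "*" && bullet !== "+" && bullet !== "-") return false;
  if (line[1] !== " " || line[2] !== "[") return false;
  const mark = line[3];
  if (mark !== "X" && mark !== "x" && mark !== " ") return false;
  if (line[4] !== "]") return false;
  // Assumed contents of Lua's %s class: space, \t, \n, \v, \f, \r.
  return " \t\n\v\f\r".indexOf(line[5]) >= 0;
}

console.log(isTaskListMarker("- [x] write docs"));
console.log(isTaskListMarker("- [y] not a task"));
```

Each pattern turns into a handful of character comparisons like this, which is tedious but needs no regex engine.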

Lua is small and easy to embed. So if you just want something in C, it's pretty trivial to use the existing djot code. In fact, if you do make djotbin it will compile a static executable via C. I don't know what more would be required to get wasm.

How stable? Not very; it's still fairly experimental.

I think that having a documented AST format is a good idea. In fact, we might want to change the way the AST is represented in the current code. Currently it's in the form of a Lua table that can't be directly translated to JSON, because Lua tables can mix numeric indices and string indices. Actually, there are a bunch of places in the code where I take advantage of this, and all of these make converting to JavaScript or other languages harder, so ideally they could be changed (though there might be some performance implications).

For a JSON AST, I don't know if it's better to do something like

["para", {"id": "foo"}, [["str", "hello"]]]

or rather

[{"type": "para", "id": "foo", "children": [{"type": "str", "text": "hello"}]}]

matklad commented 1 year ago

For the JSON AST, I’d say the latter (with the type key) is significantly better:

It seems that the overriding principle here should be “how convenient is the AST to use directly”, rather than, say, compactness of encoding. And, when consuming from JavaScript, node.type is nicer than node[0].

Additionally, there are quite a few languages which don’t have ML-style records with anonymous fields (or don’t consider them idiomatic), and such languages would have to invent names for type and children anyway, so we might as well standardize them.

One problem with the latter representation, though, is that the user could use “type” or “children” as attribute names, which would be an annoying corner case. We could nest attrs under an .attr key, but node.id does look quite a bit nicer than node.attr.id.

jgm commented 1 year ago

Yes: I would distinguish between things like level, which qualifies heading, and generic attributes, which any node could have and which will all be packed into attr. So, yes, we'd have attr = { id: "foo" }.
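For readers following along, the shape being agreed on here might look roughly like this in TypeScript. This is only a sketch of the discussion so far, not the actual djot AST definition: the interface name `AstNode` and the exact set of optional fields are assumptions.

```typescript
// Hypothetical node shape: structural fields at the top level,
// user-supplied attributes packed under `attr`.
interface AstNode {
  type: string;                   // e.g. "para", "heading", "str"
  attr?: Record<string, string>;  // generic attributes any node may carry
  children?: AstNode[];
  text?: string;                  // for leaf nodes such as "str"
  level?: number;                 // qualifies "heading" only
}

const heading: AstNode = {
  type: "heading",
  level: 2,
  // An attribute literally named "type" cannot clobber node.type.
  attr: { id: "foo", type: "note" },
  children: [{ type: "str", text: "hello" }],
};

console.log(heading.type, heading.attr?.type);
```

Nesting costs one extra hop (`node.attr.id` instead of `node.id`), but it removes the corner case where a user attribute collides with a structural key.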

jgm commented 1 year ago

PS. Another option for the Lua patterns, I suppose, would be to write a C implementation of the pattern matching, or borrow it from Lua's code. But at this point, you start to think, wouldn't I also like some of the other things Lua provides, like tables? Wouldn't that be nicer than implementing linked lists and hash maps in C? And now the idea of just compiling Lua and including the current code as a bytestring looks pretty appealing. (I speak from experience, having written cmark in zero-dependency C.)

jgm commented 1 year ago

I've pushed support for a -j flag that can be combined with -a or -m to produce JSON output.

matklad commented 1 year ago

Looks great! I notice there are _keys entries. Are they intentional? Looks like perhaps not?

        {
          "t": "emph",
          "c": [
            {
              "t": "str",
              "s": "Hi"
            }
          ],
          "attr": {
            "key": "val",
            "_keys": [     <- this thing looks suspect
              "class",
              "id",
              "key"
            ],
            "class": "foo bar",
            "id": "me"
          }
        },

jgm commented 1 year ago

_keys is to keep track of the order of attributes, so that it can be rendered in a deterministic way. I could omit it from the AST, but it's needed for deterministic test output.

matklad commented 1 year ago

That’s an interesting semantic question: are attributes ordered? A natural answer is no, attributes are unordered, in which case _keys indeed doesn’t belong in the AST.
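If attributes are treated as unordered, one way a renderer could still produce deterministic output is to sort the keys at render time. This is just one possible approach, sketched in TypeScript, and not what the Lua implementation does; `renderAttrs` and `escapeAttr` are hypothetical helpers.

```typescript
// Escape the characters that would break a double-quoted HTML attribute value.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;");
}

// Render attributes in sorted key order, so the output does not depend
// on the order in which keys were inserted into the object.
function renderAttrs(attr: Record<string, string>): string {
  return Object.keys(attr)
    .sort()
    .map((key) => ` ${key}="${escapeAttr(attr[key])}"`)
    .join("");
}

console.log(`<em${renderAttrs({ key: "val", class: "foo bar", id: "me" })}>Hi</em>`);
```

The trade-off is that source order is lost in the output, which matters only if someone considers attribute order meaningful.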

jgm commented 1 year ago

Yeah, I lean towards removing it.

matklad commented 1 year ago

After writing a simple consumer for the AST, one thing which struck me as a bit odd is the somewhat inconsistent abbreviations. s, t, c are single-letter, but level, lang and others are words. s and t were especially confusing for me, as it’s not immediately obvious which one is “text”.

I am mildly confident that using short words instead of single-letters would be better.

jgm commented 1 year ago

On _keys: I generate HTML from the AST and test against HTML, so losing deterministic output is a bit of a problem.

On the single-letter abbreviations: I started out with children and type and text, but found there was a significant performance degradation from the old AST. Going to the single-letter keys for these extremely common cases sped things up considerably and also removed a lot of the clutter in the AST so the content stood out more.

matklad commented 1 year ago

Hm, it seems like maybe the problem is that the current impl keeps the internal AST and the JSON AST in sync? What about keeping the old, maximally performant AST in, and implementing a more indirect mapping to JSON, so that they can independently optimize for different things: the AST for speed, and the JSON for convenience of consumers?

jgm commented 1 year ago

Yes, I could change the names in producing JSON. (And in fact I've just added some code that omits keys beginning with _, to get rid of _keys from the JSON AST while still being able to use it in generating HTML.)

I'd have to be persuaded it's a good idea, though. As I mentioned, I found the AST easier to survey with the short letters, because the content:form ratio improves.
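The "omit keys beginning with _" rule is easy to mirror on the consumer side too. A minimal TypeScript sketch using the standard `JSON.stringify` replacer callback follows; the function name `toPublicJson` is invented here, and this is not the Lua code referred to above.

```typescript
// Serialize a value to JSON, skipping every property whose name starts with "_".
// The replacer is also called once for the root with key "", which is kept.
function toPublicJson(value: unknown): string {
  return JSON.stringify(
    value,
    (key: string, val: unknown) => (key.startsWith("_") ? undefined : val),
    2,
  );
}

const node = {
  t: "emph",
  c: [{ t: "str", s: "Hi" }],
  attr: { key: "val", _keys: ["class", "id", "key"], class: "foo bar", id: "me" },
};

console.log(toPublicJson(node));
```

Returning `undefined` from the replacer drops the property entirely, so internal bookkeeping stays available in memory without leaking into the serialized AST.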

jgm commented 1 year ago

I pushed a change that uses longer names. Not sure about type, though; that might cause problems in some languages.

matklad commented 1 year ago

Yeah, I look at JSON via | jq | less, so I don’t perceive the verbosity difference. My gut feeling is that more time will be spent looking at the code manipulating the AST than at the JSON itself.

Another argument here would be that single-letter field names go against most style guides, which would force some consumers to rename on parse.

> Not sure about type

Uhu… what about tag? Shorter and in some sense more precise?

jgm commented 1 year ago

tag is good, I've changed that.
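For illustration, the "indirect mapping" idea from earlier in the thread (short keys internally, descriptive keys for consumers) could look something like the following TypeScript sketch. It is not how the Lua implementation handles this; the `InternalNode` and `PublicNode` types and the `toPublic` function are assumptions made for the example.

```typescript
// Compact internal form, as in the early JSON output quoted above.
interface InternalNode {
  t: string;
  c?: InternalNode[];
  s?: string;
}

// Consumer-facing form with the names settled on in this thread.
interface PublicNode {
  tag: string;
  children?: PublicNode[];
  text?: string;
}

// Recursively rename t -> tag, c -> children, s -> text.
function toPublic(node: InternalNode): PublicNode {
  const out: PublicNode = { tag: node.t };
  if (node.s !== undefined) out.text = node.s;
  if (node.c !== undefined) out.children = node.c.map(toPublic);
  return out;
}

console.log(JSON.stringify(toPublic({ t: "emph", c: [{ t: "str", s: "Hi" }] })));
```

A conversion pass like this costs one extra tree walk at export time, which is the price of letting the two representations evolve independently.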

matklad commented 1 year ago

Started rust impl at https://github.com/matklad/djot-rs

This is not serious yet, might run out of steam before it actually works!

jgm commented 1 year ago

Proof of concept of a C library embedding the Lua code is in the clib directory.

jgm commented 1 year ago

OK, now I even have a demo of the code running in a browser (emscripten).

jgm commented 1 year ago

The playground at https://djot.net/playground/ is now running my wasm-compiled code.

jgm commented 1 year ago

I've used lua metatables to make the JSON output deterministic and allow users to use text, tag, and children transparently when working with the AST.

Note to self: use this approach for attributes to avoid the need for _keys.

matklad commented 1 year ago

Another nascent Rust impl is here: https://git.sr.ht/~kmaasrud/djr

jgm commented 1 year ago

I'm going to close this issue. Anyone who is writing an alternative implementation should comment over in Discussions!

jgm commented 1 year ago

I've made a typescript implementation: https://github.com/jgm/djot.js

Alternative implementations #60