Closed matklad closed 1 year ago
Lua regexes (actually "patterns"; they're not really regexes): we rely on them here because they're fast and included. I think that for a C implementation, one approach is to use re2c to generate C functions. That's what I do in cmark. Many of them could just be hand-coded.
Lua is small and easy to embed. So if you just want something in C, it's pretty trivial to use the existing djot code. In fact, if you do make djotbin, it will compile a static executable via C. I don't know what more would be required to get wasm.
How stable? Not very; it's still fairly experimental.
I think that having a documented AST format is a good idea. In fact, we might want to change the way the AST is represented in the current code. Currently it's in the form of a Lua table that can't be directly translated to JSON, because Lua tables can mix numeric indices and string indices. Actually, there are a bunch of places in the code where I take advantage of this, and all of these make converting to JavaScript or other languages harder, so ideally they could be changed (though there might be some performance implications).
For a JSON AST, I don't know if it's better to do something like
["para", {"id": "foo"}, [["str", "hello"]]]
or rather
[{"type": "para", "id": "foo", "children": [{"type": "str", "text": "hello"}]}]
For the JSON AST, I’d say the latter (with the type key) is significantly better:
It seems that the overriding principle here should be “how convenient is the AST to use directly”, rather than, say, compactness of encoding. And, when consuming from JavaScript, node.type is nicer than node[0].
Additionally, there are quite a few languages which don’t have ML-style records with anonymous fields (or don’t consider them idiomatic), and such languages would have to invent names for type and children anyway, so we might as well standardize them.
One problem with the latter representation, though, is that the user could use “type” or “children” as attribute names, which would be an annoying corner case. We could nest attrs under an .attr key, but node.id does look quite a bit nicer than node.attr.id.
Yes: I would distinguish between things like level, which qualifies heading, and generic attributes, which any node could have and which will all be packed into attr. So, yes, we'd have attr = { id: "foo" }.
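To make the shape discussed above concrete, here is a minimal TypeScript sketch of a node type where node-specific fields such as level sit on the node and generic attributes are nested under attr. The interface name AstNode, the helper plainText, and the sample data are hypothetical; the key names follow what is proposed at this point in the thread, not a finalized format.

```typescript
// Hypothetical node shape: structural keys live on the node itself,
// while user-supplied attributes are kept apart under `attr`.
interface AstNode {
  type: string;
  children?: AstNode[];
  text?: string;
  level?: number; // node-specific field, e.g. for headings
  attr?: Record<string, string>; // generic attributes any node may carry
}

// Because attributes are nested, a user attribute called "type"
// cannot collide with the structural `type` key.
const heading: AstNode = {
  type: "heading",
  level: 2,
  attr: { id: "foo", type: "note" },
  children: [{ type: "str", text: "hello" }],
};

// Concatenate the text of a node and all of its descendants.
function plainText(node: AstNode): string {
  const own = node.text ?? "";
  const nested = (node.children ?? []).map(plainText).join("");
  return own + nested;
}
```

The cost is the one already noted: consumers write heading.attr?.id rather than heading.id.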
PS. Another option for the Lua patterns, I suppose, would be to write a C implementation of the pattern matching, or borrow it from Lua's code. But at this point, you start to think, wouldn't I also like some of the other things Lua provides, like tables? Wouldn't that be nicer than implementing linked lists and hash maps in C? And now the idea of just compiling Lua and including the current code as a bytestring looks pretty appealing. (I speak from experience, having written cmark in zero-dependency C.)
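As an illustration of what hand-coding one of these patterns might look like, here is a sketch for a task-list-marker pattern such as ^[*+-] %[[Xx ]%]%s. It is written in TypeScript rather than C for brevity; the function names are hypothetical and this is not code from djot or cmark.

```typescript
// Lua's %s class matches the C isspace() characters in the default locale.
function isLuaSpace(ch: string): boolean {
  return " \t\n\v\f\r".includes(ch);
}

// Hand-coded equivalent of the anchored Lua pattern ^[*+-] %[[Xx ]%]%s.
// Returns the index just past the match, or null if there is no match at `pos`.
function matchTaskListMarker(line: string, pos = 0): number | null {
  const bullet = line[pos];
  if (bullet !== "*" && bullet !== "+" && bullet !== "-") return null; // [*+-]
  if (line[pos + 1] !== " ") return null; // literal space
  if (line[pos + 2] !== "[") return null; // %[
  const mark = line[pos + 3];
  if (mark !== "X" && mark !== "x" && mark !== " ") return null; // [Xx ]
  if (line[pos + 4] !== "]") return null; // %]
  const ws = line[pos + 5];
  if (ws === undefined || !isLuaSpace(ws)) return null; // %s
  return pos + 6;
}
```

Each pattern item becomes one comparison, which is roughly the kind of code a generator like re2c would emit for a pattern this simple.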
I've pushed support for a -j flag that can be combined with -a or -m to produce JSON output.
Looks great! I notice there are _keys entries. Are they intentional? Looks like perhaps not?
{
  "t": "emph",
  "c": [
    {
      "t": "str",
      "s": "Hi"
    }
  ],
  "attr": {
    "key": "val",
    "_keys": [    <- this thing looks suspect
      "class",
      "id",
      "key"
    ],
    "class": "foo bar",
    "id": "me"
  }
},
_keys is to keep track of the order of attributes, so that they can be rendered in a deterministic way. I could omit it from the AST, but it's needed for deterministic test output.
That’s an interesting semantic question: are attributes ordered? A natural answer is no, attributes are unordered, in which case _keys indeed doesn’t belong in the AST.
Yeah, I lean towards removing it.
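If attributes are treated as unordered, one way a renderer could still produce deterministic output is to sort the keys at render time instead of storing an order in the AST. The sketch below assumes a plain string-to-string attribute map; renderAttributes and escapeAttr are hypothetical names, and this is not how the Lua implementation does it.

```typescript
// Escape the characters that are unsafe inside a double-quoted HTML attribute.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render attributes in sorted key order so the output does not depend on
// the (unspecified) order in which the parser happened to store them.
function renderAttributes(attr: Record<string, string>): string {
  return Object.keys(attr)
    .sort()
    .map((key) => ` ${key}="${escapeAttr(attr[key])}"`)
    .join("");
}
```

With the attributes from the example above, this yields class, then id, then key, regardless of insertion order.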
After writing a simple consumer for the AST, one thing which struck me as a bit odd is the somewhat inconsistent abbreviations: s, t, c are single-letter, but level, lang and others are words. s and t were especially confusing for me, as it’s not immediately obvious which one is “text”.
I am mildly confident that using short words instead of single letters would be better.
On _keys: I generate HTML from the AST and test against HTML, so losing deterministic output is a bit of a problem.
On the single-letter abbreviations: I started out with children and type and text, but found there was a significant performance degradation from the old AST. Going to the single-letter keys for these extremely common cases sped things up considerably and also removed a lot of the clutter in the AST so the content stood out more.
Hm, it seems like maybe the problem is that the current impl keeps the internal AST and the JSON AST in sync? What about keeping the old, maximally performant AST, and implementing a more indirect mapping to JSON, so that they can independently optimize for different things: the AST for speed, and JSON for convenience of consumers?
Yes, I could change the names in producing JSON. (And in fact I've just added some code that omits keys beginning with _, to get rid of _keys from the JSON AST while still being able to use it in generating HTML.)
I'd have to be persuaded it's a good idea, though. As I mentioned, I found the AST easier to survey with the short letters, because the content:form ratio improves.
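For illustration, the indirect mapping discussed here (short internal keys, long names in the exported JSON, and no keys beginning with _) could look roughly like the following TypeScript sketch. The function toPublicJson and the exact key mapping are assumptions made for the example; the real implementation is in Lua and may differ.

```typescript
// Assumed mapping from short internal keys to exported names.
const LONG_NAMES = new Map<string, string>([
  ["t", "tag"],
  ["c", "children"],
  ["s", "text"],
]);

// Copy a value for export: drop keys starting with "_" and expand short
// structural keys. Keys inside `attr` are user data and are never renamed.
function toPublicJson(value: unknown, rename = true): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => toPublicJson(item, rename));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      if (key.startsWith("_")) continue; // internal bookkeeping such as _keys
      const outKey = rename ? LONG_NAMES.get(key) ?? key : key;
      out[outKey] = toPublicJson(inner, rename && key !== "attr");
    }
    return out;
  }
  return value;
}
```

This keeps the internal representation free to optimize for speed while consumers see only the longer, self-describing names.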
I pushed a change that uses longer names. Not sure about type, though; that might cause problems in some languages.
Yeah, I look at JSON via | jq | less, so I don’t perceive a verbosity difference. My gut feeling is that more time will be spent looking at the code that manipulates the AST than at the JSON itself.
Another argument here would be that single-letter field names go against most style guides, which would force some consumers to rename on parse.
Not sure about type
Uhu… what about tag? Shorter and in some sense more precise?
tag is good, I've changed that.
Started rust impl at https://github.com/matklad/djot-rs
This is not serious yet, might run out of steam before it actually works!
Proof of concept of a C library embedding the Lua code is in the clib directory.
OK, now I even have a demo of the code running in a browser (emscripten).
The playground at https://djot.net/playground/ is now running my wasm-compiled code.
I've used Lua metatables to make the JSON output deterministic and allow users to use text, tag, and children transparently when working with the AST.
Note to self: use this approach for attributes to avoid the need for _keys.
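For readers more familiar with JavaScript than Lua, the closest analog to this metatable trick is probably a Proxy that redirects long property names to short stored keys. The sketch below is only an analogy under that assumption; withLongNames and the alias table are hypothetical and do not reflect the actual Lua code.

```typescript
// Assumed aliases: long public names mapped to short stored keys.
const ALIASES = new Map<string, string>([
  ["tag", "t"],
  ["children", "c"],
  ["text", "s"],
]);

// Wrap a node so that reads of `tag`, `children`, or `text` are served from
// `t`, `c`, or `s`, similar in spirit to an __index metamethod in Lua.
function withLongNames<T extends object>(node: T): T & Record<string, unknown> {
  return new Proxy(node, {
    get(target, prop, receiver) {
      const key = typeof prop === "string" ? ALIASES.get(prop) ?? prop : prop;
      return Reflect.get(target, key, receiver);
    },
  }) as T & Record<string, unknown>;
}
```

The stored object stays compact, and callers can use either spelling when reading.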
Another nascent Rust impl is here: https://git.sr.ht/~kmaasrud/djr
I'm going to close this issue. Anyone who is writing an alternative implementation should comment over in Discussions!
I've made a typescript implementation: https://github.com/jgm/djot.js
I've spent some time looking at various hypothetical alternative implementations. I didn't do anything practical, but I've learned a bunch, so hopefully this might be useful for someone.
The backstory here is that I'd love to access Djot from Deno, as that seems like the perfect runtime for rendering an extensible lite markup. I've used JS template literals for this in the past, and that's quite neat, and Deno's security model is also very appropriate for these kinds of converters.
Here are some options:
.html, it's always possible to shell out). To achieve that, we need to actually define the AST model, so that alternative impls don't just export whatever internal representation they have, but share the general shape of the API. I think a good AST model is already present in the Lua impl; it needs to be documented in an abstract form in the spec (and we can add a canonical JSON encoding for it: https://github.com/jgm/djot/issues/58).
We could (and, long term, absolutely should) provide a native implementation in something like C, Rust or Zig. Given that djot is a small, nicely organized code-base, this shouldn't be much trouble. I see only two potential snags:
find!("^[*+-] %[[Xx ]%]%s") into an inline automaton at compile time. Or maybe just bring in regex for the first version and leave a todo :)
If we had a, say, C impl, compiling that to Wasm and exposing it to node & deno would be trivial.
.lua to .js might make sense? Perhaps long-term Wasm would be strictly better than .js, but in today's world .js can be operationally easier, so why not?
Lua is implemented in C, so we can compile Lua itself to Wasm, and then interpret djot in Wasm. That seems like the most horrible, but also the easiest, way to get going without rewriting everything. And Lua compiled to Wasm is how the playground works.
Sadly, there are a couple of problems on that path. The fundamental thing is that neither browsers nor deno support just importing a Wasm module. Instead, you need to do a dance of getting a Uint8Array from somewhere and then manually instantiating it. The way this typically works is that the wasm bytes are fetched from some server, but that's very much not a self-contained library then. This fetching is what wasmoon, the library used by the playground, is doing.
An alternative approach, friendlier for consumers, is to embed the .wasm as a base64 string directly into the source code (example). This I think is what should be done for this approach, but, as far as I can tell, no one has actually done this for luajit so far? This approach is also somewhat not great, in the sense that the loading would block the JS event loop.
So yeah, the next step for this approach would be to re-create what wasmoon did with compiling lua with emcc (Emscripten), embed the result (together with the .lua files for djot) into a .js file, and write the required glue code to specialize the wasm runtime to the Lua interpreter and djot parser!
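The embedding step can be sketched as follows. This is a minimal TypeScript illustration, assuming a runtime with the standard atob and WebAssembly globals (browsers, Deno, recent Node). The base64 string here is just the 8-byte empty Wasm module, standing in for a real compiled Lua interpreter, and the names WASM_BASE64, decodeBase64 and instantiateEmbedded are hypothetical.

```typescript
// Placeholder payload: base64 of the smallest valid module ("\0asm" + version 1).
// A real build would paste the base64 of the Emscripten output here instead.
const WASM_BASE64 = "AGFzbQEAAAA=";

// Decode a base64 string into raw bytes without any network access.
function decodeBase64(b64: string) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Compile and instantiate synchronously. For a large module this blocks the
// event loop, which is the drawback mentioned above.
function instantiateEmbedded(imports: WebAssembly.Imports = {}): WebAssembly.Instance {
  const wasmModule = new WebAssembly.Module(decodeBase64(WASM_BASE64));
  return new WebAssembly.Instance(wasmModule, imports);
}
```

A consumer would call instantiateEmbedded() once at load time. The asynchronous WebAssembly.instantiate(bytes, imports) avoids blocking during compilation, at the cost of an async API; the base64 decoding itself still runs synchronously.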