Specification for AST - GitHub issues

matklad commented 1 year ago

TL;DR: AST should be specified in the reference. I think the best way to do that is via TypeScript type notation.

I've noticed that markup languages can fail extensibility in two ways:

by not having generic syntax which can be used for semantic extension points (markdown)
by marrying extensibility API to a particular implementation.

An example of the latter is Asciidoctor. Although, like djot, it has a generic block structure at the syntax level, the way to extend Asciidoctor is by writing plugins against a specific Asciidoctor implementation. Thus, you get extensions of a particular toolchain, not extensions of a particular syntax.

I think the way to combat that is to specify AST structure which must be common across all implementations. That way, if extensibility is expressed as AST -> AST transform, you can mix and match readers, filters, and writers (provided that AST can be serialized as data).

This I think is a somewhat underappreciated idea, so my primary goal here is, by having a "here's the AST" section in the reference, to encourage people to implement djot tools in terms of the AST, so that things like djot_parser_in_rust paper.djot | djot2pdf_in_haskell just work. The secondary goal is of course to make sure that separate implementations agree not only on the HTML, but on the AST as well.

How do we define AST? I think "ast is JSON" is a good start. JSON is ubiquitous, and is a good match for "scripting" languages, which I think are most natural for doing filters and writers. The problem with JSON is that, as far as I know, there's no uncontroversial way to specify or "type" JSON.

The official answer is JSON Schema, but it's objectively unfit for human consumption. What I've found to work much better in practice are just TypeScript definitions (this comes from my experience with LSP). So, practically, I would consider adding a djot_ast.d.ts file, written in a reasonable subset of TypeScript, as a part of the spec, along these lines:

type Node = Doc | Para | Str;

interface Doc {
  tag: "doc";
  children: Node[];
  references: Record<string, Reference>;
  footnotes: Record<string, Footnote>;
}

interface Para {
  tag: "para";
  children: Node[];
}

interface Str {
  tag: "str";
  text: string;
}
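To make the "AST -> AST transform" idea concrete, here is a minimal sketch of a filter written purely against types like the ones above. It is only an illustration under assumptions: the union is named `AstNode` rather than `Node` (to avoid clashing with the DOM's global `Node` type), `Doc` omits `references` and `footnotes` for brevity, and `mapAst` and `shout` are hypothetical names, not part of any djot implementation.

```typescript
// Simplified copies of the sketched types, so the example is self-contained.
type AstNode = Doc | Para | Str;

interface Doc {
  tag: "doc";
  children: AstNode[];
}

interface Para {
  tag: "para";
  children: AstNode[];
}

interface Str {
  tag: "str";
  text: string;
}

// A filter is a pure AST -> AST function applied to one node.
type Filter = (node: AstNode) => AstNode;

// Rebuild the tree bottom-up, applying `f` to every node.
function mapAst(node: AstNode, f: Filter): AstNode {
  if (node.tag === "str") {
    return f(node);
  }
  return f({ ...node, children: node.children.map((c) => mapAst(c, f)) });
}

// Hypothetical filter: upper-case every string node.
const shout: Filter = (node) =>
  node.tag === "str" ? { ...node, text: node.text.toUpperCase() } : node;

// The input could come from any reader that emits the AST as JSON.
const input: AstNode = JSON.parse(
  '{"tag":"doc","children":[{"tag":"para","children":[{"tag":"str","text":"hello"}]}]}'
);
const out = mapAst(input, shout);
console.log(JSON.stringify(out));
```

Because the filter only sees plain JSON-shaped data, it does not care which parser produced the tree or which writer consumes the result.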

eproxus commented 1 year ago

Another alternative way to define a schema that would become JSON is to use Cue: https://cuelang.org/docs/usecases/datadef/

jgm commented 1 year ago

Here's a definition of the AST using TypeScript notation:

https://github.com/jgm/djot.js/blob/main/src/ast.ts#L4

I don't know if that's the sort of thing you had in mind, @matklad

matklad commented 1 year ago

Yup, that looks lovely!

I would suggest adding some form of that to https://djot.net, to:

actually specify the AST, and not only HTML output
make it more obvious to the readers that djot goes beyond HTML, and that you can do whatever with it

In terms of specific things:

Because TypeScript is structural, I think the following two are equivalent, and the second one looks somewhat more readable to me.

interface Section extends HasAttributes, HasChildren<Block> {
  tag: "section";
}

interface Section extends HasAttributes {
  tag: "section";
  children: Block[];
}

(and we do that for List anyway). There's also the angle that, if we treat that as a specification of the AST, then it's beneficial to keep to "dumb" TypeScript, and inheriting from a generic interface is a bit indirect.

List.start is nullable, but it seems we can fill it during parsing with a default value? That is, I think in the current impl start: 1 and start: undefined are two ways to say the same thing, and we can avoid that?
{Right,Left}{Single,Double}Quote, {Em,En}Dash feel like they maybe don't pull their weight as separate types. They all have the same shape: substituting one string for another.
interface Symb extends HasAttributes { tag: "symbol"; } -- name of the type and the tag name is inconsistent. Wants to be tag: "symb" perhaps?
children: (Term | Definition)[]; I think that's the right way to represent this in the AST, but it maybe makes sense to add a comment that there's at most one term/definition, and that the term is first.
type AstNode = ... -- not sure that Footnote and Reference belong there, as they are not children.
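On the first point (the two `Section` declarations), a small sketch can check the structural-equivalence claim at compile time. `HasAttributes`, `HasChildren` and `Block` here are simplified stand-ins rather than the real djot.js definitions, and the two interface names and the `Equal` helper are made up for the comparison.

```typescript
// Simplified stand-ins for the helper types used in the real AST.
type Attributes = Record<string, string>;

interface HasAttributes {
  attributes?: Attributes;
}

interface HasChildren<T> {
  children: T[];
}

interface Block {
  tag: string;
}

// Variant 1: children come from a generic helper interface.
interface SectionViaGeneric extends HasAttributes, HasChildren<Block> {
  tag: "section";
}

// Variant 2: children are spelled out directly.
interface SectionExplicit extends HasAttributes {
  tag: "section";
  children: Block[];
}

// Resolves to `true` only if each type is assignable to the other.
type Equal<A, B> = [A] extends [B] ? ([B] extends [A] ? true : false) : false;
const sameShape: Equal<SectionViaGeneric, SectionExplicit> = true;

// Values flow freely between the two declarations without casts.
const a: SectionViaGeneric = { tag: "section", children: [{ tag: "para" }] };
const b: SectionExplicit = a;
const c: SectionViaGeneric = b;
console.log(sameShape, c.children.length);
```

If the two declarations ever diverged, the `sameShape` line would stop compiling, so the choice between them is about readability only.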

bpj commented 1 year ago

there's at most one term/definition

Is there now? So this works like the LaTeX itemize environment rather than like HTML definition lists or Pandoc/Markdown definition lists? Then maybe it should have another name ("itemiz{e,ation}"?) even if it is rendered with <dl> in HTML, not only because a term can have multiple definitions, but because the name "definition list" comes with expectations that it works like an HTML definition list. The term "definition list" seems to be HTML-specific; presumably they had some reason not to call it "glossary". I for one tend to use (Pandoc, HTML) "definition lists" mostly for general itemization rather than glossaries, and I'm probably not alone.

jgm commented 1 year ago

There's also the angle that, if we treat that as a specification of the AST, then it's beneficial to keep to "dumb" TypeScript, and inheriting from a generic interface is a bit indirect.

That makes sense. I'm also up for putting it on the website, but I want to fine-tune the AST a bit first.

List.start is nullable, but it seems we can fill it during parsing with a default value? That is, I think in the current impl start: 1 and start: undefined are two ways to say the same thing, and we can avoid that?

Not exactly. Bullet lists, for example, simply don't have a start attribute, and it would be confusing to add one with the value 1.

We could have separate types for OrderedList and BulletList, as pandoc does in its AST. I don't know if that would be better. I was thinking of making DefinitionList its own type. (And maybe TaskList.)

{Right,Left}{Single,Double}Quote, {Em,En}Dash feel like they maybe don't pull their weight as separate types. They all have the same shape: substituting one string for another.

Yes, these are a bit weird and I'd been thinking of consolidating them. We do want to keep both the original text (e.g. straight quote) and an annotation like left_single_quote that can be used by the renderer, but it could be something like

{ tag: "smart_punctuation",
  character: "left_single_quote",
  text: "'" }
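As a rough sketch of how a consolidated node like this could be typed and consumed: only `left_single_quote` appears in the example above, so the other `character` values are assumptions that follow the same naming pattern, and `render` and `typographic` are hypothetical names.

```typescript
// Assumed set of annotations, mirroring the six existing node types.
type PunctuationKind =
  | "left_single_quote"
  | "right_single_quote"
  | "left_double_quote"
  | "right_double_quote"
  | "em_dash"
  | "en_dash";

interface SmartPunctuation {
  tag: "smart_punctuation";
  character: PunctuationKind; // annotation for the renderer
  text: string; // original source text, e.g. a straight quote
}

// One lookup table replaces six separate node types.
const typographic: Record<PunctuationKind, string> = {
  left_single_quote: "\u2018",
  right_single_quote: "\u2019",
  left_double_quote: "\u201C",
  right_double_quote: "\u201D",
  em_dash: "\u2014",
  en_dash: "\u2013",
};

// A renderer can use the annotation or fall back to the original text.
function render(node: SmartPunctuation, smart: boolean): string {
  return smart ? typographic[node.character] : node.text;
}

const quote: SmartPunctuation = {
  tag: "smart_punctuation",
  character: "left_single_quote",
  text: "'",
};
console.log(render(quote, true), render(quote, false));
```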

interface Symb extends HasAttributes { tag: "symbol"; } -- name of the type and the tag name is inconsistent. Wants to be tag: "symb" perhaps?

Probably should be, yes. It was originally Symbol but then I realized this is a native JS type.

children: (Term | Definition)[]; I think that's the right way to represent this in the AST, but it maybe makes sense to add a comment that there's at most one term/definition, and that the term is first.

Is there any way to enforce this in the types?

I'm a bit unhappy about this one, as well as the way we include a Caption as one of the children of a table, along with the Rows. One could make a case for something like

{ tag: "table",
  children: Row[],
  caption: Inline[] }

But with the current system children is the only thing we ever have to recurse into in the nodes, and that simplifies traversals and other things.
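The trade-off can be illustrated with a small sketch. The node shapes are simplified stand-ins (a single loose `AstNode` interface instead of the real union), and `walk`, `collectText` and `TableWithCaptionField` are hypothetical names.

```typescript
// Loose stand-in for an AST node: only `children` is ever recursed into.
interface AstNode {
  tag: string;
  text?: string;
  children?: AstNode[];
}

// Generic traversal that knows nothing about individual node types.
function walk(node: AstNode, visit: (n: AstNode) => void): void {
  visit(node);
  for (const child of node.children ?? []) {
    walk(child, visit);
  }
}

// Collect the text of every node that carries some.
function collectText(root: AstNode): string[] {
  const texts: string[] = [];
  walk(root, (n) => {
    if (n.text !== undefined) texts.push(n.text);
  });
  return texts;
}

// Current design: the caption is just another child of the table.
const captionAsChild: AstNode = {
  tag: "table",
  children: [
    { tag: "caption", children: [{ tag: "str", text: "Results" }] },
    { tag: "row", children: [] },
  ],
};

// Alternative design: the caption lives in its own field.
interface TableWithCaptionField extends AstNode {
  caption: AstNode[];
}

const captionAsField: TableWithCaptionField = {
  tag: "table",
  children: [{ tag: "row", children: [] }],
  caption: [{ tag: "str", text: "Results" }],
};

console.log(collectText(captionAsChild)); // the caption text is visited
console.log(collectText(captionAsField)); // the caption text is skipped
```

With the separate field, every generic traversal would need a special case for tables in order to reach the caption.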

type AstNode = ... -- not sure that Footnote and Reference belong there, as they are not children.

Yes, I think I added them recently because I needed handlers for them in the pandoc module. We could alternatively invent a new type that includes AstNode and these. However, Footnote and Reference are AST nodes, even though they don't go in children: they go in footnotes and references, which are fields of the Doc element.

@bpj the way definition lists currently work, there can only be one definition (it's just everything after the first paragraph, which is treated as the term). I think that's probably okay for most purposes. Segmenting into multiple definitions would require a different syntax; if this is desirable, we should open a new issue to discuss it.
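On the earlier question of whether the types can enforce "exactly one term, then one definition": one option might be a tuple type for `children`. This is only a sketch. The tag names and the `Inline` and `Block` stand-ins are assumptions, not the actual djot.js definitions.

```typescript
// Minimal stand-ins for inline and block content.
interface Inline {
  tag: "str";
  text: string;
}

interface Block {
  tag: "para";
  children: Inline[];
}

interface Term {
  tag: "term";
  children: Inline[];
}

interface Definition {
  tag: "definition";
  children: Block[];
}

// A tuple fixes both the number of children and their order.
interface DefinitionListItem {
  tag: "definition_list_item";
  children: [Term, Definition];
}

const ok: DefinitionListItem = {
  tag: "definition_list_item",
  children: [
    { tag: "term", children: [{ tag: "str", text: "djot" }] },
    { tag: "definition", children: [] },
  ],
};

// @ts-expect-error: a definition may not come before its term.
const wrongOrder: DefinitionListItem = { tag: "definition_list_item", children: [ok.children[1], ok.children[0]] };

console.log(ok.children[0].tag, wrongOrder.children.length);
```

The tuple is still an array at runtime, so traversals that recurse into `children` keep working unchanged.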

jgm commented 1 year ago

Is there a way to leverage the typescript type checking to produce a program that will validate a JSON document for conformity to the AST?

The djot CLI tool in djot.js will read -f ast, but it will happily accept a malformed one.

matklad commented 1 year ago

0.7 confidence, but, as far as I know, not really. You need to write “deserialization” code yourself, and, last time I looked, lsp impl for vscode (which has the same problem) did just that. TS type system is fully static, there’s nothing in compiled code to do runtime validation.

Two bad options are:

including tsc as a library and a runtime dependency
as the ast is somewhat uniform, we can at build time generate deserialization boilerplate.
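For a sense of what the hand-written "deserialization" code looks like, here is a minimal sketch of runtime type guards for a three-node subset of the AST. The subset and the names `isRecord` and `isAstNode` are illustrative assumptions. A real validator would have to cover every node type, which is the boilerplate the second option would generate.

```typescript
// Three-node subset of the AST, as sketched earlier in the thread.
type AstNode = Doc | Para | Str;

interface Doc {
  tag: "doc";
  children: AstNode[];
}

interface Para {
  tag: "para";
  children: AstNode[];
}

interface Str {
  tag: "str";
  text: string;
}

// True for plain objects (not null, not arrays).
function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Type guard: checks at runtime what the static types only promise.
function isAstNode(v: unknown): v is AstNode {
  if (!isRecord(v)) return false;
  switch (v.tag) {
    case "str":
      return typeof v.text === "string";
    case "para":
    case "doc":
      return Array.isArray(v.children) && v.children.every(isAstNode);
    default:
      return false;
  }
}

const good: unknown = JSON.parse('{"tag":"para","children":[{"tag":"str","text":"hi"}]}');
const bad: unknown = JSON.parse('{"tag":"para","children":[{"tag":"str"}]}');
console.log(isAstNode(good), isAstNode(bad));
```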

matklad commented 1 year ago

That LSP thing:

https://github.com/microsoft/vscode-languageserver-node/blob/c91c2f89e0a3d8aa8923355a65a2977b2b3d3b57/types/src/main.ts#L224


jgm commented 1 year ago

Argh, was afraid of that. I'm used to Haskell which is more serious about its types. I added a Makefile target to create a json schema from the typescript definitions, using typescript-json-schema. So this might be one path to automatic validation, though I haven't yet figured out how to use the schema. Also, the schema doesn't seem to be entirely accurate: it doesn't indicate which properties are optional.

jgm commented 1 year ago

OK, figured out how to validate using jsonschema; it seems that the default treatment is that all properties are optional; you must specify that they are required explicitly. I'll fiddle with the options in typescript-json-schema. [fixed this issue]

jgm commented 1 year ago

validate.js:

// Validate a djot AST (JSON read from stdin) against the generated schema.
const fs = require("fs");
const { Validator } = require("jsonschema");

const v = new Validator();
const schema = JSON.parse(fs.readFileSync("djot-schema.json", "utf8"));
const input = JSON.parse(fs.readFileSync("/dev/stdin", "utf8"));
const errs = v.validate(input, schema).errors;
if (errs.length === 0) {
  console.log("Valid");
  process.exit(0);
} else {
  // Print one message per violation and exit with a failure status.
  for (const err of errs) {
    console.log(err.stack);
  }
  process.exit(1);
}

jgm commented 1 year ago

Yes, see above, I'd already tried typescript-json-schema and it seems to work. Including validation in the cli program would require depending on something like jsonschema -- not sure about that yet.

bpj commented 1 year ago

I find this indispensable when writing/tuning JSON schemas: https://json-schema.org/understanding-json-schema/index.html Just make sure that you follow the specification which your tools understand. I believe Draft-7 should be safe in most cases.

matklad commented 1 year ago

I find this indispensable when writing/tuning JSON schemas:

uhu, and that’s why I think it makes more sense to use TypeScript for the spec: that’s much more readable. Though, we should have JSON schema as well, because a) people would ask for that and b) it has accumulated a bit more tooling on top.

bpj commented 1 year ago

True, JSON Schema gets hairy pretty quickly if you want to be more specific, but such is the price of precision in any language: the more precise, the more conditions. I would agree that JSON Schema is a bit on the verbose side. Its way of referencing definitions in the same schema in particular is annoyingly verbose! I actually cheat by writing my schemas in YAML and using my own interpolation engine (e.g. ⁅name⁆ gets expanded to '#/$defs/name') to get a cleaner working experience, then converting to JSON for deployment. At least I think that YAML looks cleaner than JSON, with a clearly hierarchical structure, fewer quotes and brackets, etc.
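The interpolation step could be as small as the following sketch. It is a guess based only on the single example above (⁅name⁆ becoming '#/$defs/name'), not the actual engine, and `expandRefs` is a hypothetical name.

```typescript
// Expand every ⁅name⁆ placeholder into a quoted JSON Schema reference.
function expandRefs(source: string): string {
  // A function replacer avoids any special handling of "$" in the output.
  return source.replace(/⁅([^⁅⁆]+)⁆/g, (_match, name: string) => `'#/$defs/${name}'`);
}

const yamlLine = "$ref: ⁅inline⁆";
console.log(expandRefs(yamlLine));
```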

bpj commented 1 year ago

Forgot to say I agree there should be a JSON schema because of its greater portability.

jgm commented 1 year ago

that’s why I think it makes more sense to use TypeScript for the spec: that’s much more readable. Though, we should have JSON schema as well, because a) people would ask for that and b) it has accumulated a bit more tooling on top.

The approach I outline above, using typescript-json-schema, lets us have it both ways. A human-readable specification in typescript format, from which we can generate and publish a json schema that people can use for programmatic validation.

clbarnes commented 1 year ago

Not sure if this is the right place, but is it a goal for djot to move towards a more-or-less full representation of pandoc's AST? i.e. is djot to pandoc AST what asciidoc is to docbook?

jgm commented 1 year ago

No, djot's AST is djot-specific. However, it is possible to convert between djot's and pandoc's ASTs. The conversion isn't lossless because the ASTs are a bit different (e.g. djot allows attributes on every element).

bpj commented 1 year ago

IMO the conversion to Pandoc AST should wrap non div/span elements with attributes in a div/span which holds the attributes, as I believe Pandoc does with commonmark_x. @jgm would an issue (or even a pull request) for this be welcome?

jgm commented 1 year ago

sure.

jgm commented 1 year ago

This still wouldn't give us lossless conversion, unless we adopted a convention like adding a "wrapper" class to the div, so it could be recognized and stripped off in converting from pandoc AST to djot.
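A sketch of how such a convention could round-trip, using deliberately simplified stand-ins for both ASTs: `DjotNode` and `PandocNode` are not the real djot or pandoc types, only divs are handled (no spans), and the class name `wrapper` is just the placeholder mentioned above.

```typescript
type Attrs = Record<string, string>;

// Simplified stand-ins for nodes in the two ASTs.
interface DjotNode {
  tag: string;
  attributes?: Attrs;
  children: DjotNode[];
}

interface PandocNode {
  t: string;
  classes: string[];
  attrs: Attrs;
  content: PandocNode[];
}

const WRAPPER = "wrapper";

// Non-div elements that carry attributes get wrapped in a marked div.
function toPandoc(node: DjotNode): PandocNode {
  const inner: PandocNode = {
    t: node.tag,
    classes: [],
    attrs: {},
    content: node.children.map(toPandoc),
  };
  const attrs = node.attributes;
  if (attrs === undefined) return inner;
  if (node.tag === "div") return { ...inner, attrs };
  return { t: "div", classes: [WRAPPER], attrs, content: [inner] };
}

// A marked div with a single child is stripped and its attributes restored.
function fromPandoc(node: PandocNode): DjotNode {
  const only = node.content[0];
  const isWrapper =
    node.t === "div" && node.classes.indexOf(WRAPPER) !== -1 && node.content.length === 1;
  if (isWrapper && only !== undefined) {
    return { ...fromPandoc(only), attributes: node.attrs };
  }
  const children = node.content.map(fromPandoc);
  return Object.keys(node.attrs).length > 0
    ? { tag: node.t, attributes: node.attrs, children }
    : { tag: node.t, children };
}

const para: DjotNode = { tag: "para", attributes: { id: "intro" }, children: [] };
const wrapped = toPandoc(para);
const restored = fromPandoc(wrapped);
console.log(wrapped.t, restored.tag, restored.attributes);
```

A div that a user genuinely tagged with the same class would be stripped too, which is why the marker would need to be an agreed, reserved name.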

bpj commented 1 year ago

The wrapping could be made optional.

jgm / djot

Specification for AST #95