hellux / jotdown

A Djot parser library
MIT License

Add an AST #17

Open hellux opened 1 year ago

hellux commented 1 year ago

It is often useful to work with an AST rather than a sequence of events. We could implement an optional module that provides AST objects that correspond to the AST defined by the djot spec (https://github.com/jgm/djot.js/blob/main/src/ast.ts).

It would be useful to be able to create it from events, and create events from the AST so you can e.g. parse events -> create ast -> modify ast -> create events -> render events.

It could also be useful to read/write the AST from/to e.g. JSON. We may then be able to read/write ASTs identically to the reference implementation. It might also be useful in tests to match against JSON produced by the reference implementation. We should be able to automatically implement the serialization/deserialization using serde, and then the downstream client can use any serde-compatible format.

A quick sketch of what it could look like:

#[cfg(feature = "ast")]
pub mod ast {
    use super::Event;

    use std::collections::HashMap as Map;

    #[cfg(feature = "serde")]
    use serde::{Deserialize, Serialize};

    #[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
    pub struct Doc {
        children: Vec<Block>,
        references: Map<String, Reference>,
        footnotes: Map<String, Footnote>,
    }

    #[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
    pub struct Reference {
        // todo
    }

    #[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
    pub struct Footnote {
        // todo
    }

    #[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
    pub struct Block {
        kind: BlockKind,
        children: Vec<Block>,
    }

    #[cfg_attr(feature = "serde", derive(Deserialize, Serialize))]
    pub enum BlockKind {
        Para,
        Heading { level: usize },
        // todo
    }

    pub struct Iter<'a> {
        // todo
        _s: std::marker::PhantomData<&'a ()>,
    }

    impl<'a> Iterator for Iter<'a> {
        type Item = Event<'a>;

        fn next(&mut self) -> Option<Self::Item> {
            todo!()
        }
    }

    #[derive(Debug)]
    pub enum Error {
        EventNotEnded,
        UnexpectedStart,
        BlockInsideLeaf,
    }

    impl<'s> FromIterator<Event<'s>> for Result<Doc, Error> {
        fn from_iter<I: IntoIterator<Item = Event<'s>>>(events: I) -> Self {
            todo!()
        }
    }

    impl<'a> IntoIterator for &'a Doc {
        type Item = Event<'a>;
        type IntoIter = Iter<'a>;

        fn into_iter(self) -> Self::IntoIter {
            todo!()
        }
    }
}

Client side:

let src = "# heading

para";

let events = jotdown::Parser::new(src);
let ast = events.collect::<Result<jotdown::ast::Doc, _>>().unwrap();
let json = serde_json::to_string(&ast).unwrap();

assert_eq!(
    json,
    r##"
    {
      "tag": "doc",
      "references": {},
      "footnotes": {},
      "children": [
        {
          "tag": "para",
          "children": [
            {
              "tag": "str",
              "text": "para"
            }
          ]
        }
      ]
    }
    "##
);

clbarnes commented 1 year ago

I was going to suggest basing such an AST on the output of typify for the json-schema generated from typescript definitions in djot.js, but typify doesn't parse it.

Having an internal AST like this, as well as being able to consume and produce it in JSON form, would allow the use of jotdown as a library to write filters as standalone binaries:

djot -t json mydoc.dj | myrustbinary | djot -f json > index.html

hellux commented 1 year ago

> I was going to suggest basing such an AST on the output of typify for the json-schema generated from typescript definitions in djot.js, but typify doesn't parse it.

It would be nice if the AST types could be generated automatically. The only work needed would be to convert between AST and events.

> Having an internal AST like this, as well as being able to consume and produce it in JSON form, would allow the use of jotdown as a library to write filters as standalone binaries:
>
> djot -t json mydoc.dj | myrustbinary | djot -f json > index.html

If one wants to manipulate an AST, I guess jotdown (which is mainly a parser) is not really needed here. Just need some AST types that can be serialized and deserialized.

bdarcus commented 1 year ago

I tried two additional conversion tools:

  1. quicktype, both with typescript and json schema input
  2. typester, which isn't intended to be used in production

Neither of them parsed (or at least completed), so I am wondering if there's something funky about that AST definition?

> If one wants to manipulate an AST, I guess jotdown (which is mainly a parser) is not really needed here. Just need some AST types that can be serialized and deserialized.

So in a scenario like this, jotdown would just be able to output the same AST as djot.js, and any filtering would be done independently?

I want to implement citation and bibliography processing using djot for a project I'm working on, once John adds support for citations, so I'm just wondering how that might work.

hellux commented 1 year ago

> Neither of them parsed, so am wondering if there's something funky about that ast definition?

It might be worth creating an issue on the jgm/djot issue tracker. The definition could perhaps be modified to be more parser friendly, if it happens to have a very uncommon structure.

> So in a scenario like this, jotdown would just be able to output the same AST as djot.js, and any filtering would be done independently?
>
> I'm wanting to implement citation and bibliography processing using djot for this project I'm working on, once John adds support for citations, so just wondering how that might work.

If an AST was implemented, one would be able to get an AST from the parsed jotdown events (https://docs.rs/jotdown/latest/jotdown/enum.Event.html) and then either

Then, one would be able to convert the modified AST into jotdown events in order to e.g. render it to HTML.
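As a rough illustration of the AST-to-events direction, here is a minimal sketch that walks a tree depth-first and emits a start event, the children, and then an end event for each container. The `Kind`, `Event` and `Node` types and the `to_events` function are simplified stand-ins made up for this example; they are not jotdown's real types, and a real conversion would also have to carry attributes and the remaining container kinds.

```rust
// Toy stand-ins for illustration only; NOT jotdown's real types.
#[derive(Debug, Clone, PartialEq)]
enum Kind {
    Para,
    Heading { level: usize },
}

#[derive(Debug, Clone, PartialEq)]
enum Event {
    Start(Kind),
    End(Kind),
    Str(String),
}

#[derive(Debug, Clone, PartialEq)]
enum Node {
    Container { kind: Kind, children: Vec<Node> },
    Str(String),
}

/// Depth-first walk: emit `Start`, then the children's events, then `End`.
fn to_events(nodes: &[Node], out: &mut Vec<Event>) {
    for node in nodes {
        match node {
            Node::Container { kind, children } => {
                out.push(Event::Start(kind.clone()));
                to_events(children, out);
                out.push(Event::End(kind.clone()));
            }
            Node::Str(text) => out.push(Event::Str(text.clone())),
        }
    }
}

/// Hypothetical helper that builds the "# heading / para" document as a tree.
fn example_doc() -> Vec<Node> {
    vec![
        Node::Container {
            kind: Kind::Heading { level: 1 },
            children: vec![Node::Str("heading".to_string())],
        },
        Node::Container {
            kind: Kind::Para,
            children: vec![Node::Str("para".to_string())],
        },
    ]
}
```

Collecting into a `Vec` keeps the sketch short; a lazy iterator like the `Iter` in the sketch above would need an explicit stack of positions instead of recursion.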

Alternatively, one could use the reference implementation for parsing and rendering, and only use the AST structs for modifying the AST in Rust (if one could convert back and forth between Rust AST and JSON AST).

Currently, no AST is implemented, though. In the current state, filtering for jotdown has to be done directly on the streamed events.
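To give an idea of what filtering on the event stream looks like, here is a small sketch of an iterator adapter that demotes every heading by one level. The `Kind` and `Event` types are simplified stand-ins invented for this example (jotdown's actual event type carries more data), and `demote_headings` is a hypothetical name. The thing to note is that both the start and the end event have to be rewritten consistently, which is the kind of bookkeeping an AST would take care of.

```rust
// Toy stand-ins for illustration only; NOT jotdown's real types.
#[derive(Debug, Clone, PartialEq)]
enum Kind {
    Para,
    Heading { level: usize },
}

#[derive(Debug, Clone, PartialEq)]
enum Event {
    Start(Kind),
    End(Kind),
    Str(String),
}

/// Streaming filter: bump the level of every heading by one.
/// Start and end events must be changed in the same way so they still match.
fn demote_headings<I>(events: I) -> impl Iterator<Item = Event>
where
    I: Iterator<Item = Event>,
{
    events.map(|event| match event {
        Event::Start(Kind::Heading { level }) => Event::Start(Kind::Heading { level: level + 1 }),
        Event::End(Kind::Heading { level }) => Event::End(Kind::Heading { level: level + 1 }),
        // Everything else passes through untouched.
        other => other,
    })
}
```

Because the adapter is itself an iterator, it could in principle sit between a parser and a renderer without buffering the whole document.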

bdarcus commented 1 year ago

I just posted a linked issue over there.

adaszko commented 3 months ago

Hi, I'm curious what the current status of this issue is. Is recovering the AST from a stream of events, or automatic generation from a schema definition, still the way forward?

hellux commented 3 months ago

> Hi, I'm curious what's the current status of this issue. Is the path to implementation to recover AST from a stream of events or automatic generation from a schema definition still the way forward?

Hi Adam, the problems with the automatic type generators seem to be unresolved; the issues linked in this thread do not have any updates. I also tried the latest versions of quicktype and typify with the current djot schema and they fail in the same way as before.

I haven't looked into exactly what causes the failures but solving it may require changes to either one of the generators or the djot schema. It might just be easier to manually create the AST Rust types and update them if the schema changes (which seems to be seldom).

Either way, the AST types and serialization/deserialization to/from e.g. JSON can be implemented entirely independently from jotdown. However, if we want to be able to parse to and render from an AST using jotdown, we still need to implement conversion between the AST objects and jotdown events.
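For the events-to-AST direction, one common approach is to keep a stack of open containers. The sketch below assumes simplified `Kind`, `Event` and `Node` types that are made up for illustration and are not jotdown's real types. `EventNotEnded` mirrors the error variant from the sketch at the top of this issue, while `UnexpectedEnd` and the function name `from_events` are hypothetical.

```rust
// Toy stand-ins for illustration only; NOT jotdown's real types.
#[derive(Debug, Clone, PartialEq)]
enum Kind {
    Para,
    Heading { level: usize },
}

#[derive(Debug, Clone, PartialEq)]
enum Event {
    Start(Kind),
    End(Kind),
    Str(String),
}

#[derive(Debug, Clone, PartialEq)]
enum Node {
    Container { kind: Kind, children: Vec<Node> },
    Str(String),
}

#[derive(Debug, PartialEq)]
enum Error {
    /// Input ended while a container was still open.
    EventNotEnded,
    /// An end event did not match the innermost open container (hypothetical variant).
    UnexpectedEnd,
}

/// Rebuild a tree from a flat start/end event stream using a stack.
fn from_events<I: IntoIterator<Item = Event>>(events: I) -> Result<Vec<Node>, Error> {
    // Each entry is an open container plus the children collected so far.
    let mut stack: Vec<(Kind, Vec<Node>)> = Vec::new();
    let mut top: Vec<Node> = Vec::new();

    for event in events {
        // Start events only open a container; the other events produce a finished node.
        let node = match event {
            Event::Start(kind) => {
                stack.push((kind, Vec::new()));
                continue;
            }
            Event::Str(text) => Node::Str(text),
            Event::End(kind) => {
                let (open, children) = stack.pop().ok_or(Error::UnexpectedEnd)?;
                if open != kind {
                    return Err(Error::UnexpectedEnd);
                }
                Node::Container { kind: open, children }
            }
        };
        // Attach the finished node to the innermost open container, if any.
        match stack.last_mut() {
            Some((_, children)) => children.push(node),
            None => top.push(node),
        }
    }

    if stack.is_empty() {
        Ok(top)
    } else {
        Err(Error::EventNotEnded)
    }
}
```

An implementation inside jotdown would presumably hang this logic off the `FromIterator` impl from the original sketch and also validate nesting rules such as blocks inside leaves.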

So, the way forward might be to just manually create some AST types that match the schema and then implement (in any order) Serialize/Deserialize and conversion to/from events.

bdarcus commented 3 months ago

I concluded that the schema automatically generated from the typescript code is less than ideal. Pretty sure that's why the conversion tools don't work correctly.

FWIW, I've used https://docs.rs/schemars/latest/schemars/ in a project of mine, and the schemas it produces seem much better.

hellux commented 3 months ago

> I concluded that the schema automatically generated from the typescript code is less than ideal. Pretty sure that's why the conversion tools don't work correctly.

Might be better to just ignore the schema then and instead look at the actual TypeScript types and JSON that are used/produced by the reference implementation.

> FWIW, I've used https://docs.rs/schemars/latest/schemars/ in a project of mine, and the schemas it produces seem much better.

I'm guessing schemars cannot help in our case of creating the AST types, as it is only for generating schemas from existing Rust types. But perhaps useful if we wish to improve the upstream schema after we've created the types in Rust.

clbarnes commented 1 week ago

I took a stab at implementing the AST: https://github.com/clbarnes/djot_ast

The code is a bit gross in order to maximise compatibility with the typescript impl. Most of the grossness is, at least, confined to serde stuff so shouldn't impact actual use of the AST. Integrating it with jotdown events is a task I haven't started yet.