Rust user observations - Githubissues

qtfkwk commented 2 years ago

I've been testing both tree-sitter-md and tree-sitter-markdown over the past couple days and have the following observations. Note that each observation is standalone and they aren't listed in any particular order. Also note that my use case is general in that it is not specifically or only concerned with syntax highlighting but includes other uses such as general markdown parsing/processing similar to pulldown-cmark.

While tree-sitter-md (this crate) is more recently updated and maintained than tree-sitter-markdown, and depends on tree-sitter 0.20, its grammar seems more basic than tree-sitter-markdown's. In particular:
- Block node inline kinds are present, and there are inline node inline kinds, but there seems to be no kind for just the plain text (?), so this would require additional calculations to extract the plain text, or replace parts of the rendered content with the inline spans... (?)
- No support for tables due to strict commonmark spec (?)
- Unsure if there are others... (?)

tree-sitter-md uses a 2-pass approach which forces walking those multiple trees. Perhaps the MarkdownTree struct could provide a method to walk all of the trees? tree-sitter-markdown doesn't work like this, and there aren't any drawbacks afaik, because whether a node is a block or inline can be determined by its kind. Here's an example of usage for this crate, but tree-sitter-markdown does not need the nested bit to walk the inline tree(s).

let input = String::from("Markdown content here...");

let mut parser = MarkdownParser::default();
let tree = parser.parse(&input.as_bytes(), None).unwrap();

// walk block tree...
let mut cursor = tree.block_tree().walk();
'outer: loop {
    let node = cursor.node();

    // do something with node...

    // walk inline tree...
    if let Some(inline_tree) = tree.inline_tree(&node) {
        let mut inline_cursor = inline_tree.walk();
        'inline_outer: loop {
            let inline_node = inline_cursor.node();

            // do something with inline_node...

            if !inline_cursor.goto_first_child() {
                if !inline_cursor.goto_next_sibling() {
                    loop {
                        if !inline_cursor.goto_parent() {
                            break 'inline_outer;
                        }
                        if inline_cursor.goto_next_sibling() {
                            break;
                        }
                    }
                }
            }
        }
    }

    if !cursor.goto_first_child() {
        if !cursor.goto_next_sibling() {
            loop {
                if !cursor.goto_parent() {
                    break 'outer;
                }
                if cursor.goto_next_sibling() {
                    break;
                }
            }
        }
    }
}

In my initial testing, parsing 2 large but slightly different files both took roughly the same amount of time (50 ms) for each file, which should not be the case because it should only be processing the changes (right?). Using tree-sitter-markdown, it took 50 ms for the first one and 4 ms for the second one. Perhaps I'm doing something wrong?
Saw an error during cargo test: thread 'project::tests::doc_from_empty' panicked at 'Could not load injection query: QueryError { row: 0, column: 1, offset: 1, message: "inline", kind: NodeType }', tree-sitter-md-0.1.1/bindings/rust/lib.rs:113:14. I did not see this on a standalone attempt; it only cropped up when I combined both tree-sitter-md and tree-sitter-markdown into the same project, so would guess that is causing some specific conflict.

MDeiml commented 2 years ago

The big difference between tree-sitter-md and tree-sitter-markdown is, that the latter basically uses a handwritten C parser and then couples this to the tree-sitter parsing system with some hacks. That created a few problems for people and was also a bit unstable. That motivated me to write this parser, which mostly uses "normal" tree-sitter grammars.

There are some caveats though. This parser (tree-sitter-md) is slower and parsing has to follow some restricting rules.

First of all it's hard / maybe impossible to have explicit "text content" nodes. I might give this another try though. I think it's still ok, as the information is present, but I see the impracticality. Another solution would be to provide methods to extract text in the binding library.

The double tree structure is also caused by these restrictions. But please note that almost all markdown parsers use 2 passes (the spec even recommends to do so). Here as well it might make sense to abstract a bit more in the bindings and provide something like a MarkdownTreeCursor.

There are some inefficiencies still and I'm trying to work on those. For the testing you did in (3.) you should make sure though that you use MarkdownTree::edit to mark the changes and then pass the old tree to MarkdownParser::parse.

The error you mention in (4.) is definitely due to both libraries being loaded, as they both add C bindings with the same name and cargo can't deal with that. (That's ok though, as noone should probably use both libraries :D)

Finally there is already support for tables. I just didn't get around to pushing these changes to crates.io. (Also tables are still not tested very well)

MDeiml commented 2 years ago

Also, thanks! This is very valuable feedback

MDeiml commented 1 year ago

I implemented a lot of your feedback. I added MarkdownCursor to traverse the tree more easily and without having to think about the double structure. Also the newest release includes the grammar with support for pipe tables and more.

tree-sitter-grammars / tree-sitter-markdown

Rust user observations #61