Question about unvisited nodes:

vsch / flexmark-java

CommonMark/Markdown Java parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules.

BSD 2-Clause "Simplified" License

2.28k stars 270 forks source link

Question about unvisited nodes: #379

Open Parth opened 4 years ago

Parth commented 4 years ago

I'm trying to understand the behavior of NodeVisitor a bit more deeply. Specifically what happens when you don't handle one of the nodes. I've dug through the wiki & javadocs and come across 2 relevant remarks:

VisitorSample.java

any node type not handled by the visitor will default to visiting its children

NodeVisitor javadoc

Node visitor that visits all children by default

Like the VisitorSample, I'm trying to create a simple list of NodeVisitors, but ideally I'm trying to add a "catch-all" for the nodes I haven't visited (implemented yet). Ideally I'd like to not have an exhaustive list of all the nodes.

I've tried things like Node but as visit's documentation states, it'll just visit it's children so that doesn't work (and that's expected). However even if I use visitNodeOnly (with a node visitor) it doesn't visit things like CodeFencedBlock.

Thanks, Parth

vsch commented 4 years ago

@Parth, the visitor will visit children of all nodes that you do not explicitly handle. The reason is simple. The nodes you handle can be children of nodes you don't. So if default does not visit children, the visitor will be useless without you explicitly handling all nodes.

I am assuming you are using version 0.50.x.

To intercept all nodes whether you handle them or not, override the visit(Node) method of NodeVisitor. It gets called for every node visited and dispatches to specific handler or visitChildren(), don't forget to call super.visit(node) unless you handle the dispatching to handler and/or visitChildren in your implementation. Take a look at the code for the class it is very short. Documentation is not always up to date, the code is.

The details change in version 0.60 with a new visitor model but the same functionality is still available for processing all nodes before dispatching to specific handler.

The new implementation is more flexible because you can create your own processing specific visitors where the visit function has processing specific arguments, not just the node. All without having to re-implement the base classes, handlers and the rest. This was the purpose for the rewrite so I can easily create processing visitors for HtmlRenderer, Formatter, DocxConverter by reusing the visitor model.

Parth commented 4 years ago

okay so if I'm understanding correctly: I can override visit to get al nodes regardless of whether they've been visited or not & regardless of whether they're leaf nodes or not.

I can override visitChildren to find nodes that don't have a handler, but their children might actually be handled.

Am I missing something: with 0.50 there's no easy way to say: "I want all leftover un-visted nodes" right?

vsch commented 4 years ago

okay so if I'm understanding correctly: I can override visit to get al nodes regardless of whether they've been visited or not & regardless of whether they're leaf nodes or not.

There is no distinction between leaf and non-leaf nodes other than leaf nodes happen not have children. Nothing stops you from adding child nodes to any node, including a leaf node.

Nodes are not visited in the handler until NodeVisitor visit(Node) dispatches the call to handler.visit(node). What you see as the visit for the node happens in the NodeVisitor.visit(Node) which gets the registered handler and if there is one invokes visit(node). So all node visits go through this method, the visit in the handler only happens when this method invokes it.

Here is the implementation in NodeVisitor:

    public void visit(Node node) {
        VisitHandler<?> handler = getHandler(node);
        if (handler != null) {
            handler.visit(node);
        } else {
            visitChildren(node);
        }
    }

Am I missing something: with 0.50 there's no easy way to say: "I want all leftover un-visted nodes" right?

I don't understand "no easy way ...". The implementation of visit(Node) is literally 8 lines and in there you have the handler !=null test which means there is a registered handler and in the else branch, is the processing for your left-over nodes which invokes visitChildren() for these nodes.

Just override the visit(Node) method and insert your code in the else branch. You can do processing before and/or after visitChildren() for the left-over nodes.

I would consider this as the easy way of getting all left-over nodes.

I can override visitChildren to find nodes that don't have a handler, but their children might actually be handled.

You cannot override the visitChildren because then any handler doing a visitChildren will be running your code. Leave the visitChildren to invoke visit on each child node unless you have specific need to do something else.

Here is the code for visitChildren(), no magic, just go through each child and invoke visit(Node) on it, which goes to the above function, fetches a handler and does default visitChildren if there is no handler:

    public void visitChildren(Node parent) {
        Node node = parent.getFirstChild();
        while (node != null) {
            // A subclass of this visitor might modify the node, resulting in getNext returning a different node or no
            // node after visiting it. So get the next node before visiting.
            Node next = node.getNext();
            visit(node);
            node = next;
        }
    }

So the meat of visiting all nodes happens by recursive calls of these two functions with a sprinkling of calls to registered handlers.

Parth commented 4 years ago

Hey, appreciate you taking the time to help me.

There's still a gap in my understanding:

    public void visit(Node node) {
        VisitHandler<?> handler = getHandler(node);
        if (handler != null) {
            handler.visit(node);
        } else {
            // Add my code here for a "left over node"
            visitChildren(node);
        }
    }

I understand this is where I would put code for nodes with no handlers, but what I'm trying to express is: wouldn't I get caught up on higher-level "container-like" nodes? Nodes like Paragraph won't have handlers, but will get caught up here.

I would love to be overlooking something simple, but I still don't see a clear way generate a list of nodes who don't have handlers and their children don't have handlers either.

Maybe if visit didn't return void I could start see a helper function that marked nodes or something like that

vsch commented 4 years ago

@Parth, now I am starting to see what you are trying to do. The only thing I do not understand is what are you trying to accomplish?

A node may not have a handler but some of the children do but that means nothing for the node. It is still missing a handler.

Nodes like paragraph are pure containers and have no real rendering information other than a paragraph wrapper.

Some nodes can contain children but must wrap them in their custom parental implementation.

Can you explain what you are trying to accomplish and what is your use case?

The tree visiting is a recursive process so if you want to resolve all nodes you will need to first traverse the tree and build your data structures for the whole tree. Then you would analyze your data structure to see what is missing.

Parth commented 4 years ago

Sorry for not being clear: I'm working on: https://github.com/lockbook/lockbook-dev-desktop it's a security first live markdown rendering notes app. (checkout releases if you want to play around with it, early stages of dev)

I'm using flexmark + richtextfx for the live markdown syntax.

This approach worked fine for the first pass but I'm looking to make 2 changes at this point in time.

I want to capture all Text, but it seems my definition of text is more broad then flexmark's TextNode. My definition is: "all the text I haven't captured". I need to sort this out to make sure things that change from being a formatted node -> an unformatted node (deleting a tick from a code block) accurately have their style reset to the default text formatting.
Flexmark is incredibly fast, but richtextfx is not. Document updates out of my visitor are very slow (~100ms+ depending on document size). Now this is primarily because I'm resending many of the same style document updates again and again. With this round of development I'm looking to cache redraws and try to determine: "if we've already styled this node, don't restyle it".

I have a handful of ideas for doing this, but my first attempt was going to be to maintain a linked list of hashcodes for all the top level nodes (output of: parsed.getChildren). If any of the hashes change re-style this node, and all subsequent nodes. Or something like this.

But these are really 2 separate issues, just figured I'd give you the whole context for both how I am using flexmark right now, and how I'd like to (need to) use it.

vsch commented 4 years ago

@Parth, thank you for the explanation. It is essential to have the right context otherwise nothing makes sense.

Effectively, you need a cache with invalidation when something changes. I completely understand because I have the same issue in Markdown Navigator pluiging for IntelliJ, except flexmark was not the issue but native IDE structures.

It is not an easy problem because small text changes can result in large AST changes.

What are you generating for richtextfx? I am not familiar with it. I think you will need an intermediate data format which will render quickly from flexmark visitor. Then the results would be compared to previously rendered structure and an incremental update generated for richtextfx.

I don't think you can go from AST in flexmark and generate what you need with caching concerns. You really need an intermediate flat format that you can easily compare. Recursive AST nature makes this impossible.

You can take a look at some classes I have in the library which I use to flatten the AST for post processing so that each processor does not traverse the whole nested AST looking for its node types. Instead I create flat lists by node type and parent type for all registered processors.

Each processor gets called only with its nodes and no traversal of the tree is done.

In your case you are interested in all nodes so the concept has to be modified. I would suggest looking into mapping the AST into a flat or hash mapped set of lists that you can compare.

The post processing is done by: PostProcessorManager.java: Line 58

vsch commented 4 years ago

BTW, I have been breaking my head trying to figure out a similar problem. I want to add incremental HTML dom updates for the HTML JavaFX WebView preview in the plugin.

Large files get rendered and updated whole. Seems like a waste if only a few changed dom nodes or their text can be changed during typing in a document. Even considered using react with its incremental dom update.

I think you are looking at something similar. React has an intermediate format for its dom that it uses to quickly update changed nodes in the browser's dom.

Parth commented 4 years ago

What would be the problem with the following strategy:

take a Document's .getChildren turn it into a linked list of hash codes.
If you did not have a linked list previously, style all the nodes.
Everytime text changes compare each child to the linked list, for the first child where the hash is different recompute it as well as subsequent nodes.

Potential problems I can imagine:

I couldn't find a hashcode implementation in the codebase. Is there a reason this doesn't exist that I'm not seeing, or was it simply not needed until now.
There are some pathalogical cases that mess this up (always appending to the head of the document, or incredibly nested markdown). I think these cases are rare, perhaps better could be done, and I would be willing to think more, but as is this would already be a pretty reasonable speedup for me.

Let me know your thoughts

vsch commented 4 years ago

@Parth, hashCode() was not implemented because I did not want to compare nodes, the default object hashCode is good enough. It makes no sense to compare nodes for equality on anything but the object address.

For your use you will need to implement your own hash code computation based on which properties of the node are relevant to you.

I think what you need is a flat list of all nodes built from a depth first traversal of the tree which will result in the order the elements appear in the markdown document, so it is natural.

Then you can compare two lists based on node type and nesting level to see if this node is changed.

It is a difference computation problem and unless you know how to do it you will probably need to lookup a few algorithms to make it efficient and correct. Otherwise you might have N^2 execution time or your caching will speed things up but might fail to detect stale data and not update when needed.