Reference counting of DOM nodes

phoerious commented 3 years ago

Sorry to abuse the issue tracker as a support platform, but is there a way to reference-count individual DOM nodes or at least check if a given lxb_dom_node_t* pointer is still valid or has been destroyed in the meantime? If not, would this be a feature that could be added? The only thing I can find is lexbor_mraw_reference_count(), but that only seems to work for the whole document memory block.

I'm writing a Python wrapper in which DOMNode instances hold references to DOM nodes and since the lifetime of this instance is independent of the actual lxb_dom_node_t* pointer, I have no way of checking whether the element is still valid and all I can do is to warn the user not to use existing DOMNodeinstances after DOM manipulations. I checked whether Selectolax has any smart logic to prevent dangling pointers, but it looks like Selectolax is not taking any precautions either.

lexborisov commented 3 years ago

@phoerious

Sorry to abuse the issue tracker as a support platform

This is fine. Where else can you ask for additional features?! :)

reference-count individual DOM nodes or at least check

I planned to do this after adding CSS Styles.

I'm writing a Python wrapper in which DOMNode instances hold references to DOM nodes and since the lifetime of this instance is independent of the actual lxb_dom_node_t* pointer, I have no way of checking whether the element is still valid and all I can do is to warn the user not to use existing DOMNodeinstances after DOM manipulations.

It seems that you will not be able to do this even with links. If you destroy a lxb_dom_node_t *, you still won't get the number of links from it.

I already forgot how to write modules for C in Python. It seemed to me that Python itself monitors its objects and it has a callback for destruction (when the user does not need it, when all links to it are finished. Сallback will be called).

That is, you must return one object for the same lexbor DON Node until the user destroys it. For example:

head_1 = document.head
head_2 = document.find_by_name("head");

head_1 == head_2 # true

In Python C some like:

py_node *
py_document_head(..., lxb_html_document_t *document)
{
    lxb_dom_node_t *node;
    lxb_html_head_element_t *head;

    head = lxb_html_document_head_element(document);
    if (head == NULL) {
        return NULL;
    }

    node = lxb_dom_interface_node(head);

    /* We always check for the presence of a Python node. */

    if (node->user == NULL) {
        node->user = py_node_alloc(..., node);
        if (node->user == NULL) {
            /* Oh my God the Python out of memory. */
        }
        /* Set destroy callback for py_node `py_callback_to_destroy_node` */
    }

    return node->user;
}

py_node *
py_callback_to_destroy_node(..., py_node)
{
    /* If the user asks for an object again, see the code above. */
    py_node->ref_to_lxb_node->user = NULL;
}

phoerious commented 3 years ago

Okay, I didn't know you could use the user pointer for your own purposes, but it doesn't really solve my problem either. My logic is actually inverse to yours: I am constructing DOMNode objects, which simply hold a pointer to the Lexbor DOM node. I can reconstruct that object at any point, as the only data that it holds is that pointer. Once the DOMNode object goes out of scope, I check whether the referenced DOM node is orphaned and if yes, I delete it. If not, I don't do anything, since it will be deleted automatically when the DOM tree itself goes out of scope.

The two problems that can occur here are:

User still holds a reference to a DOMNode instance, whose lxb_dom_node_t has been destroyed somewhere else (e.g. because the parent tree got deleted)
User copies the DOMNode object and deletes one of the copies

I could solve the second issue by ensuring that DOMNode is a singleton for each individual lxb_dom_node_t* pointer, but I cannot solve the first issue without maintaining an expensive mapping of all pointers to all DOM nodes and checking the whole subtree each time I manipulate a node.

lexborisov commented 3 years ago

@phoerious

You are trying to implement a very expensive and slow binding/wrapper approach. Above I described the fastest approach. Everything is created only when needed. At the same time, you have complete control over memory and all elements.

phoerious commented 3 years ago

Oh, I am not creating any objects that are not needed. DOMNode is only the public access point for the user. Everything else is done internally by Lexbor. Whether I store a Lexbor pointer in a Python class or a Python pointer (with callback) in a Lexbor object doesn't really make a difference performance-wise. But with your approach, I still need a wrapper around all that, because a Python user cannot use a C function directly.

What I am doing is basically what you are doing here:

py_node_alloc(..., node);

but I am not doing the reverse binding, because I don't need it (unless perhaps you are repeatedly retrieving the same node, which is rather rare and could cause dangling pointers in the opposite direction, since user is not reference-counted).

phoerious commented 3 years ago

BTW:

HTML parser benchmark <title> extraction:
=========================================
Resiliparse (Lexbor):  42015 documents in 36.55s (1149.56 documents/s)
Selectolax (Lexbor):   42015 documents in 37.46s (1121.52 documents/s)
Selectolax (MyHTML):   42015 documents in 53.82s (780.72 documents/s)
BeautifulSoup4 (lxml): 42015 documents in 874.40s (48.05 documents/s)

:heart:

lexborisov commented 3 years ago

@phoerious

Awesome! Resiliparse is public wrapper?

phoerious commented 3 years ago

Yes. It's a web archive processing library and the wrapper is one new module.

Here's the repository: https://github.com/chatnoir-eu/chatnoir-resiliparse and here are the docs (stable version, no wrapper docs yet): https://resiliparse.chatnoir.eu/en/stable/

The new module hasn't been merged yet and I am still writing the documentation, but I'll tag a new version as soon as that is done and #125 has been released (so the CI can build the binaries and API docs). In the meantime, here is the module code:

https://github.com/chatnoir-eu/chatnoir-resiliparse/blob/feature/html-parsing/resiliparse/parse/html.pyx

In the end, I decided to use the user pointer for ensuring that the wrapper objects are indeed singletons (although that's not necessarily important), but the primary issue remains: I cannot track whether the underlying DOM node is still alive.

rushter commented 3 years ago

I think some kind of smart ID could fix part of this problem.

We need to have a numerical ID for each node, but such an ID must have one important property: it should be possible to check if a given ID is not inserted into any parent node with another ID.

So, if a user deletes a big subtree, we keep the root ID of the subtree. Next time, if a user tries to access any child node that has already been wrapped, we need to check that the ID of the child node does not depend on the removed parent.

Not sure if such IDs exist for the tree data structures. This does not solve the reference counting problem, but at least it makes it easy to check removed nodes.

phoerious commented 3 years ago

I would avoid anything that requires tree traversal (up or down). All I am interested in is whether the pointer that I currently hold is still valid. That could be done easily with some sort of weak smart pointer concept, where the user doesn't take ownership, but can rely on the reference either pointing to a valid object or NULL. Ideally that smart pointer type can be cast to a raw pointer, so you can use it as a drop-in replacement (similar to how Qt's QPointer works).

rushter commented 3 years ago

I would avoid anything that requires tree traversal (up or down)

This is why I called this smart ID, it should be possible to check if a child node is inside any given parent without making any traversals just by comparing two IDs (any parent/root, child)

phoerious commented 3 years ago

I don't think you really need something fancy if you hold a weak smart pointer reference to a node in your wrapper. If the pointer is != NULL, you know that your node still exists, otherwise your wrapper is stale and you should throw an exception or return None or something.

The same applies to the parent pointer, but here you don't really need a smart pointer if Lexbor takes care of resetting all parent pointers to NULL on the removal of a node (in the end, the effect is the same, the pointer is either valid or NULL). Then if your wrapper goes out of scope and the referenced node does not have a parent, you deallocate the subtree (that is possible already and Resiliparse does exactly that). What's missing, though, is that a wrapper referencing another node in this now-deallocated subtree would need to update its references to NULL. That would happen automatically if the pointer was a weak smart reference.

You could argue that you don't want to delete the whole subtree if there is still a wrapper holding a reference to a (grand-)child node, in which case it's a bit more complicated, because you would need a reverse check which nodes are being held by a wrapper. All nodes without a wrapper would need to be deallocated and all wrapped nodes and their subtrees would be retained but orphaned. I don't think that's possible without a global lookup table of existing wrappers or without holding a strong (owning) smart pointer in the wrapper and instructing your recursive delete to deallocate only nodes where the reference count is exactly 1 (i.e. only one node referencing it strongly and no wrapper). So for simplicity and speed, I would prefer to simply deallocate the whole subtree if you delete a node. If you want to retain a child, you can easily remove it from the DOM first before deleting the parent.

For that reason, Resiliparse has no non-recursive decompose() method (unlike Selectolax), because non-recursive node deletion will necessarily lead to memory leaks where orphaned subtrees without wrappers remain in memory until the owning document is deallocated.

phoerious commented 3 years ago

I just noticed that I also cannot destroy unparented nodes in a DOMNode wrapper object due to this bug/missing feature. Let's take this example:

tree = HTMLTree.parse("...")
new_element = tree.create_element('p')
new_text = tree.create_text_node('Hello World!')
new_element.append_child(new_text)
# SEGFAULT

This will crash if I do a recursive delete in the destructor of the DOM node wrapper object, because new_element will delete the node that is held by new_text, which will then have a dangling pointer. If I reverse the order of initialization, it works, because new_text will be deleted before new_element.

tree = HTMLTree.parse("...")
new_text = tree.create_text_node('Hello World!')
new_element = tree.create_element('p')
new_element.append_child(new_text)
# OK

There is no solution to this without reference counting. The only thing I can do is not to delete orphaned nodes, which will leak memory until the tree is destroyed if there are orphaned nodes (which is probably an edge case, but ugly nonetheless). The two potential solutions that might come to mind don't work for these reasons:

Don't delete recursively: Assumes that every element in the subtree has a wrapper object, which is not necessarily the case. Unwrapped objects would stay in memory without a recursive delete.
Implement a custom recursive delete that resets all wrappers of child nodes: Only works in this explicit case, but not if the subtree has been modified by other means of which the wrapper isn't aware (e.g. by setting innerHTML of some node).

I hope that emphasizes the importance of a reference counting feature for DOM nodes.

phoerious commented 1 year ago

@lexborisov Any progress on this?

lexborisov commented 1 year ago

@phoerious

I'll deal with this soon.

phoerious commented 2 weeks ago

How's the status? I have a pretty complete DOM API in Rust as a wrapper around lexbor now, but this thing is still bothering me.

See: https://github.com/chatnoir-eu/chatnoir-resiliparse/tree/rust/resiliparse/resiliparse_rust/resiliparse-common/src/parse/html

phoerious commented 2 weeks ago

I've checked the source code to find the best way reference counting could be implemented and I saw that some objects in the CSS parser module already have a ref_count attribute. Unfortunately, none of the DOM interfaces has this. Adding one to lxb_dom_node would definitely break ABI compatibility and the counter would need to be updated in (some of) the calls to at least lxb_dom_*_create_interface and lxb_dom_*_destroy_interface. Perhaps it's also enough to do it in lxb_dom_node_destroy.

I think this is doable, but it doesn't solve the problem of reallocs where existing pointers become dangling even if the object itself still exists. Fortunately, this doesn't seem to happen anywhere for any of the raw memory chunks and is used only for strings.

Another approach would be to add a lexbor_hash_t to lxb_dom_document which keeps track of created and destroyed objects within that document.

I can try to make a PR for this, but would like to know your favourite approach.

lexborisov commented 2 weeks ago

Hi @phoerious

Yes, I understand your need for it, but at the current stage I am doing layout and will implement the dependency counter from the needs of it. It took me a long time to figure out how it should be arranged, and it's definitely not via hash. As soon as the first rendering is ready, it will be clear what it should look like. I'm not sure yet. Any ideas are welcome!

phoerious commented 2 weeks ago

I've made a quick and dirty draft of it here, which I'm testing at the moment. I believe this works for my use case, but it can certainly be improved. Since this changes the size of the lxb_dom_node struct, it will crash if you use an incompatible library version.

See: https://github.com/lexbor/lexbor/compare/master...phoerious:lexbor:master

phoerious commented 2 weeks ago

I spoke too soon. This solution gets me there most of the way, but when I deallocate attributes, I get stuck in an endless loop somewhere. There are just too many partially overlapping ways attributes are allocated and deallocated.

phoerious commented 2 weeks ago

Fixed it. The new approach requires more code changes, but maintains the reference counts directly inside the *_interface_create functions, which is more robust. This approach works for me and ASAN doesn't report any violations or leaks.

lexborisov commented 2 weeks ago

@phoerious

I'm not ready to accept such changes yet, although they seem to be true. I'll come back to this topic right after layout.

phoerious commented 2 weeks ago

I'm fine with running this as a custom patch for now, it just makes it more difficult for others to compile from sources.

lexbor / lexbor

Reference counting of DOM nodes #132