Open phoerious opened 3 years ago
@phoerious
Sorry to abuse the issue tracker as a support platform
This is fine. Where else can you ask for additional features?! :)
reference-count individual DOM nodes or at least check
I planned to do this after adding CSS Styles.
I'm writing a Python wrapper in which DOMNode instances hold references to DOM nodes and since the lifetime of this instance is independent of the actual lxb_dom_node_t* pointer, I have no way of checking whether the element is still valid and all I can do is to warn the user not to use existing DOMNode instances after DOM manipulations.
It seems that you will not be able to do this even with references. If you destroy an lxb_dom_node_t *, you still won't get the number of references from it.
I have already forgotten how to write C modules for Python. As far as I remember, Python tracks its own objects and provides a destruction callback, which is called once the user no longer needs the object, i.e. when all references to it are gone.
That is, you must return the same object for the same lexbor DOM node until the user destroys it. For example:
head_1 = document.head
head_2 = document.find_by_name("head")
head_1 == head_2  # True
In the Python C API, something like:
py_node *
py_document_head(..., lxb_html_document_t *document)
{
    lxb_dom_node_t *node;
    lxb_html_head_element_t *head;

    head = lxb_html_document_head_element(document);
    if (head == NULL) {
        return NULL;
    }

    node = lxb_dom_interface_node(head);

    /* We always check for the presence of a Python node. */
    if (node->user == NULL) {
        node->user = py_node_alloc(..., node);
        if (node->user == NULL) {
            /* Oh my God, Python is out of memory. */
            return NULL;
        }

        /* Set destroy callback for py_node `py_callback_to_destroy_node` */
    }

    return node->user;
}

void
py_callback_to_destroy_node(..., py_node *pnode)
{
    /* If the user asks for an object again, see the code above. */
    pnode->ref_to_lxb_node->user = NULL;
}
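The same "one wrapper per node" idea can be modelled in pure Python. This is only a sketch: FakeNode, DOMNode and wrap are hypothetical names standing in for lxb_dom_node_t, the Python-side wrapper and the accessor, and the weak reference callback plays the role of the destroy callback. It is not Lexbor's API.

```python
import weakref


class FakeNode:
    """Stand-in for lxb_dom_node_t; only models the `user` back-pointer."""

    def __init__(self, name):
        self.name = name
        self.user = None  # weak reference to the live wrapper, or None


class DOMNode:
    """Hypothetical Python-side wrapper around a FakeNode."""

    def __init__(self, node):
        self.node = node


def wrap(node):
    """Return the one live wrapper for `node`, creating it on first use."""
    wrapper = node.user() if node.user is not None else None
    if wrapper is None:
        wrapper = DOMNode(node)

        def on_wrapper_destroyed(ref, node=node):
            # Counterpart of the destroy callback: reset the back-pointer,
            # but only if it still refers to the wrapper that just died.
            if node.user is ref:
                node.user = None

        node.user = weakref.ref(wrapper, on_wrapper_destroyed)
    return wrapper


head = FakeNode("head")
head_1 = wrap(head)
head_2 = wrap(head)
assert head_1 is head_2  # both names refer to the same wrapper object
```

Note that this only guarantees wrapper identity. It does nothing about a node that is destroyed while a wrapper still points at it.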
Okay, I didn't know you could use the user pointer for your own purposes, but it doesn't really solve my problem either. My logic is actually the inverse of yours: I am constructing DOMNode objects, which simply hold a pointer to the Lexbor DOM node. I can reconstruct that object at any point, as the only data that it holds is that pointer. Once the DOMNode object goes out of scope, I check whether the referenced DOM node is orphaned and if yes, I delete it. If not, I don't do anything, since it will be deleted automatically when the DOM tree itself goes out of scope.
The two problems that can occur here are:
1. A user accesses a DOMNode instance whose lxb_dom_node_t has been destroyed somewhere else (e.g. because the parent tree got deleted).
2. A user copies a DOMNode object and deletes one of the copies.
I could solve the second issue by ensuring that DOMNode is a singleton for each individual lxb_dom_node_t* pointer, but I cannot solve the first issue without maintaining an expensive mapping of all pointers to all DOM nodes and checking the whole subtree each time I manipulate a node.
@phoerious
You are trying to implement a very expensive and slow binding/wrapper approach. Above I described the fastest approach. Everything is created only when needed. At the same time, you have complete control over memory and all elements.
Oh, I am not creating any objects that are not needed. DOMNode is only the public access point for the user. Everything else is done internally by Lexbor. Whether I store a Lexbor pointer in a Python class or a Python pointer (with callback) in a Lexbor object doesn't really make a difference performance-wise. But with your approach, I still need a wrapper around all that, because a Python user cannot use a C function directly.
What I am doing is basically what you are doing here:
py_node_alloc(..., node);
but I am not doing the reverse binding, because I don't need it (unless perhaps you are repeatedly retrieving the same node, which is rather rare and could cause dangling pointers in the opposite direction, since user is not reference-counted).
BTW:
HTML parser benchmark <title> extraction:
=========================================
Resiliparse (Lexbor): 42015 documents in 36.55s (1149.56 documents/s)
Selectolax (Lexbor): 42015 documents in 37.46s (1121.52 documents/s)
Selectolax (MyHTML): 42015 documents in 53.82s (780.72 documents/s)
BeautifulSoup4 (lxml): 42015 documents in 874.40s (48.05 documents/s)
:heart:
@phoerious
Awesome! Is Resiliparse a public wrapper?
Yes. It's a web archive processing library and the wrapper is one new module.
Here's the repository: https://github.com/chatnoir-eu/chatnoir-resiliparse and here are the docs (stable version, no wrapper docs yet): https://resiliparse.chatnoir.eu/en/stable/
The new module hasn't been merged yet and I am still writing the documentation, but I'll tag a new version as soon as that is done and #125 has been released (so the CI can build the binaries and API docs). In the meantime, here is the module code:
In the end, I decided to use the user pointer for ensuring that the wrapper objects are indeed singletons (although that's not necessarily important), but the primary issue remains: I cannot track whether the underlying DOM node is still alive.
I think some kind of smart ID could fix part of this problem.
We need to have a numerical ID for each node, but such an ID must have one important property: it should be possible to check if a given ID is not inserted into any parent node with another ID.
So, if a user deletes a big subtree, we keep the root ID of the subtree. Next time, if a user tries to access any child node that has already been wrapped, we need to check that the ID of the child node does not depend on the removed parent.
Not sure if such IDs exist for the tree data structures. This does not solve the reference counting problem, but at least it makes it easy to check removed nodes.
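One labelling that has the property asked for here is pre/post-order interval numbering, which is a different technique from anything proposed in this thread. Each node gets two numbers from a single depth-first walk, and a node lies inside a subtree exactly when its interval is nested in the subtree root's interval. The sketch below uses a hypothetical Node class, not Lexbor structures.

```python
class Node:
    """Minimal tree node for illustration only."""

    def __init__(self, name):
        self.name = name
        self.children = []
        self.pre = -1   # set by label()
        self.post = -1  # set by label()


def label(root):
    """Assign pre/post numbers with one depth-first walk (recursive sketch)."""
    counter = 0

    def visit(node):
        nonlocal counter
        node.pre = counter
        counter += 1
        for child in node.children:
            visit(child)
        node.post = counter
        counter += 1

    visit(root)


def is_inside(node, ancestor):
    """True if `node` is `ancestor` or a descendant; compares labels only."""
    return ancestor.pre <= node.pre and node.post <= ancestor.post


html, head, body, p = Node("html"), Node("head"), Node("body"), Node("p")
html.children = [head, body]
body.children = [p]
label(html)
```

When a subtree is removed, it would be enough to remember the (pre, post) pair of its root and compare wrapped nodes against it. The catch is that the labels are only valid until the tree is modified: inserting nodes requires relabelling or leaving gaps in the numbering, so this trades traversal at check time for bookkeeping at mutation time.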
I would avoid anything that requires tree traversal (up or down). All I am interested in is whether the pointer that I currently hold is still valid. That could be done easily with some sort of weak smart pointer concept, where the user doesn't take ownership, but can rely on the reference either pointing to a valid object or NULL. Ideally that smart pointer type can be cast to a raw pointer, so you can use it as a drop-in replacement (similar to how Qt's QPointer works).
I would avoid anything that requires tree traversal (up or down)
This is why I called this smart ID, it should be possible to check if a child node is inside any given parent without making any traversals just by comparing two IDs (any parent/root, child)
I don't think you really need something fancy if you hold a weak smart pointer reference to a node in your wrapper. If the pointer is != NULL, you know that your node still exists, otherwise your wrapper is stale and you should throw an exception or return None or something.
The same applies to the parent pointer, but here you don't really need a smart pointer if Lexbor takes care of resetting all parent pointers to NULL on the removal of a node (in the end, the effect is the same, the pointer is either valid or NULL). Then if your wrapper goes out of scope and the referenced node does not have a parent, you deallocate the subtree (that is possible already and Resiliparse does exactly that). What's missing, though, is that a wrapper referencing another node in this now-deallocated subtree would need to update its references to NULL. That would happen automatically if the pointer was a weak smart reference.
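A minimal model of the weak-reference idea, assuming each node owns a small shared cell that is reset when the node is destroyed. Slot, Node, DOMNode, destroy_subtree and StaleNodeError are all hypothetical names for this sketch. In C the cell would have to outlive the node it describes, which is the part Lexbor does not currently provide.

```python
class StaleNodeError(RuntimeError):
    """Raised when a wrapper is used after its node was destroyed."""


class Slot:
    """Shared cell standing in for a weak pointer: holds the node or None."""

    def __init__(self, node):
        self.node = node


class Node:
    """Stand-in for a DOM node; not the Lexbor struct."""

    def __init__(self, tag):
        self.tag = tag
        self.parent = None
        self.children = []
        self.slot = Slot(self)

    def append_child(self, child):
        child.parent = self
        self.children.append(child)


def destroy_subtree(root):
    """Detach `root` and invalidate it together with all of its descendants."""
    if root.parent is not None:
        root.parent.children.remove(root)
        root.parent = None
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        node.children = []
        node.parent = None
        node.slot.node = None  # every wrapper sharing this slot now sees None


class DOMNode:
    """Wrapper that keeps the slot, never the raw node."""

    def __init__(self, node):
        self._slot = node.slot

    @property
    def tag(self):
        node = self._slot.node
        if node is None:
            raise StaleNodeError("underlying DOM node has been destroyed")
        return node.tag
```

With this arrangement, a wrapper for a text node becomes stale as soon as its parent element is destroyed, and the access fails with an exception instead of dereferencing a dangling pointer.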
You could argue that you don't want to delete the whole subtree if there is still a wrapper holding a reference to a (grand-)child node, in which case it's a bit more complicated, because you would need a reverse check which nodes are being held by a wrapper. All nodes without a wrapper would need to be deallocated and all wrapped nodes and their subtrees would be retained but orphaned. I don't think that's possible without a global lookup table of existing wrappers or without holding a strong (owning) smart pointer in the wrapper and instructing your recursive delete to deallocate only nodes where the reference count is exactly 1 (i.e. only one node referencing it strongly and no wrapper). So for simplicity and speed, I would prefer to simply deallocate the whole subtree if you delete a node. If you want to retain a child, you can easily remove it from the DOM first before deleting the parent.
For that reason, Resiliparse has no non-recursive decompose() method (unlike Selectolax), because non-recursive node deletion will necessarily lead to memory leaks where orphaned subtrees without wrappers remain in memory until the owning document is deallocated.
I just noticed that I also cannot destroy unparented nodes in a DOMNode wrapper object due to this bug/missing feature.
Let's take this example:
tree = HTMLTree.parse("...")
new_element = tree.create_element('p')
new_text = tree.create_text_node('Hello World!')
new_element.append_child(new_text)
# SEGFAULT
This will crash if I do a recursive delete in the destructor of the DOM node wrapper object, because new_element will delete the node that is held by new_text, which will then have a dangling pointer. If I reverse the order of initialization, it works, because new_text will be deleted before new_element.
tree = HTMLTree.parse("...")
new_text = tree.create_text_node('Hello World!')
new_element = tree.create_element('p')
new_element.append_child(new_text)
# OK
There is no solution to this without reference counting. The only thing I can do is not to delete orphaned nodes, which will leak memory until the tree is destroyed if there are orphaned nodes (which is probably an edge case, but ugly nonetheless). The two potential solutions that might come to mind don't work for these reasons:
I hope that emphasizes the importance of a reference counting feature for DOM nodes.
@lexborisov Any progress on this?
@phoerious
I'll deal with this soon.
How's the status? I have a pretty complete DOM API in Rust as a wrapper around lexbor now, but this thing is still bothering me.
I've checked the source code to find the best way reference counting could be implemented and I saw that some objects in the CSS parser module already have a ref_count attribute. Unfortunately, none of the DOM interfaces has this. Adding one to lxb_dom_node would definitely break ABI compatibility and the counter would need to be updated in (some of) the calls to at least lxb_dom_*_create_interface and lxb_dom_*_destroy_interface. Perhaps it's also enough to do it in lxb_dom_node_destroy.
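To illustrate what a per-node counter would buy, here is a small Python model in which both wrappers and tree membership count as references. Node, acquire, release and append_child are hypothetical helpers for this sketch. They do not correspond to Lexbor functions or to any particular patch.

```python
class Node:
    """Stand-in for a DOM node carrying a reference counter."""

    def __init__(self, tag):
        self.tag = tag
        self.parent = None
        self.children = []
        self.ref_count = 0
        self.freed = False


def acquire(node):
    """Take one reference (held by a wrapper or by a parent node)."""
    node.ref_count += 1
    return node


def release(node):
    """Drop one reference; free the node once nobody refers to it."""
    assert not node.freed, "release() on a freed node would be a use-after-free"
    node.ref_count -= 1
    if node.ref_count == 0:
        for child in node.children:
            child.parent = None
            release(child)  # give up the parent's reference to the child
        node.children = []
        node.freed = True


def append_child(parent, child):
    """Attach `child`; the parent now holds a reference to it."""
    child.parent = parent
    parent.children.append(acquire(child))


# The crash scenario: two wrappers, element released before its text child.
element = acquire(Node("p"))      # reference held by the element wrapper
text = acquire(Node("#text"))     # reference held by the text wrapper
append_child(element, text)
release(element)                  # element is freed, text survives
assert element.freed and not text.freed
release(text)                     # last reference gone, text is freed
assert text.freed
```

Releasing in the opposite order works as well, because the text node is only freed once both its wrapper and its parent have let go of it. A child that outlives its parent this way is retained as an orphan with its parent pointer reset.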
I think this is doable, but it doesn't solve the problem of reallocs where existing pointers become dangling even if the object itself still exists. Fortunately, this doesn't seem to happen anywhere for any of the raw memory chunks and is used only for strings.
Another approach would be to add a lexbor_hash_t to lxb_dom_document which keeps track of created and destroyed objects within that document.
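A rough model of such a per-document registry, with a plain Python set standing in for the hash table. Document and its methods are hypothetical names, not Lexbor's API. The sketch uses monotonically increasing IDs instead of addresses as keys, on the assumption that a freed address could later be reused by a new node and then be mistaken for the old one.

```python
class Document:
    """Tracks which node IDs are currently alive in this document."""

    def __init__(self):
        self._live = set()
        self._next_id = 0

    def create_node(self):
        """Register a new node and return its ID (never reused)."""
        node_id = self._next_id
        self._next_id += 1
        self._live.add(node_id)
        return node_id

    def destroy_node(self, node_id):
        """Forget a node; destroying it twice is harmless here."""
        self._live.discard(node_id)

    def is_alive(self, node_id):
        """Constant-time check a wrapper can run before each access."""
        return node_id in self._live


doc = Document()
p_id = doc.create_node()
doc.destroy_node(p_id)
assert not doc.is_alive(p_id)
```

A wrapper would store the ID next to the raw pointer and call is_alive before dereferencing. The cost is one set update per node creation and destruction, plus one lookup per access.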
I can try to make a PR for this, but would like to know your favourite approach.
Hi @phoerious
Yes, I understand your need for it, but at the current stage I am working on layout and will implement the dependency counter based on what layout requires. It took me a long time to figure out how it should be arranged, and it's definitely not via a hash. As soon as the first rendering is ready, it will be clear what it should look like. I'm not sure yet. Any ideas are welcome!
I've made a quick and dirty draft of it here, which I'm testing at the moment. I believe this works for my use case, but it can certainly be improved. Since this changes the size of the lxb_dom_node struct, it will crash if you use an incompatible library version.
See: https://github.com/lexbor/lexbor/compare/master...phoerious:lexbor:master
I spoke too soon. This solution gets me there most of the way, but when I deallocate attributes, I get stuck in an endless loop somewhere. There are just too many partially overlapping ways attributes are allocated and deallocated.
Fixed it. The new approach requires more code changes, but maintains the reference counts directly inside the *_interface_create functions, which is more robust. This approach works for me and ASAN doesn't report any violations or leaks.
@phoerious
I'm not ready to accept such changes yet, although they seem to be correct. I'll come back to this topic right after layout.
I'm fine with running this as a custom patch for now, it just makes it more difficult for others to compile from sources.
Sorry to abuse the issue tracker as a support platform, but is there a way to reference-count individual DOM nodes or at least check if a given lxb_dom_node_t* pointer is still valid or has been destroyed in the meantime? If not, would this be a feature that could be added? The only thing I can find is lexbor_mraw_reference_count(), but that only seems to work for the whole document memory block.
I'm writing a Python wrapper in which DOMNode instances hold references to DOM nodes and since the lifetime of this instance is independent of the actual lxb_dom_node_t* pointer, I have no way of checking whether the element is still valid and all I can do is to warn the user not to use existing DOMNode instances after DOM manipulations. I checked whether Selectolax has any smart logic to prevent dangling pointers, but it looks like Selectolax is not taking any precautions either.