Closed jyn514 closed 4 years ago
This reduces memory usage in NodeData by a factor of 5 (80 -> 16), and usage in Node by a factor of 2 (120 -> 56).
This analysis is definitely too simplistic because it ignores the memory usage of Box
allocations. Yes Box<RefCell<String>>
is 8 bytes but the 32 bytes for RefCell<String>
are still stored somewhere. In some cases adding a pointer indirection can increase total memory usage, because the 8 bytes for that pointer also need to be stored.
The idea of boxing some enum variants is to reduce unused padding for other, smaller variants. The overall effect is very dependent on which variants are more frequent in some concrete use case. Here DocumentFragment
is likely the variant that profits the most from this PR, but I suspect it’s very uncommon in real-world document.
I suspect that unfortunately in this enum the largest variant, Element
, is typically also one of the most frequent. So I’d be inclined not to take a PR like this without some measurement or statistics that show that it actually helps.
So I’d be inclined not to take a PR like this without some measurement or statistics that show that it actually helps.
Would this do? https://github.com/kuchiki-rs/kuchiki/issues/73#issuecomment-652529953 (see https://github.com/servo/html5ever/issues/420#issuecomment-644236225 to replicate)
This analysis is definitely too simplistic because it ignores the memory usage of Box allocations. Yes Box<RefCell
> is 8 bytes but the 32 bytes for RefCell are still stored somewhere. In some cases adding a pointer indirection can increase total memory usage, because the 8 bytes for that pointer also need to be stored.
This means that Text
now uses a total of 40 bytes of memory, but that's still half of the 80 it was taking up before. All of the variants besides Element
Doctype
now use less memory overall than they did before. Element Doctype has now grown from 80 -> 88 which seems like a reasonable tradeoff.
Here is the size of the fields of each variant as of 596ecac60542546f18de5ff21072713f2ae33ec7. (For the size of the enum we need to add the discriminant, which is 8 bytes with padding because of field alignment.) I think this should have been the very first step of this discussion.
assert_eq!(64, size_of::<crate::ElementData>());
assert_eq!(32, size_of::<std::cell::RefCell<String>>()); // Text, Comment
assert_eq!(56, size_of::<std::cell::RefCell<(String, String)>>()); // ProcessingInstruction
assert_eq!(72, size_of::<crate::Doctype>());
assert_eq!(1, size_of::<crate::DocumentData>());
I’m gonna claim without backing data that elements and text are very common, comments somewhat uncommon, and everything else exceptional. (Processing instructions are an XML thing and are not generated by the HTML parser. Each document typically has one document node, and zero or one doctype.)
For uncommon enum variants we care about the direct size_of
which increases the padding for other variants. For common variants we care about the total memory use, including separate allocations.
With this I think it becomes apparent that boxing doctypes is a win. Processing instructions as well, if we get element data below 56 bytes.
For elements and text I think it’s much less obvious, since they are common. But I think we can explore ideas other than just adding Box
in a few places.
For example, how common are elements with zero attributes? If it’s more than 33% it would be a win to replace Attributes
in ElementData
with Option<Box<Attributes>>
, and store None
for those elements. However that’s ideally an implementation detail hidden from the public API.
assert_eq!(24, size_of::<crate::Attributes>());
Ultimately though, if memory usage for large documents is important enough you should consider using something other than Kuchiki.
In my view Kuchiki is mostly a proof-of-concept to show what integrating the html5ever
with and selectors
crates look like. Designing a tree data structure for this is full of trade-offs. I tried to make Kuchicki is not terrible at any use case, but that means it’s also often not great.
In particular:
&'arena Node
or a per-document Vec
and integer indices. Rc<Node>
has 16 bytes overhead per node.RefCell
s. Those have 8 bytes overhead each.RefCell
with Cell
also removes that overhead at the cost of a less convenient API. Cell::update
is still unstable but it’s fairly simple and doesn’t use unsafe
, so duplicating it (or a variation of it that relies on T: Default
and Cell::take
instead of T: Copy
and Cell::get
) is easy and not risky.children: Vec<NodeRef>
or children: Box<[NodeRef]>
instead of five links for parent/previous sibling/next sibling/first child/last child.children
sequences, forming the path from the root of the tree to a given node.And I’m sure there are other interesting data structure ideas I haven’t thought of.
I think this should have been the very first step of this discussion.
That would have been a good start, sorry about that.
With this I think it becomes apparent that boxing doctypes is a win. Processing instructions as well, if we get element data below 56 bytes.
🎉 🎉
If memory usage for large documents is important enough you should consider using something other than Kuchiki. In my view Kuchiki is mostly a proof-of-concept.
I understand your viewpoint. I'm not aware of any alternatives to Kuchiki though and it's about 2k lines of code - It'd be a shame to rewrite all that just for docs.rs. I'm not sure what the best way forward is.
@jyn514 To be fair, based on https://github.com/rust-lang/docs.rs/blob/2edd72e2ad317ff8725db1dfe303cb8bc4918255/src/utils/html.rs#L6-L36 there's a lot of kuchiki's implementation that you wouldn't need to replicate. You're basically looking at:
Any remaining bits could be replaced by manually walking the resulting parsed tree to find elements matching head and body, as far as I can see.
Alternatively, https://crates.io/crates/lol_html might be an option that better meets your needs out of the box.
lol_html looks like exactly what we want. Thanks!
This reduces memory usage in NodeData by a factor of 5 (80 -> 16), and usage in Node by a factor of 2 (120 -> 56).
Closes https://github.com/kuchiki-rs/kuchiki/issues/73