kuchiki-rs / kuchiki

(朽木) HTML/XML tree manipulation library for Rust
MIT License
470 stars 54 forks source link

Box most NodeData variants #74

Closed jyn514 closed 4 years ago

jyn514 commented 4 years ago

This reduces memory usage in NodeData by a factor of 5 (80 -> 16), and usage in Node by a factor of 2 (120 -> 56).

Closes https://github.com/kuchiki-rs/kuchiki/issues/73

SimonSapin commented 4 years ago

This reduces memory usage in NodeData by a factor of 5 (80 -> 16), and usage in Node by a factor of 2 (120 -> 56).

This analysis is definitely too simplistic because it ignores the memory usage of Box allocations. Yes Box<RefCell<String>> is 8 bytes but the 32 bytes for RefCell<String> are still stored somewhere. In some cases adding a pointer indirection can increase total memory usage, because the 8 bytes for that pointer also need to be stored.

The idea of boxing some enum variants is to reduce unused padding for other, smaller variants. The overall effect is very dependent on which variants are more frequent in some concrete use case. Here DocumentFragment is likely the variant that profits the most from this PR, but I suspect it’s very uncommon in real-world document.

I suspect that unfortunately in this enum the largest variant, Element, is typically also one of the most frequent. So I’d be inclined not to take a PR like this without some measurement or statistics that show that it actually helps.

jyn514 commented 4 years ago

So I’d be inclined not to take a PR like this without some measurement or statistics that show that it actually helps.

Would this do? https://github.com/kuchiki-rs/kuchiki/issues/73#issuecomment-652529953 (see https://github.com/servo/html5ever/issues/420#issuecomment-644236225 to replicate)

jyn514 commented 4 years ago

This analysis is definitely too simplistic because it ignores the memory usage of Box allocations. Yes Box<RefCell> is 8 bytes but the 32 bytes for RefCell are still stored somewhere. In some cases adding a pointer indirection can increase total memory usage, because the 8 bytes for that pointer also need to be stored.

This means that Text now uses a total of 40 bytes of memory, but that's still half of the 80 it was taking up before. All of the variants besides Element Doctype now use less memory overall than they did before. Element Doctype has now grown from 80 -> 88 which seems like a reasonable tradeoff.

SimonSapin commented 4 years ago

Here is the size of the fields of each variant as of 596ecac60542546f18de5ff21072713f2ae33ec7. (For the size of the enum we need to add the discriminant, which is 8 bytes with padding because of field alignment.) I think this should have been the very first step of this discussion.

    assert_eq!(64, size_of::<crate::ElementData>());
    assert_eq!(32, size_of::<std::cell::RefCell<String>>()); // Text, Comment
    assert_eq!(56, size_of::<std::cell::RefCell<(String, String)>>());  // ProcessingInstruction
    assert_eq!(72, size_of::<crate::Doctype>());
    assert_eq!(1, size_of::<crate::DocumentData>());

I’m gonna claim without backing data that elements and text are very common, comments somewhat uncommon, and everything else exceptional. (Processing instructions are an XML thing and are not generated by the HTML parser. Each document typically has one document node, and zero or one doctype.)

For uncommon enum variants we care about the direct size_of which increases the padding for other variants. For common variants we care about the total memory use, including separate allocations.

With this I think it becomes apparent that boxing doctypes is a win. Processing instructions as well, if we get element data below 56 bytes.

For elements and text I think it’s much less obvious, since they are common. But I think we can explore ideas other than just adding Box in a few places.

For example, how common are elements with zero attributes? If it’s more than 33% it would be a win to replace Attributes in ElementData with Option<Box<Attributes>>, and store None for those elements. However that’s ideally an implementation detail hidden from the public API.

    assert_eq!(24, size_of::<crate::Attributes>());

Ultimately though, if memory usage for large documents is important enough you should consider using something other than Kuchiki.

In my view Kuchiki is mostly a proof-of-concept to show what integrating the html5ever with and selectors crates look like. Designing a tree data structure for this is full of trade-offs. I tried to make Kuchicki is not terrible at any use case, but that means it’s also often not great.

In particular:

And I’m sure there are other interesting data structure ideas I haven’t thought of.

jyn514 commented 4 years ago

I think this should have been the very first step of this discussion.

That would have been a good start, sorry about that.

With this I think it becomes apparent that boxing doctypes is a win. Processing instructions as well, if we get element data below 56 bytes.

🎉 🎉

If memory usage for large documents is important enough you should consider using something other than Kuchiki. In my view Kuchiki is mostly a proof-of-concept.

I understand your viewpoint. I'm not aware of any alternatives to Kuchiki though and it's about 2k lines of code - It'd be a shame to rewrite all that just for docs.rs. I'm not sure what the best way forward is.

jdm commented 4 years ago

@jyn514 To be fair, based on https://github.com/rust-lang/docs.rs/blob/2edd72e2ad317ff8725db1dfe303cb8bc4918255/src/utils/html.rs#L6-L36 there's a lot of kuchiki's implementation that you wouldn't need to replicate. You're basically looking at:

Any remaining bits could be replaced by manually walking the resulting parsed tree to find elements matching head and body, as far as I can see.

jdm commented 4 years ago

Alternatively, https://crates.io/crates/lol_html might be an option that better meets your needs out of the box.

jyn514 commented 4 years ago

lol_html looks like exactly what we want. Thanks!