kuchiki-rs / kuchiki

(朽木) HTML/XML tree manipulation library for Rust
MIT License

XML serializing of NodeRef #82

Closed hipstermojo closed 1 year ago

hipstermojo commented 3 years ago

While it's clear that Kuchiki is only interested in parsing and serializing HTML files, is there a way to serialize a NodeRef to XML-compliant output?

Sorry that this is off topic.

SimonSapin commented 3 years ago

I think XML parsing and serializing would definitely be in scope, it’s just that no one has gotten around to implementing it.

In the meantime you can do it yourself in a crate that depends on both kuchiki and, for example, xml-rs. Serialization in particular is relatively easy: create an XML "writer" and emit events while traversing the tree with subtree_root_node.traverse_inclusive().

hipstermojo commented 3 years ago

Seems like an interesting task to attempt. The only reason I need this is so I can have self-closing tags in the serialized file. With the XmlEvents, do you know how I'd express that? For example, I want to serialize an img element:

<img src="./foo.jpg"/>

Or should I ask the xml-rs maintainers?

SimonSapin commented 3 years ago

It looks like xml-rs defaults to emitting a self-closing tag for an element with no content (and you can configure it): https://docs.rs/xml-rs/0.8.3/xml/struct.EmitterConfig.html#structfield.normalize_empty_elements

But it’s done indiscriminately. You can’t choose to do it for <img> and not other elements, as your example maybe implies. Are you trying to emit XHTML? Does the consumer of what you emit use an actual XML parser rather than an HTML parser? Do you want to emit a "polyglot" document that can be parsed as HTML or XHTML into the same DOM tree? That last one is rather subtle, and xml-rs may not be appropriate. (But also: why?)

hipstermojo commented 3 years ago

Well, I need kuchiki to serialize my NodeRef to XHTML so that I can build an EPUB from it. I actually have no issue with using it as is, where the serialized file is really HTML, since some EPUB readers are more lenient when parsing. But in some readers, like Foliate, parsing stops because tags that must be self-closing aren't.

To address this, I wrote a pretty bad regex that rewrites img tags to be self-closing, which is very slow and hacky.

SimonSapin commented 3 years ago

Ok, I don’t know much about the EPUB format and its reader compatibility issues.

If you’re resorting to regexes, consider not using xml-rs and writing your own serializer instead, based on traverse_inclusive and writing bytes (presumably in UTF-8) to std::io::Write. It’s not very complex and you’ll have full control over the details of the XML syntax used and the ability to add as many special cases as needed.
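As a rough sketch of what "full control" could look like at the byte level, the helpers below escape text and attribute values and write a start tag to any std::io::Write. The function names (escape_text, escape_attr, write_start_tag) are made up for illustration and are not part of kuchiki or xml-rs; the sketch uses only the standard library.

```rust
use std::io::{self, Write};

/// Escape character data. `&` and `<` must be escaped in XML text;
/// escaping `>` as well is optional but harmless.
fn escape_text(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(c),
        }
    }
    out
}

/// Escape an attribute value that will be wrapped in double quotes.
fn escape_attr(s: &str) -> String {
    escape_text(s).replace('"', "&quot;")
}

/// Write a start tag. `self_close` chooses `<img .../>` over `<img ...>`,
/// so the caller decides per element (hypothetical helper, not a kuchiki API).
fn write_start_tag<W: Write>(
    out: &mut W,
    name: &str,
    attrs: &[(&str, &str)],
    self_close: bool,
) -> io::Result<()> {
    write!(out, "<{}", name)?;
    for (key, value) in attrs {
        write!(out, " {}=\"{}\"", key, escape_attr(value))?;
    }
    out.write_all(if self_close { b"/>" } else { b">" })
}
```

Calling write_start_tag with name "img", a single ("src", "./foo.jpg") attribute, and self_close set to true writes the img tag from the example earlier in the thread.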

hipstermojo commented 3 years ago

I'm really looking to get rid of the regex so I could try and use the approach you've suggested of serializing with traverse_inclusive. I'd also be avoiding a whole crate for one job in the process. Last time I used traverse_inclusive, it was not as fun but I had just started using Kuchiki at the time. Would looking at the source code for serialize_to_file be a good starting point?

SimonSapin commented 3 years ago

.traverse_inclusive() returns an iterator of an enum that indicates either the start or end of a tree node. Then, looking at that node, you’d need to match on .data() to handle the various kinds of nodes.
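To illustrate the start/end shape of that iteration without depending on kuchiki, here is a sketch over a stand-in tree. The Node and Edge types and this traverse_inclusive function are simplified assumptions that only mirror the idea; they are not kuchiki's real types, and attributes and text escaping are left out to keep it short.

```rust
use std::io::{self, Write};

// Stand-in tree type for illustration only; NOT kuchiki's NodeRef.
enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

// One edge per node entry and one per node exit, in document order.
#[derive(Clone, Copy)]
enum Edge<'a> {
    Start(&'a Node),
    End(&'a Node),
}

/// Flatten a subtree (root included) into start/end edges.
/// Uses an explicit stack instead of recursion.
fn traverse_inclusive(root: &Node) -> Vec<Edge<'_>> {
    let mut edges = Vec::new();
    let mut stack = vec![Edge::Start(root)];
    while let Some(edge) = stack.pop() {
        if let Edge::Start(node) = edge {
            // The End edge is popped only after all descendants.
            stack.push(Edge::End(node));
            if let Node::Element { children, .. } = node {
                // Push in reverse so the first child is popped first.
                for child in children.iter().rev() {
                    stack.push(Edge::Start(child));
                }
            }
        }
        edges.push(edge);
    }
    edges
}

/// Serialize by matching on each edge and on the kind of node it refers to.
/// Text is written as-is here; a real serializer would escape it.
fn serialize_xml<W: Write>(root: &Node, out: &mut W) -> io::Result<()> {
    for edge in traverse_inclusive(root) {
        match edge {
            Edge::Start(Node::Element { name, children }) => {
                if children.is_empty() {
                    write!(out, "<{}/>", name)?; // empty element: self-close
                } else {
                    write!(out, "<{}>", name)?;
                }
            }
            Edge::End(Node::Element { name, children }) => {
                if !children.is_empty() {
                    write!(out, "</{}>", name)?;
                }
            }
            Edge::Start(Node::Text(text)) => out.write_all(text.as_bytes())?,
            Edge::End(Node::Text(_)) => {}
        }
    }
    Ok(())
}
```

With kuchiki itself, the same match would presumably be driven by the edges that .traverse_inclusive() yields and by the node's .data(), as described above.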

Instead of .traverse_inclusive() you could also iterate over .children() of each node and make your function recursive.
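The recursive variant could look roughly like the sketch below. Again the Node type is a stand-in and not kuchiki's API, and the VOID_ELEMENTS list is a deliberately incomplete assumption used to show a per-element special case: only listed elements are self-closed, while other empty elements keep an explicit end tag.

```rust
use std::io::{self, Write};

// Stand-in tree type for illustration only; NOT kuchiki's NodeRef.
enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

// A few HTML void elements; intentionally incomplete (assumption).
const VOID_ELEMENTS: &[&str] = &["br", "hr", "img", "input", "link", "meta"];

/// Serialize a node, then recurse into its direct children.
/// Recursion depth equals the depth of the tree.
/// Text is written as-is here; a real serializer would escape it.
fn serialize_recursive<W: Write>(node: &Node, out: &mut W) -> io::Result<()> {
    match node {
        Node::Text(text) => out.write_all(text.as_bytes()),
        Node::Element { name, children } => {
            // Special case: self-close only childless void elements.
            if children.is_empty() && VOID_ELEMENTS.contains(&name.as_str()) {
                return write!(out, "<{}/>", name);
            }
            write!(out, "<{}>", name)?;
            for child in children {
                serialize_recursive(child, out)?;
            }
            // The work after the recursive calls is why this is not tail-recursive.
            write!(out, "</{}>", name)
        }
    }
}
```

The trade-off against the edge-based loop is mostly one of taste: the recursive form keeps the start and end tag of an element in one place, at the cost of one stack frame per level of nesting.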

hipstermojo commented 3 years ago

Last I checked, Rust did not have tail call optimization yet, so I'll probably stick with traverse_inclusive in this case. I'd like to keep this issue open just in case I have other issues regarding serialization.

SimonSapin commented 3 years ago

(Tree traversal based on iterating direct children + recursion is not tail-recursive anyway.)

hipstermojo commented 3 years ago

I tried using traverse_inclusive and it works just fine. Yay! I'm not certain that the generated content is fully XHTML-compliant, but that is for me to find out. I think this can be closed unless someone needs it for reference.

SimonSapin commented 1 year ago

I will soon archive this repository and make it read-only, so this issue will not be addressed: https://github.com/kuchiki-rs/kuchiki#archived