kuchiki-rs / kuchiki

(朽木) HTML/XML tree manipulation library for Rust
MIT License

XML support? #31

Anonyfox closed 1 year ago

Anonyfox commented 7 years ago

It seems that xml5ever has now merged into html5ever somewhat ... does this mean that I can use kuchiki safely on xml documents (like RSS feeds)?

SimonSapin commented 7 years ago

The source code for xml5ever has moved into the html5ever repository, but they’re still two separate crates. There is no plan to merge the crates, and even if that happened users would still have to choose in the API whether to parse HTML or XML5 (for example based on a Content-Type HTTP header).
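As an illustration of choosing a parser from a Content-Type header, here is a minimal sketch using only the standard library. The `ParserKind` enum and `parser_for_content_type` function are hypothetical names invented for this example and are not part of html5ever, xml5ever, or Kuchiki. The list of recognised MIME types is also an assumption.

```rust
/// Which parser a caller would hand the document to (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ParserKind {
    Html,
    Xml,
}

/// Pick a parser from a Content-Type header value (hypothetical helper).
/// Returns `None` when the media type is neither HTML nor XML.
pub fn parser_for_content_type(content_type: &str) -> Option<ParserKind> {
    // Drop parameters such as "; charset=utf-8" and normalise case.
    let mime = content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    if mime == "text/html" {
        Some(ParserKind::Html)
    } else if mime == "text/xml" || mime == "application/xml" || mime.ends_with("+xml") {
        // The "+xml" suffix covers types such as application/rss+xml.
        // Assumption: application/xhtml+xml is also treated as XML here.
        Some(ParserKind::Xml)
    } else {
        None
    }
}
```

A caller would match on the returned `ParserKind` and invoke whichever parsing entry point it has available for that kind.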

So nothing has changed for XML in Kuchiki: it’s not supported. Multiple XML parsers in Rust exist, so adding support is a matter of integrating one of them.

By the way, xml5ever parses XML5, which is different from XML 1.0 or 1.1. Its syntax is intended to be a superset, but as far as I know no one has used XML5 beyond small experiments, so compatibility with real content is still to be established.

For Kuchiki I think I’d rather pick xml-rs. It self-describes as "mostly XML-1.0-compliant".

Anonyfox commented 7 years ago

Oh, okay. So, would it be possible to parse an XML document with xml-rs into the same NodeRef structures as you do with HTML right now? The API kuchiki provides is really, really great right now, except for one major pain point (1), and I'd love to use kuchiki in exactly the same way for some XML stuff, especially with the CSS3-selector functionality.


(1): getting/setting attributes and their values on Nodes/NodeRefs is very cumbersome currently. I can work around this, but it's a stretch to call it ergonomic. This would be even more pressing when working with XML, I think.

SimonSapin commented 7 years ago

> Oh, okay. So, would it be possible to parse an XML document with xml-rs into the same NodeRef structures as you do with HTML right now?

Yes, exactly. I’d take a PR to add that in Kuchiki.

Regarding attributes: there are probably some convenience methods to add. (Based on owning-ref to deal with RefCell borrows.)
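To illustrate what such convenience methods could look like, here is a minimal sketch over a simplified model. The `Element` type, its `attributes` field, and the `attr`/`set_attr` methods are hypothetical stand-ins and not Kuchiki's actual types or API. The sketch also swaps the owning-ref idea for a simpler one: it returns an owned copy of the value so the `RefCell` borrow ends inside the method. Returning a borrowed value instead would need something like owning-ref or `std::cell::Ref::map`.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;

/// Hypothetical stand-in for an element node; not Kuchiki's real type.
pub struct Element {
    pub attributes: RefCell<BTreeMap<String, String>>,
}

impl Element {
    pub fn new() -> Self {
        Element {
            attributes: RefCell::new(BTreeMap::new()),
        }
    }

    /// Read an attribute. The value is cloned so that the RefCell borrow
    /// is released before this method returns.
    pub fn attr(&self, name: &str) -> Option<String> {
        self.attributes.borrow().get(name).cloned()
    }

    /// Set an attribute through a shared reference, returning the
    /// previous value if there was one.
    pub fn set_attr(&self, name: &str, value: &str) -> Option<String> {
        self.attributes
            .borrow_mut()
            .insert(name.to_string(), value.to_string())
    }
}
```

With helpers like these, callers never hold a `Ref` or `RefMut` themselves, which avoids accidental double-borrow panics at the cost of one string copy per read.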

Ygg01 commented 7 years ago

> By the way, xml5ever parses XML5, which is different from XML 1.0 or 1.1. Its syntax is intended to be a superset, but as far as I know no one has used XML5 beyond small experiments, so compatibility with real content is still to be established.

I'll admit that XML5 isn't as well tested as I'd like it to be. If XML5 is incompatible with XML 1.0 or 1.1 documents (outside the DTD), that's a bug in XML5.

I'd be willing to work on integrating xml5 with kuchiki, even as a branch just to see how it pans out. For xml-rs, I'd need to learn it first.

Anonyfox commented 7 years ago

I don't know much about the different standards here, but to me it looks like fixing and integrating xml5 could be easier, since its internals are modeled after html5, so a lot of code could be reused. Then again, I'm still learning Rust. Either way, I'd be willing to contribute real-world test cases/code and documentation for both XML and HTML parsing.

SimonSapin commented 7 years ago

> a lot of code could be reused

Since the two parser libraries already exist it’s not much code either way.

Ygg01 commented 7 years ago

I think the idea is that it would be easier to write, rather than that much code would be reused. Since html5 and xml5 are similar, it's easier for a novice to write.

In my mind, a bigger blocker for kuchiki is the lack of XPath support, since most people want XPath or a similar technology for selecting nodes. CSS selectors are great, but feel out of place for XML.

quininer commented 7 years ago

I am trying to make kuchiki use xml5ever: https://github.com/quininer/sanngaa

Ygg01 commented 7 years ago

@SimonSapin It seems there is a slight mismatch between xml5ever and html5ever. Namely, xml5ever has Processing Instructions and stores prefixes, but html5ever doesn't.

I have some ideas for how to solve it, but they involve changing how TreeSink and QualName work.
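To make the mismatch concrete, here is a minimal sketch of a name type that can carry an optional prefix and a node payload that includes processing instructions. `Name`, `NodeKind`, and their fields are hypothetical and only illustrate the shape of the problem. They are not the actual `QualName` or `TreeSink` definitions from html5ever or xml5ever, and not a proposal from this thread.

```rust
/// Hypothetical qualified name that can carry an XML prefix.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Name {
    /// For example "atom" in <atom:link>; `None` when no prefix is stored.
    pub prefix: Option<String>,
    pub namespace: String,
    pub local: String,
}

impl Name {
    /// Serialized form: "prefix:local", or just "local" without a prefix.
    pub fn qualified(&self) -> String {
        match &self.prefix {
            Some(p) => format!("{}:{}", p, self.local),
            None => self.local.clone(),
        }
    }
}

/// Hypothetical node payload that covers output from both kinds of parser.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NodeKind {
    Element(Name),
    Text(String),
    /// Corresponds to `<?target data?>` in the source document.
    ProcessingInstruction { target: String, data: String },
}
```

In this model a parser that does not track prefixes would simply leave `prefix` as `None` and never produce the `ProcessingInstruction` variant, so one tree type could hold both kinds of document.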

SimonSapin commented 7 years ago

Yes, TreeSink and QualName are among the things that should probably be shared, as discussed in https://github.com/servo/html5ever/issues/261, even if that means modifying or extending them.

SimonSapin commented 1 year ago

I will soon archive this repository and make it read-only, so this issue will not be addressed: https://github.com/kuchiki-rs/kuchiki#archived