kuchiki-rs / kuchiki

(朽木) HTML/XML tree manipulation library for Rust
MIT License
470 stars 54 forks source link

Question on removing whitespace text nodes #78

Closed hipstermojo closed 1 year ago

hipstermojo commented 4 years ago

I have tried removing text nodes containing only whitespace using

let node = kuchiki::parse_html().one(html_str);
node.descendants().filter(|x| match x.data(){
   NodeData::Text(t) => t.borrow().trim().is_empty(),
   _ => false,
}).for_each(|x| x.detach());

however, the text nodes still end up being there when I later traverse the nodes. Is there a better way of handling this?

jdm commented 3 years ago

When I run the following program:

use kuchiki::traits::TendrilSink;
use kuchiki::NodeData;

fn main() {
    let node = kuchiki::parse_html().one("<div>a</div> <span></span>");
    node.descendants().filter(|x| match x.data(){
       NodeData::Text(t) => t.borrow().trim().is_empty(),
       _ => false,
    }).for_each(|x| x.detach());
    for n in node.inclusive_descendants() {
    println!("{:?}", n);
    }
}

I get:

NodeRef(Document(DocumentData { _quirks_mode: Cell { value: Quirks } }) @ 0x7fe053405bd0)
NodeRef(Element(ElementData { name: QualName { prefix: None, ns: Atom('http://www.w3.org/1999/xhtml' type=static), local: Atom('html' type=static) }, attributes: RefCell { value: Attributes { map: {} } }, template_contents: None }) @ 0x7fe053406010)
NodeRef(Element(ElementData { name: QualName { prefix: None, ns: Atom('http://www.w3.org/1999/xhtml' type=static), local: Atom('head' type=static) }, attributes: RefCell { value: Attributes { map: {} } }, template_contents: None }) @ 0x7fe0534060c0)
NodeRef(Element(ElementData { name: QualName { prefix: None, ns: Atom('http://www.w3.org/1999/xhtml' type=static), local: Atom('body' type=static) }, attributes: RefCell { value: Attributes { map: {} } }, template_contents: None }) @ 0x7fe053406150)
NodeRef(Element(ElementData { name: QualName { prefix: None, ns: Atom('http://www.w3.org/1999/xhtml' type=static), local: Atom('div' type=static) }, attributes: RefCell { value: Attributes { map: {} } }, template_contents: None }) @ 0x7fe0534061e0)
NodeRef(Text(RefCell { value: "a" }) @ 0x7fe053406280)
NodeRef(Element(ElementData { name: QualName { prefix: None, ns: Atom('http://www.w3.org/1999/xhtml' type=static), local: Atom('span' type=static) }, attributes: RefCell { value: Attributes { map: {} } }, template_contents: None }) @ 0x7fe0534063b0)

Which includes no empty text nodes. Could you provide a complete example that demonstrates the problem you see?

hipstermojo commented 3 years ago

I found that the problem was actually concerning removing nodes in general. The code stops detaching nodes after the first one is detached if you use a for loop. You can just close this issue. Sorry I forgot to close it myself

SimonSapin commented 3 years ago

I’d expect that detaching the current node would interact badly with the tree traversal algorithm used by .descendants().

The code stops detaching nodes after the first one is detached if you use a for loop.

Indeed it might even stop iterating entirely, as if it had finished going through the whole (sub)tree.

Consider using .descendants().filter(…).collect::<Vec<_>>() to first find all the nodes to remove, and only then call .detach() on them.

The docs for descendants and friends should probably be improved to mention this surprising behavior and the work-around.

hipstermojo commented 3 years ago

That sounds great. Personally my workaround was to use while loops where I set the next element to be accessed before detaching. I found this to be the only way that worked for me. Especially if deletion wasn't the only thing I was doing in a given section of code. I'm still iffy about using detach on any other loop.

SimonSapin commented 1 year ago

I will soon archive this repository and make it read-only, so this issue will not be addressed: https://github.com/kuchiki-rs/kuchiki#archived