kuchiki-rs / kuchiki

(朽木) HTML/XML tree manipulation library for Rust
MIT License
470 stars 54 forks source link

How to get all text of document except text inside script/style/noscript tags? #85

Closed vedantroy closed 1 year ago

vedantroy commented 3 years ago

Kind of a cross-post of this.

I'm trying to get all visible text in a document (text that is not part of script/style/noscript tags).

I've come up with the following algo:

let parser = kuchiki::parse_html().one(content);
for child in parser.inclusive_descendants() {
    if let Some(el) = child.as_element() {
        let tag_name = &el.name.local;
        if tag_name == "script" || tag_name == "style" || tag_name == "noscript" {
            child.detach();
        }
    }
}
let text = parser.text_contents();
println!("{}", text);

However, this doesn't seem to work. parser.text_contents() still returns inline Javascript in style tags.

Am I using the detach API incorrectly?

SimonSapin commented 3 years ago

Are you sure this code doesn’t remove one element, then stops?

The tree traversal code inside inclusive_descendants basically assumes the tree doesn’t change while it’s running. After it’s done with a sub-tree it continues by following the "next sibling" link, and if there isn’t one the parent’s next sibling etc. But if a node has been detached it doesn’t have a next sibling or parent anymore, so the traversal code thinks it has reached the root again and stops.

At a high level this is an iterator invalidation bug. This same kind of bug could happen for example with a Vec if borrowing it through .iter() didn’t prevent taking &mut exclusive references to mutate it while iteration is going on. Here Kuchiki uses shared ownership with reference-counting so there isn’t such a compile-time check, but this is "only" a logic bug and there is no Undefined Behavior.

It’s been a while since I wrote this code, but maybe here https://github.com/kuchiki-rs/kuchiki/blob/f652e38b12cb0d33f7bb0565b6933a6e2823a0c5/src/iter.rs#L288-L290 instead of node.parent().map(…) returning None when there’s no parent, we should use expect to panic because reaching a node without a parent without reaching the next == next_back case above indicates that the tree has been modified like in your case and the traversal is now incorrect. Maybe. I’m not sure. This way your program would panic with a message that explains about mutating the tree while iterating it.

Anyway, back to your issue, consider first accumulating the nodes to detach into a Vec and only after the traversal is done detatching them.

vedantroy commented 3 years ago

I ended up using a recursive function as follows:

fn get_visible_text(root: NodeRef, processed_text: &mut String) {
    for child in root.children() {
        if let Some(el) = child.as_element() {
            let tag_name = &el.name.local;
            if tag_name == "script" || tag_name == "style" || tag_name == "noscript" {
                return;
            }
            get_visible_text(child, processed_text);
        } else if let Some(text_node) = child.as_text() {
            let text = text_node.borrow();
            processed_text.push_str(&text);
        }
    }
}

but I'll try out your solution too.

SimonSapin commented 1 year ago

I will soon archive this repository and make it read-only, so this issue will not be addressed: https://github.com/kuchiki-rs/kuchiki#archived