utkarshkukreti / select.rs

A Rust library to extract useful data from HTML documents, suitable for web scraping.
MIT License
971 stars 69 forks

Add the ability to remove nodes. #41

Open XAMPPRocky opened 7 years ago

XAMPPRocky commented 7 years ago

BeautifulSoup has the ability to remove nodes from the parsed tree. This is valuable for removing certain kinds of text or elements from the extracted text.

utkarshkukreti commented 7 years ago

Yes, I would definitely like to have this feature. Unfortunately, this would require major changes in the internals of the crate and I'm not sure what a good design would look like at this point.

Right now Document has a vector of node::Row, and node::Node has a reference to the Document and a usize index. This means that allowing nodes to be removed or inserted will require some kind of arena-like structure, so that removed spots are available for reuse by nodes inserted later. We'll also have to stop storing a reference to Document in Node so that the Document can be mutated while one of its Nodes exists.

We could have Node be just an index, like the petgraph crate does, but that would make many current APIs verbose, e.g. document[node].text() instead of node.text(). Or we could just wrap everything in Rc<RefCell<>>, but I'd like to avoid that if at all possible.
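To make the arena idea above concrete, here is a minimal sketch of a slot vector with a free list and index-only handles. It is not select.rs code: Document, Slot, NodeId, insert, remove and text are hypothetical names chosen for illustration, and each node holds only a string rather than real HTML data.

```rust
/// Hypothetical handle: just an index, with no borrow of the document.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct NodeId(usize);

/// One spot in the arena (hypothetical, not select.rs's node::Row).
#[derive(Debug)]
enum Slot {
    Occupied { text: String },
    /// A removed spot; `next_free` links to the next removed spot, if any.
    Free { next_free: Option<usize> },
}

#[derive(Debug, Default)]
struct Document {
    slots: Vec<Slot>,
    /// Head of the free list of removed spots.
    free_head: Option<usize>,
}

impl Document {
    /// Insert a node, reusing a removed spot when one is available.
    fn insert(&mut self, text: &str) -> NodeId {
        let slot = Slot::Occupied { text: text.to_string() };
        match self.free_head {
            Some(index) => {
                // Pop the spot off the free list before overwriting it.
                if let Slot::Free { next_free } = &self.slots[index] {
                    self.free_head = *next_free;
                }
                self.slots[index] = slot;
                NodeId(index)
            }
            None => {
                self.slots.push(slot);
                NodeId(self.slots.len() - 1)
            }
        }
    }

    /// Remove a node and push its spot onto the free list.
    /// Returns the removed text, or None if the spot was not occupied.
    fn remove(&mut self, id: NodeId) -> Option<String> {
        if !matches!(self.slots.get(id.0), Some(Slot::Occupied { .. })) {
            return None;
        }
        let freed = Slot::Free { next_free: self.free_head };
        let old = std::mem::replace(&mut self.slots[id.0], freed);
        self.free_head = Some(id.0);
        match old {
            Slot::Occupied { text } => Some(text),
            Slot::Free { .. } => None,
        }
    }

    /// Look up a node's text; this is the verbose `document.text(node)` shape.
    fn text(&self, id: NodeId) -> Option<&str> {
        match self.slots.get(id.0) {
            Some(Slot::Occupied { text }) => Some(text.as_str()),
            _ => None,
        }
    }
}
```

Because NodeId holds no reference, the document can be mutated while handles exist, at the cost of passing the document to every call. A stale handle can also end up pointing at a reused spot; a real design would probably need something like a generation counter per slot to detect that.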

I'm open to suggestions!

sbeckeriv commented 4 years ago

Would it be hard to just blank out the contents of the node? Or have a node::RemovedNode?

[edit] Soft deletes: https://github.com/sbeckeriv/select.rs/commit/da9b2451a54bd2ceeef61a23630cb689d958c44d I did not read all of the code to understand why this is a bad idea; it is just a proof of concept for my needs. [edit] It is not working as I would expect:

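    for mut node in &mut document
        .find(select::predicate::Name("noscript"))
        .borrow_mut()
    {
        // delete() is the soft-delete method from the fork linked above,
        // not part of upstream select.rs.
        node.delete();
        dbg!(node);
    }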

node shows deleted as true here, but when the text() function is called the node is not marked as deleted. [edit] It might have worked; my document wasn't initially declared as mut. I moved to a local version that takes the index numbers of the nodes I want to drop and skips them in the text view. https://github.com/sbeckeriv/select.rs/commit/2bb9c9d9edddf4c593ea624c2e2147c92a7f0b08#diff-af08c3181737aa5783b96dfd920cd5ef70829f46cd1b697bdb42414c97310e13R143 I moved the function out of my fork and now have a local text function.
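The skip-by-index approach described above can be sketched roughly as follows. This is a self-contained toy and not the code from the linked commit: Data, Document, find_by_name and text_skipping are hypothetical names, and the flat vector of nodes only loosely imitates how select.rs stores a document.

```rust
use std::collections::HashSet;

/// Toy node data (hypothetical): text, or an element with child indices.
enum Data {
    Text(String),
    Element { name: String, children: Vec<usize> },
}

/// Toy document (hypothetical): nodes live in one flat vector.
struct Document {
    nodes: Vec<Data>,
}

impl Document {
    /// Collect the indices of all elements with the given tag name.
    fn find_by_name(&self, wanted: &str) -> HashSet<usize> {
        self.nodes
            .iter()
            .enumerate()
            .filter_map(|(index, data)| match data {
                Data::Element { name, .. } if name.as_str() == wanted => Some(index),
                _ => None,
            })
            .collect()
    }

    /// Concatenate the text under `index`, leaving out every node listed
    /// in `skip` together with its whole subtree.
    fn text_skipping(&self, index: usize, skip: &HashSet<usize>) -> String {
        let mut out = String::new();
        self.collect(index, skip, &mut out);
        out
    }

    fn collect(&self, index: usize, skip: &HashSet<usize>, out: &mut String) {
        if skip.contains(&index) {
            return; // treated as removed: contributes no text
        }
        match &self.nodes[index] {
            Data::Text(text) => out.push_str(text),
            Data::Element { children, .. } => {
                for &child in children {
                    self.collect(child, skip, out);
                }
            }
        }
    }
}
```

Nothing is mutated here: the document stays intact and the "removal" only affects how text is gathered, which is why this works without changing the crate's internals.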