utkarshkukreti / select.rs

A Rust library to extract useful data from HTML documents, suitable for web scraping.
MIT License
971 stars 69 forks source link

Allow boxing Predicate by removing Sized requirement #35

Open jmcomets opened 7 years ago

jmcomets commented 7 years ago

Currently the Predicate trait requires Sized, which IIRC prevents it from being made into an object. As far as I know, there is no current workaround for this.

From what I can tell, the only reason for this seems to be predicate composition and(), or(), etc. matches() is the only method required for the actual document searching.

utkarshkukreti commented 7 years ago

Thanks for the report and pull request @jmcomets! What do you think of this solution? Do you think it solves this issue? If so, I'll merge that one because it doesn't add a new trait.

jmcomets commented 7 years ago

Nice going with those where clauses, I always forget you can add them on default trait methods. :)

Your solution looks better and works for me. +1

Footnote: this is not yet useful, as the various searching methods take an owned predicate, not a reference/cow.

utkarshkukreti commented 7 years ago

Thanks, I've pushed my solution to master.

Footnote: this is not yet useful, as the various searching methods take an owned predicate, not a reference/cow.

Can you explain what you're trying to do? Maybe there is a solution we can come up with based on that.

jmcomets commented 7 years ago

Here's something similar to what I'm trying to do:

let document = Document::from(r#"
<body>
    <link href="test.css"/>
    <script>test</script>
    <img src="foo.jpg"/>
</body>"#);

let sources: Vec<(_, Box<Predicate>)> = vec![
    ("image", Box::new(Name("img").and(Attr("src", ())))),
    ("script", Box::new(Name("script"))),
    ("stylesheet", Box::new(Or(Name("link").and(Attr("href", ())), Name("style")))),
];

for (label, predicate) in sources {
    let elements = document.find(predicate); // won't work, find expects a `P: Predicate`
}
utkarshkukreti commented 7 years ago

Ah. My first thought was to implement Predicate for &T where T: Predicate but unfortunately it conflicts with the F: FnMut(&Node) -> bool implementation. I solved a very similar problem in munch.rs by creating a new struct ByRef and using it when I wanted to use the trait from a reference. What do you think about implementing a similar solution here?

jmcomets commented 7 years ago

I tried that as well and was pretty surprised by the conflict with F: FnMut(&Node) -> bool. Did you find out why it's ambiguous?

Michael-F-Bryan commented 7 years ago

Not being able to box a Predicate is a massive pain when you want to make things reusable. For example, I'm currently wanting to inspect a tree-like structure of nested ul tags and want to be able to make a predicate which matches the elements 1, 2, or 3 levels down without having to copy/paste everywhere.

    let children_of_children_of_children: Vec<String> = document
        .find(
            Class("chapter")
                .descendant(Name("li"))
                .descendant(Name("li"))
                .descendant(Name("li"))
                .descendant(Name("a")),
        )
        .map(|elem| elem.text().trim().to_string())
        .collect();

My "solution" was to create a nested() function which recursively adds on a descendant(Name("li")), decrementing level each time until it reaches 0.

fn nested<P1, P2>(root: P1, level: usize, child: P2) -> impl Predicate
where
    P1: Predicate,
    P2: Predicate + Clone,
{
    if level == 0 {
        root.descendant(Name("a"))
    } else {
        let pred = Descendant(root, child.clone());
        nested(pred, level - 1, clone) 
    }
}

Of course this doesn't compile because the branches of the if-else statement have different types. The typical solution would be to return a boxed trait object, but that doesn't compile either.

You end up with strange compilation errors because the compiler tries to infer the return type to be the FnMut() and starts complaining about lifetime issues. This happens when you try boxing instead of using impl Trait too.