rust-scraper / scraper

HTML parsing and querying with CSS selectors
https://docs.rs/scraper
ISC License
1.94k stars 109 forks source link

Select::parse fails due to borrow the css query #188

Open latot opened 3 months ago

latot commented 3 months ago

Hi, I has been using this library from a while, has been great.

Here a example:

fn a(document: &Html) -> Result<(), Box<dyn Error>> {
    let query = format!("div table thead tr:nth-child({}) th", 3);
    let selector = Selector::parse(&query)?;
    Ok(())
}

This function will fails, the reason is pretty simple, the function owns the query, but the error triggered by parse seems to have borrowed the string, so it can't return an error when the function owns the data.

The workaround right now is unwrap instead use error propagation.

There is some ways to handle this, one would be a parse with String which owns the data, other one is instead use a borrowed string for the error, convert &str to String and use that instead.

Thx!

adamreichold commented 3 months ago

You can continue to use error propagation, you just need to downgrade the to a string, e.g.

let selector = Selector::parse(&query).map_err(|err| err.to_string())?;

but note that for selectors known at compile-time, .unwrap() is somewhat reasonable as failing to parse such a selector is a logic error instead of a runtime error. (For your example of a dynamically generated selector, one would have to consider how much leeway there is, i.e. if you really only interpolate an u8 into the nth-child part, then .unwrap() might still be reasonable as calling code should be unable to produce an invalid selector by choosing the value of the u8.)

latot commented 3 months ago

Hi! thx for the answer.

Well, is not ideal change the error type for the error propagation, would be good be able to return the actual one.

unwrap, I think it would depends on the type of app you are constructing, if is a one that only you will use, is more reasonable to panic, just for a user end product we want to handle the error and do not to panic.

adamreichold commented 3 months ago

unwrap, I think it would depends on the type of app you are constructing, if is a one that only you will use, is more reasonable to panic, just for a user end product we want to handle the error and do not to panic.

This is not the distinction I am making: A CSS selector that is fixed when the program is built is not a runtime error but a construction error, i.e. a panic (that should never happen once the program is shipped) is entirely appropriate.

(Think of it this way, we could with sufficient effort create a proc macro which parses such a CSS selector during compilation and embeds some already validated form of it into the program so that no parsing at all needs to happen when it runs. Alas, just using .unwrap() is a convenient way to get almost the same result with very much less implementation effort.)

latot commented 3 months ago

I agree, this is because there is at least two types of CSS selectors.

I one hand, the ones that can be checked at compile time, here unwrap is good, and the best solution is a macro as you says, this could be a nice feature :)

The next one is more complex css selectors, where the string is created by some algorithm, which is impossible to be checked by at compile time, can only be checked at runtime, there the error must be propagated, ideally the original not, not a string for a good error handling.

LoZack19 commented 2 months ago

The idea of a macro is indeed interesting, I will leave this issue open in order to remember it. @cfvescovo

cfvescovo commented 2 months ago

Take a look at #53

cfvescovo commented 2 months ago

Since we had more requests for a macro implementation, maybe it's now time to add it to scraper

LoZack19 commented 2 months ago

53 is interesting because it explores the idea of a proc-macro. I think that, if it worked, it could be a cleaner solution than what we have right now. In that discussion, though, concerns had been raised about the necessity of implementing such a macro

cfvescovo commented 2 months ago

53 is interesting because it explores the idea of a proc-macro. I think that, if it worked, it could be a cleaner solution than what we have right now. In that discussion, though, concerns had been raised about the necessity of implementing such a macro

At the time, I chose to close #53 because another crate was actively being developed to address the issue. Now the repo of that crate is inactive.

I think we should provide an "official" implementation so that we can maintain and improve it over time. Moreover, it could be useful for beginners who don't want to roll out their own macro

LoZack19 commented 2 months ago

This claude-generated solution looks very clean and idiomatic to me, even much better than the one provided in the crate scraper-proc-macro

use proc_macro::TokenStream;
use quote::quote;
use syn::{parse_macro_input, LitStr};

#[proc_macro]
pub fn selector(input: TokenStream) -> TokenStream {
    let selector = parse_macro_input!(input as LitStr).value();

    match scraper::Selector::parse(&selector) {
        Ok(_) => quote!(
            ::scraper::Selector::parse(#selector).unwrap()
        ).into(),
        Err(e) => syn::Error::new(
            proc_macro2::Span::call_site(),
            format!("Failed to parse CSS selector: {}", e)
        ).to_compile_error().into(),
    }
}
LoZack19 commented 2 months ago

If integrated, would you put it inside the selector module or in a dedicated macro module? It could also be integrated as an additional macro feature

teymour-aldridge commented 2 months ago

If integrated, would you put it inside the selector module or in a dedicated macro module? It could also be integrated as an additional macro feature

I imagine it would sit behind a feature flag in a module? The implementation would probably be in a separate crate.

LoZack19 commented 2 months ago

I was reading that a proc_macro needs to live necessarely in a separate crate, so yes, this is true, it would be quite invasive as an addition though, but yet useful

LoZack19 commented 2 months ago

In order to avoid cyclic crate dependencies it would be better to ship scraper_proc_macros separately. Otherwise scraper would have to depend on it, which is not possible. Correct me if I'm wrong.

teymour-aldridge commented 2 months ago

In order to avoid cyclic crate dependencies it would be better to ship scraper_proc_macros separately. Otherwise scraper would have to depend on it, which is not possible. Correct me if I'm wrong.

It is possible, it will just require scraper_proc_macros to be published before scraper when doing releases in future?

adamreichold commented 2 months ago

This claude-generated solution looks very clean and idiomatic to me, even much better than the one provided in the crate scraper-proc-macro

I am not sure if I find that proc macro particularly helpful as while it does verify at compile time, it will just compile the selector again at runtime. If the dependency tree is sufficiently messy, it could even be that the verification at runtime uses a different version of the scraper crate and compilation at runtime still fails (admittedly a somewhat unlikely scenario).

I think what we should aim for is a proc macro that actually outputs a representation (maybe via serialization) of the already parsed selector so that work is not repeated at runtime.

I also think the span in the error case should tightened to indicate only the literal, not the whole macro invocation though.

cfvescovo commented 2 months ago

The macro should generate a selector representation directly, without requiring parsing at run time. Moreover, while it's true that the basic approach would require cyclic dependencies - which, as you now, are not allowed - the issue can be solved by writing three crates: selectors_core, selectors_macros (which will depend on core) and selectors, which will depend on both. Obviously, as this sounds a bit overcomplicated, feel free to suggest more straightforward approaches

cfvescovo commented 2 months ago

A simpler solution would be writing our macros in a different crate and let our users depend on this new crate in order to use them. However, I would prefer exposing our new macros directly from scraper behind a feature flag

LoZack19 commented 2 months ago

The macro should generate a selector representation directly, without requiring parsing at run time.

I confirm that the macro provided by Claude executes the parser at runtime, but it then re-executes it at compile time, so the concern that @adamreichold is raising is real, even though I personally thing that such a scenario would be highly unlikely, and it would still be an improvement over not having the macro at all.

LoZack19 commented 2 months ago

The selectors_core, selectors_macros solution would eliminate the version coherence problem between the compile time and the runtime versions of Selector::parse(). Therefore I would prefer this solution over the one adopting indipendent crates.

latot commented 2 months ago

I still does not know much about macros, but it should only accept strings, not variables as inputs, or will cause the error posted above (first post).

Maybe after this, create a issue for css selectors from variables.

LoZack19 commented 2 months ago

Unfortunately, since SelectorsList does not implement ToTokens, we cannot reuse the result computed by parse at complite time. We would need to submit a PR to the maintainers of the selectors crate, for this feature to be implemented. As of now, I would still like to have the proc macro which recomputes the selector at runtime, but if we choose not to do that, I would close this issue.

We could also probably implement the parse logic inside a normal macro, but that is beyong the scope of this crate.

adamreichold commented 2 months ago

I think setting up the crate structure and using the runtime-based implementation would be fine until we can modify selectors to support an embeddable representation.

teymour-aldridge commented 2 months ago

To be honest I'm not 100% sure what benefit this macro confers? The syntax of CSS selectors is relatively straightforward and I think most people have to test their selectors anyway (to ensure they're picking thet right elements), something which this doesn't help.

Perhaps a builder might work better here?

adamreichold commented 2 months ago

The syntax of CSS selectors is relatively straightforward and I think most people have to test their selectors anyway (to ensure they're picking thet right elements), something which this doesn't help.

I agree that most problems we see during actual usage are not due to incorrect selector syntax but rather selectors not working as intended against the target DOM. The feedback for invalid syntax from a compile-time error is slightly faster though.

The macro (in its final form, not the intermediate check-at-compile-time-then-parse-again-at-runtime variant) would additionally provide efficiency gains insofar selectors would not have to parsed at runtime at all.

latot commented 2 months ago

There is some advantages.

  1. The actual Selector works internally with &str which blocks the error propagation, if a macro just uses a static &str would helps with that.
  2. We need to unwrap the selector to know is constructed well (without dynamic ones), this could be a lot simpler with a macro where at compile time we know it is right or not which is great.
  3. The macro could helps to split selectors in two, static ones and dynamic ones, well still if the selector uses String instead of &str for errors could be used one instead of two.

While more clear, better :3 at least is what I think.