mkralla11 / quick_crawler

A configurable async Rust crate that provides a simple way to declaratively navigate to multiple webpages, scrape contents, and follow links to scrape more.

Custom User Agent? #1

Closed: Boscop closed this issue 4 years ago

Boscop commented 4 years ago

Thanks for this crate, it seems very useful :)

It would be useful to be able to specify a custom user agent header, because some sites require a specific user agent when crawling.

mkralla11 commented 4 years ago

I totally agree! I actually plan to make the request library under the hood completely agnostic, so developers can not only add custom user agents but also apply any request-related configuration to every request. The API I plan to support would look something like this:

    builder
        .with_start_urls(
            start_urls
        )
        .with_limiter(
            limiter
        )
        // the closure will be provided other args as well, still fleshing out the API
        .with_request_handler(|url: String| async move {
            // ... use any request library here, e.g. surf or reqwest
            surf::get(url).recv_string().await
        });
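To tie this back to the original request: with a handler like that, a custom user agent would just live inside the closure. As a rough sketch (the closure signature above is still only proposed, so treat the wiring as illustrative), the body could be a plain async fn using reqwest:

    // Rough sketch only: written as a standalone async fn so it could be dropped
    // into a `with_request_handler` closure once that API lands.
    async fn fetch_with_user_agent(url: String) -> Result<String, reqwest::Error> {
        reqwest::Client::new()
            .get(&url)
            // any header can be set here, including a custom User-Agent
            .header(reqwest::header::USER_AGENT, "my-crawler/0.1")
            .send()
            .await?
            .text()
            .await
    }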

While I'm working on the updates for that API, and if you have time, I would totally accept any PRs for allowing custom request headers/user-agent information. The API could be as simple as:

    // .. prev code redacted
    builder
        .with_start_urls(
            start_urls
        )
        .with_limiter(
            limiter
        )
        .with_user_agent(
            "Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0"
        );

By having the builder store this metadata, all requests made by the resulting QuickCrawler would then use that user-agent. You could also make the user-agent configurable per request by adding a method to both the StartUrl and Scrape impls, which would be super useful as well!
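For illustration only, a per-request override might look something like this; the `user_agent` calls are hypothetical (nothing like them exists in the crate today), and the surrounding calls just follow the usual StartUrl/Scrape builder style:

    // Hypothetical sketch: `.user_agent(...)` is not a real method on StartUrl
    // or Scrape yet; it only shows where a per-request override could live.
    StartUrl::new()
        .url("https://example.com/search?q=bikes")
        .user_agent("my-crawler/0.1")              // override for this start URL only
        .response_logic(Parallel(vec![
            Scrape::new()
                .find_elements_with_urls(".result-item")
                .user_agent("my-crawler/0.1")      // or override per Scrape step
                // ... rest of the scrape configuration
        ]));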

Let me know if you have time to work on the user-agent feature and I can assign it to you :)

Also, if you like the library, feel free to star it! I love contributing to OSS, especially when I know it's helping other engineers.

Boscop commented 4 years ago

Sounds like a great roadmap :) I haven't used this crate yet, but I plan to soon. I'm not sure whether it'll work for my use case, but it looks very promising. I'm not sure when I'll have some time, though.

Another feature that would be useful is recursive crawling (not requiring selectors) that just follows all links and calls a given closure with the fetched target, no matter which content type. The closure can then do something with it, and if the content type is HTML, the crawler can recursively scan that page for more links. There would be a hysteresis mechanism to avoid infinite recursion, based on how many links were followed since the last "hit" (as determined by the closure).

This would be very useful for writing layout-agnostic mass-downloaders for specific file types, like a wget replacement with auto-depth/hysteresis (e.g. terminate after nothing was found for 3 followed links). Optionally one could disallow following links to other hosts, etc. The problem with wget is that it can't filter by MIME type, treats file extensions case-sensitively, isn't parallelized, and when filtering by file extension it can't find files behind links that don't contain the file extension, such as this. I'm kinda looking for a layout-agnostic, parallelized, auto-recursive wget replacement with better filtering.

Or can this crate in its current state already be used in a selector-agnostic way, just yielding all hrefs recursively? :)
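To make the hysteresis idea concrete, here's a rough standalone sketch (not tied to this crate's API; `fetch`, `extract_links` and `handle` are just placeholders for a real HTTP fetch, link extraction, and the user-provided closure):

    use std::{future::Future, pin::Pin};

    // Placeholders (hypothetical): a real implementation would fetch over HTTP,
    // parse links out of HTML, and call the user's closure to decide on a "hit".
    async fn fetch(_url: &str) -> (String, String) { ("text/html".into(), String::new()) }
    fn extract_links(_html: &str) -> Vec<String> { Vec::new() }
    fn handle(_content_type: &str, _body: &str) -> bool { false }

    // Stop a branch once `max_misses` links in a row produced no hit
    // (e.g. max_misses = 3 => terminate after nothing was found for 3 links).
    fn crawl(url: String, misses: usize, max_misses: usize) -> Pin<Box<dyn Future<Output = ()>>> {
        Box::pin(async move {
            if misses >= max_misses {
                return;
            }
            let (content_type, body) = fetch(&url).await;
            let hit = handle(&content_type, &body);
            if content_type.starts_with("text/html") {
                for link in extract_links(&body) {
                    // reset the miss counter on a hit, otherwise increment it
                    let next = if hit { 0 } else { misses + 1 };
                    crawl(link, next, max_misses).await;
                }
            }
        })
    }

The key bit is that the miss counter resets whenever the closure reports a hit, so the effective depth adapts to what's being found instead of being a fixed number.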

mkralla11 commented 4 years ago

Configurable RequestHandler async closure - allows any request library to be used to fetch a webpage, with any user agent, and return a string of HTML. https://github.com/mkralla11/quick_crawler/commit/af9b175d50cf41ad7dd865c584460cfb79f20bae