mkralla11 / quick_crawler

A configurable async Rust crate that provides a simple way to declaratively navigate to multiple webpages, scrape their contents, and follow links to scrape more.

Doesn't seem to work for me #3

Closed: Boscop closed this issue 4 years ago

Boscop commented 4 years ago

Any idea why this doesn't work?

let start_urls = vec![
    StartUrl::new()
        .url("https://wikipedia.org")
        .method("GET")
        .response_logic(Parallel(vec![
            Scrape::new()
                .find_elements_with_urls("a")
                .extract_urls_from_elements(ElementUrlExtractor::Attr("href".to_string()))
                .store(|vec: Vec<String>| async move {
                    println!("store: {:?}", vec);
                }),
        ])),
];

let mut builder = QuickCrawlerBuilder::new();
let limiter = Limiter::new();
builder.with_start_urls(start_urls).with_limiter(limiter);
let crawler = builder
    .finish()
    .map_err(|_| "Builder could not finish")
    .expect("no error");
let res = async_std::task::block_on(async { crawler.process().await });

It only prints:


store: []
mkralla11 commented 4 years ago

The methods extract_urls_from_elements and extract_data_from_elements are two separate pieces of functionality (as explained below).

find_elements_with_urls and extract_urls_from_elements are used in combination: find_elements_with_urls first finds the elements that will, in some fashion, "contain" the URLs you want, and extract_urls_from_elements then actually extracts the URLs from those specific elements, using one of the Extractor enums. The extracted URLs are used to crawl further pages; they are not handed to your store closure, which is why your example prints an empty vec.

find_elements_with_data and extract_data_from_elements are used in combination: find_elements_with_data first finds the elements that "contain" the actual data you want, and extract_data_from_elements then actually extracts that data from those specific elements, using one of the Extractor enums. That extracted data is what gets passed to store.
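
Putting the two together, here is a minimal sketch of the intended usage (same imports as your snippet; it assumes, as in the crate's README example, that a URL-extracting Scrape can take its own nested response_logic that runs against the pages those URLs resolve to):

let start_urls = vec![
    StartUrl::new()
        .url("https://wikipedia.org")
        .method("GET")
        .response_logic(Parallel(vec![
            Scrape::new()
                // 1. find the elements on the start page that "contain" urls
                .find_elements_with_urls("a")
                // 2. pull the urls out of those elements
                .extract_urls_from_elements(ElementUrlExtractor::Attr("href".to_string()))
                // 3. on each page those urls lead to, find the data-bearing elements
                .response_logic(Parallel(vec![
                    Scrape::new()
                        .find_elements_with_data("p")
                        // 4. extract their inner text and hand it to store
                        .extract_data_from_elements(ElementDataExtractor::Text)
                        .store(|vec: Vec<String>| async move {
                            println!("paragraph text from a followed page: {:?}", vec);
                        }),
                ])),
        ])),
];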

Please let me know if that doesn't make sense. I did see your other issue about being able to do it all at once, so you can keep a reference to the hrefs/URLs associated with the given data; that definitely seems like a future addition to the public API. I hope this helps!

Boscop commented 4 years ago

Yes, but isn't that what I'm doing? How am I using these functions incorrectly? The example above should print the href values of all <a> elements, but it prints an empty vec.

mkralla11 commented 4 years ago

You did not understand my explanation. To make it more understandable, here is your example, but using the correct combination of functions:

let start_urls = vec![
    StartUrl::new()
        .url("https://wikipedia.org")
        .method("GET")
        .response_logic(Parallel(vec![
            Scrape::new()
                .find_elements_with_data("a")
                .extract_data_from_elements(ElementDataExtractor::Text)
                .store(|vec: Vec<String>| async move {
                    println!("list of inner text data from the links (not the hrefs themselves): {:?}", vec);
                }),
        ])),
];

let mut builder = QuickCrawlerBuilder::new();
let limiter = Limiter::new();
builder.with_start_urls(start_urls).with_limiter(limiter);
let crawler = builder
    .finish()
    .map_err(|_| "Builder could not finish")
    .expect("no error");
let res = async_std::task::block_on(async { crawler.process().await });

Also, as indicated by my response to your other feature request, the only data provided to the store closure when scraping is the data inside the elements (the inner text). I do like your idea of supporting traversing nodes, finding data to store, and finding URLs to traverse to next, all within the same callback. As of right now, that is not supported.

mkralla11 commented 4 years ago

Linking this issue to your other one (#2), because you describe what you are trying to do there with a use case. I would love to get that feature incorporated!

Boscop commented 4 years ago

@mkralla11 Ah. In the above example, what I wanted to do was print the URLs, not their inner text. The example came about because I wanted to debug my crawler and see why relative URLs weren't working. I can imagine other use cases where it makes sense to print the URLs (href attributes), e.g. when you want to pipe them to a file to later download them with wget, or for indexing.
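
For instance, something like the following store closure would cover the "pipe to a file" case, if the hrefs were ever passed through (hypothetical; with the current API it would only receive the inner text):

// Hypothetical store closure: append whatever strings the extractor hands back
// to a file, one per line, so they can later be fed to wget or an indexer.
// With the current API this would be inner text rather than hrefs.
.store(|vec: Vec<String>| async move {
    let lines = vec.join("\n");
    // "urls.txt" is just an example path
    async_std::fs::write("urls.txt", lines)
        .await
        .expect("could not write extracted values");
})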

Btw, in #2 I added a more thought-out idea for the design that would also allow printing the URLs.

Boscop commented 4 years ago

Btw, in my real-world use case, I have certain <a> elements as leaves that I need to extract data from, and I also need their href attribute, because those hrefs point to downloadable files (not further sites to scrape). My handler wants to add those links to a separate task pool to download the files, but it can only do that if it has the href attribute of the leaves.
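
Roughly the shape I have in mind, as a purely hypothetical sketch (the (data, href) tuple is made up; the current API only passes the inner text):

// Hypothetical: if the store closure also received each leaf's href, it could
// queue the file downloads on a separate set of tasks, e.g. with async-std.
.store(|items: Vec<(String, String)>| async move { // (inner text, href) pairs: made-up signature
    for (_label, href) in items {
        async_std::task::spawn(async move {
            // fetch and save the file behind `href` here (e.g. with surf)
            println!("would download: {}", href);
        });
    }
})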