The methods extract_urls_from_elements and extract_data_from_elements are two separate pieces of functionality (as explained below).
find_elements_with_urls and extract_urls_from_elements are used in combination: you first find the elements that will, in some fashion, "contain" the urls you want to extract by calling find_elements_with_urls, and you then actually extract the urls from those specific elements by calling extract_urls_from_elements with one of the Extractor enums.
find_elements_with_data and extract_data_from_elements are used in combination: you first find the elements that will "contain" the actual data you want to extract by calling find_elements_with_data, and you then actually extract the data from those specific elements by calling extract_data_from_elements with one of the Extractor enums.
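For illustration, here is a rough sketch of the two pairs chained together: the url pair finds link-bearing elements and pulls the urls out of them, and the nested data pair then runs on the pages those urls lead to. The selectors, the ElementUrlExtractor::Attr variant, and the nested response_logic on Scrape are assumptions made for this example, so double-check the exact names against the crate's docs.

// Sketch only: selectors, the ElementUrlExtractor::Attr variant, and the nested
// response_logic on Scrape are assumptions for illustration, not copied from the docs.
let start_urls = vec![
    StartUrl::new()
        .url("https://example.com/listing")
        .method("GET")
        .response_logic(Parallel(vec![
            Scrape::new()
                // 1) find the elements that "contain" the urls...
                .find_elements_with_urls(".item a")
                // 2) ...and extract the urls from those elements (assumed extractor variant)
                .extract_urls_from_elements(ElementUrlExtractor::Attr("href".to_string()))
                // 3) follow each extracted url, then find and extract data on those pages
                .response_logic(Parallel(vec![
                    Scrape::new()
                        .find_elements_with_data(".item-title")
                        .extract_data_from_elements(ElementDataExtractor::Text)
                        .store(|vec: Vec<String>| async move {
                            println!("extracted text from followed pages: {:?}", vec);
                        }),
                ])),
        ])),
];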
Please let me know if that doesn't make sense. I did see your other issue about being able to do it all at once, so you can keep a reference to the hrefs/urls associated with the given data; that definitely seems like a future addition to the public API. I hope this helps!
Yes, but isn't that what I'm doing? How am I using these functions wrongly?
The above example should print all href values of all <a> elements, but it prints an empty vec.
You did not understand my explanation. To make it clearer, here is your example again, but using the correct combination of functions:
let start_urls = vec![
    StartUrl::new()
        .url("https://wikipedia.org")
        .method("GET")
        .response_logic(Parallel(vec![
            Scrape::new()
                .find_elements_with_data("a")
                .extract_data_from_elements(ElementDataExtractor::Text)
                .store(|vec: Vec<String>| async move {
                    println!("list of inner text data from the links (not the hrefs themselves): {:?}", vec);
                }),
        ])),
];

let mut builder = QuickCrawlerBuilder::new();
let limiter = Limiter::new();
builder.with_start_urls(start_urls).with_limiter(limiter);

let crawler = builder
    .finish()
    .map_err(|_| "Builder could not finish")
    .expect("no error");

let res = async_std::task::block_on(async { crawler.process().await });
Also, as indicated by my response to your other feature request, the only data provided to the store closure when scraping is the data inside the elements (their inner text). I do like your idea of supporting traversing nodes, finding data to store, and finding urls to traverse to next, all within the same callback. As of right now, that is not supported.
@mkralla11 Ah. In the above example, what I wanted to do is print the URLs, not their inner text. The above example came about because I wanted to debug my crawler, to see why the relative urls weren't working. I can imagine other use cases where it makes sense to print the urls (href attribs), e.g. when you want to pipe the urls to a file, to later download them with wget or do indexing or whatever.
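For what it's worth, once the urls are available as a Vec<String> (however they end up being exposed), dumping them one per line is enough to feed wget -i later. A minimal sketch, with the file name chosen arbitrarily:

use std::fs::File;
use std::io::{BufWriter, Write};

// Write one url per line so the file can later be consumed with `wget -i urls.txt`.
// "urls.txt" is just an example name.
fn dump_urls(urls: &[String]) -> std::io::Result<()> {
    let file = File::create("urls.txt")?;
    let mut out = BufWriter::new(file);
    for url in urls {
        writeln!(out, "{}", url)?;
    }
    Ok(())
}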
Btw, in #2 I added a more thought-out idea for the design that would also allow printing the URLs.
Btw, in my real-world use case, I have certain <a> elements as leaves that I need to extract data from, and I also need the href attribute, because those links point to downloadable files (not further sites to scrape). So my handler wants to add those links to a separate task pool to download those files (but it can only do that if it has the href attribute of the leaves).
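To make that concrete, here is a minimal sketch of such a downloader, assuming the store closure (or some future variant of it) were handed the href values, which it currently is not. The surf client and the file naming are illustrative choices, not part of quick_crawler's API.

use async_std::task;

// Hypothetical sketch: spawns one async-std task per href and downloads it with surf.
// A real implementation would bound concurrency and handle file names more carefully.
async fn download_all(hrefs: Vec<String>) {
    let mut handles = Vec::new();
    for href in hrefs {
        handles.push(task::spawn(async move {
            match surf::get(&href).recv_bytes().await {
                Ok(bytes) => {
                    // derive a local file name from the last path segment (illustrative only)
                    let name = href.rsplit('/').next().unwrap_or("download.bin");
                    let _ = async_std::fs::write(name, bytes).await;
                }
                Err(err) => eprintln!("failed to download {}: {}", href, err),
            }
        }));
    }
    for handle in handles {
        handle.await;
    }
}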
Any idea why this doesn't work?
It only prints: