matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License
5.88k stars 350 forks

Allow plugin for url feeding / pagination #109

Closed 0xgeert closed 5 years ago

0xgeert commented 9 years ago

I'm looking to use x-ray as my go-to solution for crawling. However, one thing seems to be lacking for me: custom feeding of URLs.

At the moment the only way to feed URLs (and have x-ray chew away at them in parallel) is to use pagination. However, how to proceed if:

What would be the best way to hook in a custom 'URL feeder' to tackle the above?

demiro commented 8 years ago

Yeah, I would like to know the same. Suppose there is a list of categories on the root page that lead to the product lists: that's not really pagination.

sylvery commented 8 years ago

@gebrits You could do something like this:

  1. Request links from the database table
  2. Store the links in an array
  3. Within the .forEach() function of the array, you could use x-ray to scrape the links and get the information you require.

Here's an example:

```js
// links from a query to the DB
var queryResult = ["http://www.linkone.com", "http://www.linktwo.com", "http://www.linkthree.com"];

queryResult.forEach(function (link) {
  // for each link, x-ray crawls the page and passes the
  // extracted information to the callback function
  xray(link, selector)(callback);
});
```

I hope this helps

kanethal commented 7 years ago

Thank you @sylvery, I've been wondering about this also. Does anyone know if x-ray's throttle and delay options are honored when using .forEach()? The site I'm scraping will lock me out if we don't rate limit. Or can you recommend a workaround? (I'm still new and not sure how to build in a cycle delay.) Thanks!
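One workaround, independent of whether x-ray's own throttle/delay settings apply across separate top-level calls made from a .forEach() loop: process the links sequentially and pause between requests yourself. This is only a sketch; the `scrape` function below is a hypothetical stand-in for a real `xray(link, selector)(callback)` call, and `delayMs` is an assumed rate limit you would tune to the target site.

```js
// Sequential crawl with a fixed pause between requests,
// so the target site never sees a parallel burst.
var delayMs = 500; // assumed pause between requests; tune to the site's limits

// Hypothetical stand-in for xray(link, selector)(callback).
// Replace the body with the real x-ray call in practice.
function scrape(link, done) {
  done(null, { url: link });
}

// Scrape the queue one link at a time, waiting delayMs between requests.
function crawlSequentially(queue, results, done) {
  if (queue.length === 0) return done(results);
  scrape(queue[0], function (err, data) {
    if (!err) results.push(data);
    setTimeout(function () {
      crawlSequentially(queue.slice(1), results, done);
    }, delayMs);
  });
}

crawlSequentially(
  ["http://www.linkone.com", "http://www.linktwo.com"],
  [],
  function (results) {
    console.log(results.length + " pages scraped");
  }
);
```

Because each request only starts after the previous one finishes plus the delay, the effective rate is at most one request per `delayMs` (plus response time), which is usually enough to stay under a lockout threshold.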

lathropd commented 5 years ago

Duplicate of https://github.com/matthewmueller/x-ray/issues/54. PRs welcome.