rchipka / node-osmosis

Web scraper for NodeJS

Parallel concurrent scraping #115

Closed kulikalov closed 8 years ago

kulikalov commented 8 years ago

I'm looping through an array of URLs (~12) with this script:

osmosis.get(url)
    .header('accept-language', (language === 'ru' ? 'ru-RU,ru;q=0.8,en;q=0.6' : 'en-US,en;q=0.8,ru;q=0.6'))
    .set({
      name: 'title',
      ogName: 'meta[property="og:title"]@content',
      ogDescription: 'meta[property="og:description"]@content',
      metaDescription: 'meta[name="description"]@content',
      ogImage: 'meta[property="og:image"]@content',
      metaImage: 'meta[name="image"]@content',
      headImage: 'head img@src',
      contentImage_1: '.content img@src',
      contentImage_2: '.image img@src',
    })
    .data(function(results) {
      console.log(results);
      // only the first 5 items get here
    })

For some reason, only the 1st through 5th URLs are ever processed, and the rest are simply ignored. Why is this happening? I'm absolutely sure this is not caused by the URLs themselves: I've tried changing their order, with the same result.

I've also created a queue so that the requests are executed one by one. That works: all of the queued URLs get scraped. But when I compare execution times, even for just 5 items, the queue takes more than 4 times (!) as long as the parallel run.

In fact, I need to run many more concurrent scraping processes. Is there a way to make this work properly?
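A middle ground between the fully parallel loop and the one-by-one queue is to cap how many requests are in flight at once. Below is a minimal sketch of that idea in plain JavaScript. It is not part of osmosis: `runWithLimit` and `fakeScrape` are hypothetical names introduced here for illustration, and `fakeScrape` only simulates a request with a timer.

```javascript
// Run `worker(item)` for every item, with at most `limit` calls in flight.
// Resolves to an array of results in the same order as `items`.
// If any worker rejects, the returned promise rejects with that error.
function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to hand out

  // Each "lane" keeps pulling the next unclaimed item until none are left.
  async function lane() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }

  const lanes = [];
  const laneCount = Math.min(limit, items.length);
  for (let k = 0; k < laneCount; k++) {
    lanes.push(lane());
  }
  return Promise.all(lanes).then(() => results);
}

// Hypothetical stand-in for a real scrape: resolves after a short delay.
function fakeScrape(url) {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ url: url }), 10);
  });
}

// Example: process 12 placeholder URLs, at most 5 at a time.
const exampleUrls = Array.from({ length: 12 }, (_, i) => 'page-' + i);
runWithLimit(exampleUrls, 5, fakeScrape).then((pages) => {
  console.log('scraped', pages.length, 'pages');
});
```

To use this with the script above, the worker would presumably wrap the `osmosis.get(url)` chain in a Promise that resolves from the `.data` callback. How that interacts with the library's own request handling is an assumption that would need checking against the installed version.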

rchipka commented 8 years ago

Fixed in 1.1.0