mape / node-scraper

Easier web scraping using node.js and jQuery
MIT License

Parallel scraping results in misses & duplicates #6

Open deggis opened 13 years ago

deggis commented 13 years ago

What an awesome scraper platform! Got all geared up in no time.

However, while single-page scraping works just fine, parallel scraping with many URLs (I had 79) fails: some URLs are missed and others are fetched more than once, even though the total number of fetched URLs is correct.

I suspect the queuing implementation is to blame. I tried a small fix in scraper.js that produced the results I was hoping for.
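
[Editor's note: the patch itself isn't shown in this thread. Below is a minimal sketch, not deggis's actual fix, of a concurrency-limited queue in which each worker pulls its next URL off a shared array, so every URL is handed out exactly once; the function names and callback shapes are illustrative.]

```js
// Hypothetical sketch (not the actual scraper.js patch): a small
// concurrency-limited queue. shift() hands out each URL exactly once,
// avoiding the misses/duplicates a racy shared index can cause.
var http = require('http');

function scrapeAll(urls, concurrency, onPage, onDone) {
  var queue = urls.slice(); // copy so the caller's array is untouched
  var active = 0;

  function next() {
    // Done when the queue is drained and no requests are in flight.
    if (queue.length === 0 && active === 0) return onDone();

    while (active < concurrency && queue.length > 0) {
      var url = queue.shift(); // single-threaded Node: no race here
      active++;
      fetch(url);
    }
  }

  function fetch(url) {
    http.get(url, function (res) {
      var body = '';
      res.on('data', function (chunk) { body += chunk; });
      res.on('end', function () {
        onPage(url, body);
        active--;
        next(); // pull the next URL once this one finishes
      });
    }).on('error', function (err) {
      onPage(url, null, err);
      active--;
      next();
    });
  }

  next();
}
```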

gaara87 commented 13 years ago

I faced this problem too, with just 20 URLs rather than 79.

Is there a way to enforce a timeout?
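
[Editor's note: the thread doesn't answer this directly. For what it's worth, core Node lets you enforce a per-request timeout via req.setTimeout; a minimal sketch, with the URL and duration purely illustrative:]

```js
// Hypothetical sketch: a per-request timeout with core Node.
// setTimeout() fires after the socket has been idle for the given
// number of milliseconds; aborting then surfaces an 'error' event.
var http = require('http');

var req = http.get('http://example.com/', function (res) {
  // ... consume the response ...
});

req.setTimeout(5000, function () {
  req.abort(); // give up after 5 s of socket inactivity
});

req.on('error', function (err) {
  // aborted requests land here; retry or skip the URL
});
```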

deggis commented 13 years ago

From what I remember, a timeout wouldn't help; I think I tried that arrangement. I'm no JS guru, so I'm not dead sure what a rock-solid fix would look like, but mine at least worked for me :)

deggis commented 13 years ago

(Whoops, Comment & Close was kinda too close.)

nickewansmith commented 11 years ago

Try using cheerio instead of jsdom and implementing your own queuing; that worked for me!
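
[Editor's note: a minimal sketch of that approach, assuming the cheerio package is installed; the scrape helper and target URL are illustrative, not part of node-scraper's API. Pair it with your own queue, such as the earlier sketch, to control concurrency explicitly.]

```js
// Hypothetical sketch of the cheerio approach: fetch the HTML yourself
// and hand it to cheerio, which exposes a jQuery-like API without jsdom.
var http = require('http');
var cheerio = require('cheerio');

function scrape(url, callback) {
  http.get(url, function (res) {
    var html = '';
    res.on('data', function (chunk) { html += chunk; });
    res.on('end', function () {
      var $ = cheerio.load(html); // jQuery-style selector function
      callback(null, $);
    });
  }).on('error', callback);
}

// Usage: query the parsed page just like jQuery.
scrape('http://example.com/', function (err, $) {
  if (err) return console.error(err);
  console.log($('title').text());
});
```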