Closed: zeluspudding closed this issue 6 years ago
Hi @zeluspudding, nothing in your code looks really off. Without having everything set up on my end, it's a bit hard to help you with the problem. I would definitely yield up front before starting on each URL.
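For reference, "yielding up front" with Nightmare usually means driving the instance from a generator (for example with the `vo` runner) and yielding each action before starting the next URL. A minimal sketch, assuming `vo` is installed; the URLs here are placeholders:

```js
// Drive Nightmare from a generator and yield each navigation before
// queuing the next one on the shared Electron instance.
const Nightmare = require('nightmare');
const vo = require('vo');

vo(function* () {
  const nightmare = Nightmare({ show: false });
  const urls = ['https://example.com/a', 'https://example.com/b']; // placeholders
  const results = [];

  for (const url of urls) {
    // Yield up front so each visit fully settles before the next begins.
    const title = yield nightmare
      .goto(url)
      .wait('body')
      .evaluate(() => document.title);
    results.push({ url, title });
  }

  yield nightmare.end();
  return results;
})((err, results) => {
  if (err) throw err;
  console.log(results);
});
```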
If you're still having this problem, please reopen with some additional details on how we can run this ourselves. Thanks!
I have a scraping job I'd like to multithread because I have several thousand URLs I need to test. The code below:

1. reads in a CSV with my URL targets,
2. chunks those targets,
3. distributes those chunks to Nightmare sessions, which
4. visit each URL in their chunk after logging into a website, and finally
5. each worker writes its results to CSV.

The script seems to work as desired, except that one worker always scrapes most of its allotted URLs (say 35 of 40) while the others don't (say 8 of 40). I see the same behavior whether I have 2 workers or 15. Why?
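The original snippet did not survive here, so the following is only a reconstruction of the five steps above; the file names, selectors, credentials, and login flow are placeholder assumptions. Each worker gets its own Nightmare instance and a disjoint slice of the targets:

```js
// A minimal sketch of the described pipeline; not the author's actual code.
const Nightmare = require('nightmare');
const fs = require('fs');

const NUM_WORKERS = 4;

// 1) Read URL targets from a CSV (assumed here to be one URL per line).
const urls = fs.readFileSync('targets.csv', 'utf8').trim().split('\n');

// 2) Chunk the targets so each worker gets a disjoint slice.
const chunkSize = Math.ceil(urls.length / NUM_WORKERS);
const chunks = [];
for (let i = 0; i < urls.length; i += chunkSize) {
  chunks.push(urls.slice(i, i + chunkSize));
}

// 3-5) Each worker logs in once, visits its URLs, and writes its own CSV.
// Nightmare instances are thenable, so async/await works for sequencing.
async function runWorker(id, chunk) {
  const nightmare = Nightmare({ show: false });

  // Hypothetical login flow; selectors depend on the target site.
  await nightmare
    .goto('https://example.com/login')
    .type('#username', process.env.SCRAPE_USER)
    .type('#password', process.env.SCRAPE_PASS)
    .click('#submit')
    .wait('#dashboard');

  const rows = [];
  for (const url of chunk) {
    // 4) Visit each URL in the chunk and pull something out of the page.
    const title = await nightmare
      .goto(url)
      .wait('body')
      .evaluate(() => document.title);
    rows.push(`${url},${JSON.stringify(title)}`);
  }

  await nightmare.end();
  // 5) Each worker writes its results to its own CSV file.
  fs.writeFileSync(`results-${id}.csv`, rows.join('\n'));
}

chunks.forEach((chunk, id) => {
  runWorker(id, chunk).catch(err => console.error(`worker ${id} failed:`, err));
});
```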
At first I thought it was because the first worker would finish and then somehow terminate the other sessions. But that doesn't seem likely, since the other sessions keep saving their CSV results for up to a minute after the first one is done. What's more, each session gets its own memory space... so that can't be it.
In general, running multiple workers in any application strains the resources available to each of them. But if that were the issue here, I'd expect all the workers to have similar throughput, not one worker with high throughput and the others with very little.
Here's something weird: workers scrape the same URLs multiple times. I'm not sure why, but it seems to get worse as I add workers, duplicating work and totally wasting scrape cycles. In some cases I've had the same URL scraped 170 times.
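One way to see where the duplicates come from is to log every visit in one shared table; if counts climb for URLs outside a worker's own chunk, the chunking overlaps. A small diagnostic sketch (`recordVisit` and the shared `Map` are hypothetical names, not from the original code):

```js
// Count visits per URL across all workers (they run in one Node process
// in the sketch above) to see whether chunks overlap or a worker is
// re-entering its own loop.
const visited = new Map(); // url -> visit count

function recordVisit(workerId, url) {
  const count = (visited.get(url) || 0) + 1;
  visited.set(url, count);
  if (count > 1) {
    console.warn(`worker ${workerId}: ${url} visited ${count} times`);
  }
}
```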
What am I doing wrong?