A couple things to keep in mind for this:
a) Stop using randomization;
b) Restrict the number of workers per domain to that domain's maximum number of connections divided by a chosen factor, rounded up (e.g., if http://mysite.com has a limit of 2 connections and we assume each site has 10 links on average, then 2 / 10 = 0.2, which rounds up to 1, so only one worker should be working on http://mysite.com pages at a time);
c) Retrieve the next X pages in need of review, where X is the number of workers;
d) Use a lock in Redis to make sure each worker gets an appropriate job. If the lock can't be acquired, try the next page in need of review.
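
To make points b) and d) concrete, here is a rough sketch, not the project's actual code. The names `max_workers_for_domain`, `claim_next_page` and `InMemoryLocks` are made up for illustration, as are the `lock:` key prefix and the 60-second TTL. `InMemoryLocks` only stands in for a Redis client so the example runs without a server; with redis-py the same `set(key, value, nx=True, ex=...)` call should behave equivalently. The per-domain limit is only computed here, not enforced.

```python
import math


def max_workers_for_domain(connection_limit, avg_links_per_site):
    """Workers allowed on one domain: limit / factor, rounded up (never below 1)."""
    return max(1, math.ceil(connection_limit / avg_links_per_site))


class InMemoryLocks:
    """Hypothetical stand-in for a Redis client; mimics only set(..., nx=True)."""

    def __init__(self):
        self._keys = {}

    def set(self, key, value, nx=False, ex=None):
        # The expiry (ex) is accepted but ignored in this stand-in.
        if nx and key in self._keys:
            return None  # key already held by someone else
        self._keys[key] = value
        return True


def claim_next_page(client, worker_id, candidate_pages, lock_ttl_seconds=60):
    """Return the first candidate page this worker manages to lock, or None.

    candidate_pages is assumed to be the next X pages in need of review,
    already ordered (no randomization).
    """
    for page_url in candidate_pages:
        # SET with NX succeeds only if no other worker holds the lock.
        if client.set(f"lock:{page_url}", worker_id, nx=True, ex=lock_ttl_seconds):
            return page_url
    return None
```

With two candidate pages, the first worker to call `claim_next_page` would get the first page, the second worker would fail to lock it and move on to the second page, and a third worker would get `None` and have to wait or fetch more candidates.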