Closed madisonb closed 8 years ago
There is also a pip package that claims to already do this (https://pypi.python.org/pypi/expiringdict), but I think we need something more than just 'on key access' removal. I think the home-rolled solution is going to work best here, since we randomly loop over the keys anyway and can check for expiration.
This will undergo at-scale testing over the weekend; I am hoping to close this Monday.
Over the past 5 days the cluster the test was deployed on crawled 250,000 different domains and the memory footprint is almost the same as when the crawlers were restarted. I am going to close this as completed.
The Distributed Scheduler keeps every domain queue it has ever seen in memory, so we can do extremely fast lookup loops against all known domain keys. In turn, we only update the domain queues once every X seconds for new domains that we have never seen before. With every new domain we see, we create an object in memory representing the way to access the Redis-based queue.
The problem is that these objects inside of the scheduler are never deleted. If we see new domains every time we check, eventually we will get to a point where we run out of available memory on the host because we simply keep adding more and more objects to our lookup dictionary.
This problem can be solved in a number of ways:
pop() from that queue within X seconds it gets cleaned from memory.
I am in favor of point 1. Implementation would be then:
queue_dict[final_key] with a tuple of (ThrottledQueue, timestamp)
If pop() succeeds, update the timestamp, else check if the timestamp diff is greater than our delete threshold.
If you set the threshold to something like 1 year (large), you would expect the queue to grow until all available memory is used. If you set the threshold to 10 minutes (small), you would expect the queue dictionary to only grow to the size of all known domains within the past 10 minutes.
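A minimal sketch of the timestamp idea, not the actual scheduler code. It assumes queue_dict maps each key to a (queue, timestamp) tuple and that pop() returns None when nothing is available. FakeQueue, pop_and_expire and the threshold/now parameters are hypothetical names used only for illustration; FakeQueue stands in for the real ThrottledQueue. Deleting an entry here only drops the in-memory object, not anything stored in Redis.

```python
import time


class FakeQueue:
    """Hypothetical stand-in for the Redis-backed ThrottledQueue."""

    def __init__(self, items=None):
        self.items = list(items or [])

    def pop(self):
        # Return the next item, or None when the queue is empty.
        return self.items.pop(0) if self.items else None


def pop_and_expire(queue_dict, key, threshold, now=None):
    """Pop from one queue; forget the queue if it has been idle too long."""
    now = time.time() if now is None else now
    queue, last_success = queue_dict[key]
    item = queue.pop()
    if item is not None:
        # Successful pop: refresh the timestamp for this key.
        queue_dict[key] = (queue, now)
    elif now - last_success > threshold:
        # No successful pop for longer than the threshold: drop the entry.
        del queue_dict[key]
    return item


# Usage: one busy domain, one idle domain, a 10 minute threshold.
queue_dict = {
    "busy.com:queue": (FakeQueue(["request"]), 0.0),
    "idle.com:queue": (FakeQueue(), 0.0),
}
# Iterate over a copy of the keys because entries may be deleted.
for key in list(queue_dict):
    pop_and_expire(queue_dict, key, threshold=600, now=1000.0)
# Only the busy domain is still tracked, with a refreshed timestamp.
```

Because the check piggybacks on the existing loop over all keys, no separate cleanup pass is needed; a domain that reappears later would simply be added again by the periodic queue update.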
I think this solves the slow memory growth we see when crawling millions of different domains over time.