istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Slow Scheduler Memory Build Up #64

Closed madisonb closed 8 years ago

madisonb commented 8 years ago

The Distributed Scheduler keeps every domain queue it has ever seen in memory, so we can run extremely fast lookup loops against all known domain keys. In turn, we only update the domain queues once every X seconds for new domains that we have never seen before. With every new domain we see, we create an object in memory representing the way to access the Redis-based queue.

The problem is that these objects inside the scheduler are never deleted. If we see new domains every time we check, we will eventually run out of available memory on the host, because we simply keep adding more and more objects to our lookup dictionary.
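For illustration, here is a minimal sketch of the growth pattern (the names are hypothetical; the real Distributed Scheduler is more involved than this):

```python
class RedisPriorityQueue(object):
    """Stand-in for the Redis-backed queue object created per domain."""
    def __init__(self, redis_conn, key):
        self.redis_conn = redis_conn
        self.key = key


class SchedulerSketch(object):
    def __init__(self, redis_conn):
        self.redis_conn = redis_conn
        self.queue_dict = {}  # domain key -> queue object, never cleaned up

    def update_queues(self, domain_keys):
        # Runs every X seconds; every never-before-seen domain adds another
        # object to the dictionary, and nothing ever removes one.
        for key in domain_keys:
            if key not in self.queue_dict:
                self.queue_dict[key] = RedisPriorityQueue(self.redis_conn, key)
```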

This problem can be solved in a number of ways:

  1. An expiring timer on every dictionary key, in memory: if there has not been a successful pop() from that queue within X seconds, it gets cleaned from memory.
  2. An LRU cache based setup where, if the key count exceeds a certain threshold (like 10,000), we begin to delete the least recently used keys every time we add new ones. This may also involve both a timestamp and a count of the number of times a key has been used.
  3. Delete all keys every X seconds, whether it is every day, hour, etc. This is very naive.

I am in favor of option 1. The implementation would then behave as follows:

If you set the threshold to something like 1 year (large), you would expect the queue dictionary to grow until all available memory is used. If you set the threshold to 10 minutes (small), you would expect the queue dictionary to only grow to the size of all known domains within the past 10 minutes.
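A minimal sketch of option 1, assuming a scheduler-local dictionary and a configurable timeout (all names below are hypothetical): each entry records the time of its last successful pop(), and the regular loop over known domain keys drops anything that has been idle longer than the threshold.

```python
import time

class ExpiringQueueDict(object):
    """Sketch: domain queues expire when they have not popped recently."""

    def __init__(self, queue_timeout):
        self.queue_timeout = queue_timeout  # seconds, e.g. 600 for 10 minutes
        self.queues = {}  # domain key -> (queue object, last successful pop)

    def add(self, domain, queue):
        self.queues[domain] = (queue, time.time())

    def pop(self, domain):
        queue, last_pop = self.queues[domain]
        item = queue.pop()
        if item is not None:
            # Only a successful pop resets the expiration timer
            self.queues[domain] = (queue, time.time())
        return item

    def expire_stale(self):
        # Called as part of the normal loop over all known domain keys
        now = time.time()
        for domain in list(self.queues):
            _, last_pop = self.queues[domain]
            if now - last_pop > self.queue_timeout:
                del self.queues[domain]
```

With the timeout set very large this behaves like the current code; with something like 600 seconds, the dictionary stays bounded by the set of recently active domains.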

I think this solves the slow memory growth we see when crawling millions of different domains over time.

madisonb commented 8 years ago

There is also a pip package that claims to already do this, but I think we need something more than just 'on key access' removal: https://pypi.python.org/pypi/expiringdict. I think a home-rolled solution is going to work best here, since we randomly loop over the keys anyway and can check for expiration.
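For reference, a rough sketch of how that package is used (the constructor arguments below are to the best of my understanding of its API); expired entries are generally only purged when the dictionary is touched again, which is why a check built into our own key loop seems like the better fit:

```python
from expiringdict import ExpiringDict

# Keys older than max_age_seconds are dropped, but typically only when the
# dictionary is accessed again, not proactively in the background.
queue_cache = ExpiringDict(max_len=10000, max_age_seconds=600)
queue_cache['example.com'] = object()  # stand-in for a queue object
# ...some time later...
queue_cache.get('example.com')  # None once the entry has aged out
```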

madisonb commented 8 years ago

This will undergo at-scale testing over the weekend; hoping to close this on Monday.

madisonb commented 8 years ago

Over the past 5 days, the cluster the test was deployed on crawled 250,000 different domains, and the memory footprint is almost the same as when the crawlers were restarted. I am going to close this as completed.