ukwa / wren

Experiments in testable, scalable crawler architectures
GNU Affero General Public License v3.0

Re-implement pulse crawling method #1

Closed anjackson closed 7 years ago

anjackson commented 8 years ago

The continuous crawling approach developed so far has, sadly, proven unstable. Instead, we will fall back on the existing 'pulse' crawling approach while we work out how best to proceed.

The 'pulse' approach is a compromise: it delivers stable crawls, but only by working around H3's inability to easily support the large number of separate crawls that our curators define in W3ACT. Each set of Targets is grouped by frequency, and each frequency group launches at the same point in a regular cycle. For example, there is a daily crawl that is stopped and re-launched every day at 9am. We therefore only pick up seeds at (roughly) that time, and only ever crawl one day deep. However, it does give stable and predictable job 'chunks' that H3 can cope with.
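The grouping-and-relaunch idea above can be sketched roughly as follows. This is a minimal illustration, not the actual W3ACT or H3 code: the `Target` dictionaries, the `frequency` field, and the 9am launch time are assumptions drawn from the description in this issue.

```python
# Hypothetical sketch of 'pulse' scheduling: targets are bucketed by crawl
# frequency, and each frequency bucket is launched at a fixed point in its
# cycle (e.g. the daily pulse at 09:00). Names here are illustrative only.
from collections import defaultdict
from datetime import datetime, time, timedelta

def group_by_frequency(targets):
    """Bucket targets by their declared crawl frequency."""
    groups = defaultdict(list)
    for t in targets:
        groups[t["frequency"]].append(t)
    return groups

def next_daily_launch(now, launch_at=time(9, 0)):
    """Next launch time for the daily pulse; rolls to tomorrow if 9am has passed."""
    candidate = datetime.combine(now.date(), launch_at)
    if now >= candidate:
        candidate += timedelta(days=1)
    return candidate
```

In this scheme a whole frequency bucket becomes one H3 job that is stopped and re-launched on each pulse, which is what makes the job 'chunks' predictable.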

The old code had become confused due to the split between the stable system and the document harvesting system. The code has been merged, but needs updating and testing, with the document harvester approach being supported appropriately.

To Do

Other ideas

anjackson commented 7 years ago

This Celery-based approach didn't work out, so I'm moving this over to ukwa/python-shepherd#10.