scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

Message loss in spider feed #109

Open sibiryakov opened 8 years ago

sibiryakov commented 8 years ago

From here https://github.com/scrapinghub/distributed-frontera/issues/24#issuecomment-181386301

Another issue I noticed recently is that my DW keeps on pushing to all partitions although I have no spider running. When I start up my spiders now, they wait until the DW has pushed a new batch (although it had already pushed multiple times before that). This means that running the DW for a while without any spider running depletes the queue until there is nothing left to crawl. I have to add new seeds (not the same seed URLs, since duplicates get dropped) for the spiders to start again and for the DW to push new requests again.

sibiryakov commented 8 years ago

@lljrsr This is mostly done because there is no clear way of controlling the consumption rate on the producer side in ZeroMQ. Moreover, the contents of the queue on the ZeroMQ side, and the way it decides on the high-water mark and starts dropping messages, are implicit; the client application has no way to track any of this. We could partially solve this problem by disabling new batch generation when there are no heartbeats from spiders. Currently, spiders send offsets once per get_next_requests call, so we could try disabling new batch generation until the DBW receives its first offset. At the moment, all partitions are marked ready right after the DBW starts.
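
A minimal sketch of that idea, using offsets reported by spiders as an implicit heartbeat that marks a partition ready. The class and method names below are illustrative only, not Frontera's actual API:

```python
# Hypothetical sketch: gate batch generation on spider offsets.
# Until a spider reports its first offset, its partition stays "not ready"
# and the DB worker should not generate batches for it.

class PartitionGate:
    def __init__(self, partition_ids):
        self.ready = {pid: False for pid in partition_ids}

    def on_spider_offset(self, partition_id, offset):
        # Spiders send their consumed offset on every get_next_requests call;
        # the first offset we see acts as a heartbeat marking the partition ready.
        self.ready[partition_id] = True

    def partitions_to_feed(self):
        # The DB worker only pushes new batches to partitions with a live spider.
        return [pid for pid, ok in self.ready.items() if ok]


# With no offsets received yet, nothing is fed, so the queue is not
# drained while no spider is running.
gate = PartitionGate(partition_ids=[0, 1])
assert gate.partitions_to_feed() == []
gate.on_spider_offset(0, offset=42)
assert gate.partitions_to_feed() == [0]
```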

lljrsr commented 8 years ago

Yes, I think that would be a good idea. It may increase the startup time for spiders, but it would make Frontera easier to use because this issue would no longer occur.

UPDATE Actually, I think a better approach would be to delete requests from the queue only once a spider has actually received them. The DW would then push indefinitely (although it might not need to), and some duplicate requests might get pushed to ZMQ, but at least there would be no actual message loss. The heartbeat feature could be added on top of that, but it is much more important that every request gets executed at least once. A polling approach could also be nice (the DW checks frequently for spider availability and pushes only if spiders explicitly request it).
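
A rough sketch of that "delete only after the spider has seen it" idea: requests stay in the queue until they are acknowledged, trading possible duplicates for at-least-once delivery. Everything below (the AckQueue class and its methods) is hypothetical and not an existing Frontera interface:

```python
import time


class AckQueue:
    """Hypothetical queue that removes requests only on acknowledgement."""

    def __init__(self, requeue_after=60.0):
        self.pending = []                 # not yet handed to any spider
        self.in_flight = {}               # request_id -> (request, sent_at)
        self.requeue_after = requeue_after

    def push(self, request_id, request):
        self.pending.append((request_id, request))

    def get_batch(self, max_requests):
        # Hand out requests but keep them as in-flight instead of deleting them.
        batch, self.pending = self.pending[:max_requests], self.pending[max_requests:]
        now = time.time()
        for request_id, request in batch:
            self.in_flight[request_id] = (request, now)
        return batch

    def ack(self, request_id):
        # Only an explicit acknowledgement from the spider removes the request.
        self.in_flight.pop(request_id, None)

    def requeue_expired(self):
        # Requests that were never acknowledged go back to pending, so they
        # may be delivered twice but are never lost.
        now = time.time()
        for request_id, (request, sent_at) in list(self.in_flight.items()):
            if now - sent_at > self.requeue_after:
                del self.in_flight[request_id]
                self.pending.append((request_id, request))
```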