sibiryakov opened this issue 8 years ago
@lljrsr This is mostly done because there is no clear way to control the consumption rate from the producer side in ZeroMQ. Moreover, the contents of the queue on the ZeroMQ side, the way it applies the high-water mark, and the point where it starts dropping messages are all implicit. The client application has no way to track any of this.
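To illustrate why (a minimal pyzmq sketch, not frontera code): once a PUB socket's send high-water mark is reached, ZeroMQ silently discards further messages, and `send()` still succeeds, so the producer never learns that anything was dropped.

```python
import zmq

ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.setsockopt(zmq.SNDHWM, 1000)  # buffer at most ~1000 messages per subscriber
pub.bind("tcp://*:5556")

for i in range(100_000):
    # With a slow or absent subscriber, the HWM is hit quickly; excess
    # messages are dropped inside ZeroMQ and no error is reported here.
    pub.send_multipart([b"requests", str(i).encode()])
```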
We could partially solve this problem by disabling new batch generation when there are no heartbeats from spiders. Currently, spiders send their offsets once per get_next_requests call, so we could disable new batch generation until the DBW receives the first offset. At the moment, all partitions are marked ready as soon as the DBW starts.
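A rough sketch of that gating (class and method names here are hypothetical, not frontera's actual API):

```python
class BatchGate:
    """Hold back batch generation for a partition until the DBW has seen
    at least one offset (i.e. a heartbeat) from the spider consuming it."""

    def __init__(self, partitions):
        # Instead of marking every partition ready at DBW start,
        # all partitions begin as not ready.
        self.ready = {p: False for p in partitions}

    def on_offset_received(self, partition, offset):
        # Spiders send an offset on every get_next_requests call;
        # the first one proves the consumer is alive.
        self.ready[partition] = True

    def can_generate_batch(self, partition):
        return self.ready[partition]
```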
Yes, I think that would be a good idea. It may increase spider startup time, but it would make frontera easier to use because this issue would no longer exist.
UPDATE: Actually, I think a better approach would be to delete requests from the queue only once a spider has actually looked at them. The DW would then push indefinitely (although it might not need to), and some duplicate requests might get pushed to ZMQ, but at least there would be no actual message loss. The heartbeat feature could be added on top of this, but it is much more important that every request actually gets executed at least once. A polling approach could also be nice: the DW checks frequently for spider availability and pushes only when spiders explicitly request batches. See the sketch below.
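A rough sketch of that deferred-deletion idea (all names hypothetical; this is not the actual DW code). Requests stay in the queue until a spider acknowledges them; unacknowledged requests are re-pushed after a timeout, so duplicates are possible but loss is not:

```python
import time

class AtLeastOnceQueue:
    def __init__(self, ack_timeout=60.0):
        self.pending = {}  # request_id -> (request, last_push_time)
        self.ack_timeout = ack_timeout

    def push(self, request_id, request, send):
        send(request)  # e.g. publish to ZMQ
        self.pending[request_id] = (request, time.time())

    def ack(self, request_id):
        # Called when a spider reports it has consumed the request;
        # only now is it safe to delete it from the queue.
        self.pending.pop(request_id, None)

    def repush_stale(self, send):
        # Periodically re-push anything not yet acknowledged.
        now = time.time()
        for rid, (request, pushed_at) in list(self.pending.items()):
            if now - pushed_at > self.ack_timeout:
                send(request)  # a duplicate is acceptable; a loss is not
                self.pending[rid] = (request, now)
```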
From here https://github.com/scrapinghub/distributed-frontera/issues/24#issuecomment-181386301