scrapinghub / hcf-backend

Crawl Frontier HCF backend
BSD 3-Clause "New" or "Revised" License

Potential problem with reading batches when batches are deleted only on consumer close? #15

Open hermit-crab opened 5 years ago

hermit-crab commented 5 years ago

Good day. Let's say we have a million requests inside a slot, and the consumer defines either HCF_CONSUMER_MAX_REQUESTS = 15000 or HCF_CONSUMER_MAX_BATCHES = 150, or it just closes itself after N hours. It also sets HCF_CONSUMER_DELETE_BATCHES_ON_STOP = True, so it only purges batches upon exiting.

In this case, since as far as I can tell there is no pagination for scrapycloud_frontier_slot.queue.iter(mincount), the consumer will be iterating over only the initial MAX_NEXT_REQUESTS, reading them over and over until it reaches either the max requests / max batches / self-enforced time limit, won't it?
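For illustration, a rough sketch of the behavior being described (not the actual backend code), assuming the python-scrapinghub frontier API mentioned above and that each batch yielded by the queue is a dict with an `id` field:

```python
# Sketch only: shows why repeated reads return the same batches when
# nothing is deleted until the consumer stops.
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient()  # reads SH_APIKEY from the environment
slot = client.get_project(12345).frontiers.get('my-frontier').get('0')

seen_batches = set()
for _ in range(150):  # e.g. HCF_CONSUMER_MAX_BATCHES = 150
    for batch in slot.queue.iter(mincount=1000):  # roughly MAX_NEXT_REQUESTS
        if batch['id'] in seen_batches:
            print('re-reading batch', batch['id'])  # happens on every pass
        seen_batches.add(batch['id'])
    # No slot.queue.delete(...) here: with DELETE_BATCHES_ON_STOP the delete
    # only happens on close, so the head of the queue never advances and the
    # same initial batches are returned again.
```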

starrify commented 5 years ago

Unfortunately hubstorage doesn't paginate HCF's slot read (yet?) (code). Therefore HCF_CONSUMER_DELETE_BATCHES_ON_STOP would cause the backend to access only the initial read (approximately MAX_NEXT_REQUESTS requests).

It seems that a good approach may not be available unless hubstorage supports pagination.

One possible workaround is to delete batches when the spider is idle (something like HCF_CONSUMER_DELETE_BATCHES_ON_IDLE?) to make sure all previously read requests have been consumed before performing the next read. However, this would usually hurt the spider's concurrency / throughput. A rough sketch of the idea is below.
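As a sketch only: the setting name and the `delete_processed_batches()` hook are hypothetical, hcf-backend does not currently expose them, and the attribute path to the backend is made up for illustration.

```python
# Hypothetical Scrapy extension illustrating "delete batches on idle".
from scrapy import signals
from scrapy.exceptions import NotConfigured


class DeleteBatchesOnIdle:
    def __init__(self, crawler):
        # Hypothetical setting, named after the suggestion in this thread.
        if not crawler.settings.getbool('HCF_CONSUMER_DELETE_BATCHES_ON_IDLE'):
            raise NotConfigured
        self.crawler = crawler
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_idle(self, spider):
        # The spider is idle, so all locally queued requests were consumed;
        # only now delete the batches that were read, before the next read.
        backend = getattr(self.crawler, 'frontier_backend', None)  # hypothetical attribute
        if backend is not None:
            backend.delete_processed_batches()  # hypothetical method
```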

kalessin commented 5 years ago

The idea behind HCF_CONSUMER_DELETE_BATCHES_ON_STOP is to support cases where one needs to ensure that batches are deleted only if the spider finishes. And yes, that requires setting MAX_NEXT_REQUESTS to the same number of requests/batches you want to read per job. So if you have 1 million requests, it is better to split crawling them across multiple jobs.
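To make that concrete, a settings fragment for such a per-job split could look like the following (the numbers are just the ones used earlier in this thread, not recommendations):

```python
# settings.py (illustrative values)
MAX_NEXT_REQUESTS = 15000                   # read this job's whole chunk in one go
HCF_CONSUMER_MAX_REQUESTS = 15000           # stop after the same amount
HCF_CONSUMER_DELETE_BATCHES_ON_STOP = True  # delete batches only if the job finishes
```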

About the alternative mentioned by @starrify: similar behavior, but deleting on the IDLE signal would require a different spider architecture. If you use the scrapy-frontera scheduler, new requests are not read on the idle signal but each time there are available concurrency slots and the local request queue is empty.

A possibility, without depending on hubstorage, is to implement (in scrapy-frontera?) some more advanced request tracking that deletes each batch once all requests for that batch have been processed (either successfully or with errors), provided some configurable conditions are met (for example, a maximum number of errors, things like that). Even if hubstorage provides HCF batch pagination in the future, that would still be a useful feature; it is not the first time it has been discussed.
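As an illustration of that tracking idea (not existing code in hcf-backend or scrapy-frontera; all names below are made up):

```python
# Sketch: remember which requests came from which batch and delete a batch
# as soon as all of them are processed, subject to an error budget.
from collections import defaultdict


class BatchTracker:
    def __init__(self, delete_batch, max_errors_per_batch=0):
        self.delete_batch = delete_batch   # callable(batch_id), e.g. wrapping slot.queue.delete
        self.max_errors = max_errors_per_batch
        self.pending = defaultdict(set)    # batch_id -> pending request fingerprints
        self.errors = defaultdict(int)     # batch_id -> error count

    def batch_read(self, batch_id, fingerprints):
        self.pending[batch_id].update(fingerprints)

    def request_done(self, batch_id, fingerprint, failed=False):
        self.pending[batch_id].discard(fingerprint)
        if failed:
            self.errors[batch_id] += 1
        # Delete only when every request of the batch was processed and the
        # configurable condition (here: error budget) is satisfied.
        if not self.pending[batch_id] and self.errors[batch_id] <= self.max_errors:
            self.delete_batch(batch_id)
            del self.pending[batch_id]
```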

hermit-crab commented 5 years ago

I haven't thoroughly checked the code to see if it's already there, but would it make sense to enforce consumer closure (i.e. only reading HCF once) after the first MAX_NEXT_REQUESTS are read, or to document / show a warning when the HCF_CONSUMER_MAX_REQUESTS / HCF_CONSUMER_MAX_BATCHES values exceed MAX_NEXT_REQUESTS (when DELETE_BATCHES_ON_STOP is turned on)? This seems to be a non-obvious surprise that a user could only catch if they are well familiar with the framework mechanics and hubstorage limitations.
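Something along these lines, for example (the setting names come from this thread; where exactly the check would live is not decided):

```python
# Sketch of the suggested warning, e.g. run at spider/backend start-up.
import logging

logger = logging.getLogger(__name__)


def warn_if_batches_wont_advance(settings):
    if not settings.getbool('HCF_CONSUMER_DELETE_BATCHES_ON_STOP'):
        return
    max_next = settings.getint('MAX_NEXT_REQUESTS')
    max_requests = settings.getint('HCF_CONSUMER_MAX_REQUESTS')
    if max_next and max_requests and max_requests > max_next:
        logger.warning(
            'HCF_CONSUMER_MAX_REQUESTS (%d) exceeds MAX_NEXT_REQUESTS (%d) while '
            'HCF_CONSUMER_DELETE_BATCHES_ON_STOP is enabled: the consumer will '
            're-read the same initial batches instead of advancing.',
            max_requests, max_next,
        )
```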

kalessin commented 5 years ago

Yes, that would be a good idea, @hermit-crab