It's difficult to understand if a very long timeout in `fetch` could stall the extraction worker's progress

TimDaub commented 2 years ago

Consider this: we use better-queue with a configurable concurrency parameter.
But in some cases, a `fetch` request can take up to 300 seconds to time out (as implemented in Chrome): https://www.benmvp.com/blog/quickie-fetch-timeout/
So if, e.g., we set concurrency to 200, the following effect could occur:
- We crawl with 200 concurrent requests, but occasionally a request takes 300 seconds to resolve and so occupies one slot in the queue for 300 seconds.
- If this happens often, we could suddenly have 200 requests that each take 300 seconds to resolve, so in practice we're no longer making requests to healthy endpoints and the extraction-worker stalls.

For now, this is merely a suspicion of mine, and fixing #22, for example, would help us understand whether this really happens.
E.g., we could allow the user to configure a MAX_TIMEOUT that removes a request from the queue after some time.
Later, we could even start calculating a "healthy request time" metric that would allow us to remove unhealthy-looking requests from the queue.
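
To make the MAX_TIMEOUT idea concrete, here is a minimal sketch of a per-request timeout built on `AbortController`. It is not how extraction-worker does it today: `fetchWithTimeout`, `MAX_TIMEOUT_MS`, `timeoutMs` and `fetchImpl` are hypothetical names made up for this illustration, and the 10 second default is an arbitrary assumption.

```javascript
// Hypothetical default; the thread only proposes that MAX_TIMEOUT be configurable.
const MAX_TIMEOUT_MS = 10000;

// Sketch: abort a fetch-like call once `timeoutMs` has elapsed.
// `fetchImpl` defaults to the global fetch (Node 18+ or browsers) and can be
// swapped out, e.g. for a stub in tests.
async function fetchWithTimeout(url, options = {}) {
  const {
    timeoutMs = MAX_TIMEOUT_MS,
    fetchImpl = globalThis.fetch,
    ...fetchOptions
  } = options;

  const controller = new AbortController();
  // When the timer fires, the signal aborts and the pending request rejects.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetchImpl(url, { ...fetchOptions, signal: controller.signal });
  } finally {
    // Always clear the timer so a finished request does not keep it alive.
    clearTimeout(timer);
  }
}
```

With something like this, a slow endpoint would give up its queue slot after `timeoutMs` instead of after the platform's default timeout. Newer runtimes also offer `AbortSignal.timeout(ms)`, which could replace the manual timer.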

/cc @il3ven

TimDaub commented 2 years ago

Debugging journal:

I've now enabled debugging by logging `queue.getStats()`, which is actually super helpful here.
It's exactly as I expected: the average task completion time goes up a lot over the course of requesting data.
At first it's just a few milliseconds, and then (just before the first fetch timeouts happen) it averages 45 s per request, which is huge.

2022-06-29T09:44:40.703Z neume-network-extraction-worker:worker {"successRate":0.9954193093727978,"peak":29703,"average":45504.534531360114,"total":5676}
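
To put the logged average into perspective, here is a rough back-of-the-envelope sketch. The stats object is copied from the log line above. The concurrency of that run is not stated anywhere in this thread, so 200 (the example value from the first comment) is purely an assumption, and `maxThroughputPerSecond` is a hypothetical helper.

```javascript
// Stats copied from the log line above; `average` is treated as milliseconds,
// which matches the "45s per request" reading in this comment.
const stats = {
  successRate: 0.9954193093727978,
  peak: 29703,
  average: 45504.534531360114,
  total: 5676,
};

// ASSUMPTION: the real concurrency of this run is unknown; 200 is only the
// example value used earlier in this issue.
const ASSUMED_CONCURRENCY = 200;

// Upper bound on sustained throughput when every slot is always busy
// (Little's law: tasks in flight = throughput * time per task).
function maxThroughputPerSecond(averageMs, concurrency) {
  return concurrency / (averageMs / 1000);
}

const averageSeconds = stats.average / 1000;
const throughput = maxThroughputPerSecond(stats.average, ASSUMED_CONCURRENCY);

console.log(averageSeconds.toFixed(1)); // 45.5
console.log(throughput.toFixed(1)); // 4.4
```

Under that assumption, 200 slots could complete at most about 4.4 tasks per second at a 45.5 s average, no matter how many healthy endpoints are waiting in the queue.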

TimDaub commented 2 years ago

Found that there's potentially a problem with better-queue's maxTimeout: https://github.com/diamondio/better-queue/issues/81

TimDaub commented 2 years ago

eth-fun@0.8.0 allows managing timeouts on requests: https://github.com/rugpullindex/eth-fun/blob/master/CHANGELOG.md#080
Via message-schema we should allow a timeout option: https://github.com/neume-network/message-schema/issues/18. extraction-worker should then time out individual requests.

il3ven commented 2 years ago

I did an experiment: I pushed 6 tasks to the queue, the second of which was set up to take a very long time. I found that the second task did not stall the queue as long as the concurrency was greater than 1.

This makes sense. We can picture it like this: with a concurrency of two, we have two workers that can execute our tasks in parallel. If one of the workers gets blocked by a long task, the other worker can keep on executing the remaining tasks.

https://user-images.githubusercontent.com/4337699/177052097-f60d7970-a323-42f1-a8ec-89e651b297e2.mov
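
For readers who can't play the video, the experiment can be approximated with a tiny hand-rolled promise pool. This is only a sketch and deliberately does not use better-queue: `runPool`, `sleep` and `makeTasks` are hypothetical helpers, and the 10 ms / 500 ms durations are made-up stand-ins for the "short" and "very long" tasks.

```javascript
// Minimal promise pool (NOT better-queue): `concurrency` workers pull tasks
// from a shared index until none are left. Returns task ids in finish order.
async function runPool(tasks, concurrency) {
  const finished = [];
  let next = 0;

  async function worker() {
    while (next < tasks.length) {
      const id = next++;
      await tasks[id]();
      finished.push(id);
    }
  }

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return finished;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Six tasks; the second one (index 1) is the slow one. Durations are assumed.
const makeTasks = () =>
  [10, 500, 10, 10, 10, 10].map((ms) => () => sleep(ms));

// Usage:
// runPool(makeTasks(), 2).then((order) => console.log(order));
// runPool(makeTasks(), 1).then((order) => console.log(order));
```

With a concurrency of 2, the slow task occupies one worker while the other worker drains the five short tasks, so index 1 finishes last. With a concurrency of 1, every task after index 1 has to wait for it.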

TimDaub commented 2 years ago

> If one of the workers gets blocked due to a long task the other worker can keep on executing the tasks.

Yes, but I'm outlining a different problem: with a concurrency of, e.g., 200 parallel workers, the non-problematic tasks never block the queue, but over time more than 200 problematic tasks can accumulate and clog it up. Think about it this way: we have 20000 good tasks to execute and only 200 bad tasks that each take, e.g., 5 minutes to clear. If those 200 bad tasks are spread across the 20000 good ones, there's a good chance that the queue gets clogged up and doesn't run at full concurrency all the time. Hence, additionally allowing timeouts to be configured, so that uneconomic tasks are ended more efficiently, can be a good thing.
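
The scenario above can be sketched with a small, simplified scheduling model. It is not better-queue and makes several assumptions that are not in this thread: a good task takes 100 ms, a timed-out task simply ends at a hypothetical 5 second cap, and each task is handed to whichever slot frees up first. `makespan`, `GOOD_MS` and `TIMEOUT_CAP_MS` are names invented for this illustration.

```javascript
// Simplified model (not better-queue): each task goes to the slot that becomes
// free earliest. Returns the time at which the last slot finishes.
function makespan(durations, concurrency) {
  const freeAt = new Array(concurrency).fill(0);
  for (const duration of durations) {
    let earliest = 0;
    for (let i = 1; i < concurrency; i++) {
      if (freeAt[i] < freeAt[earliest]) earliest = i;
    }
    freeAt[earliest] += duration;
  }
  return Math.max(...freeAt);
}

const CONCURRENCY = 200; // example value from this issue
const BAD_MS = 5 * 60 * 1000; // "5mins" per bad task, as in the comment above
const GOOD_MS = 100; // ASSUMPTION: duration of a healthy request
const TIMEOUT_CAP_MS = 5000; // ASSUMPTION: hypothetical configured timeout

// 20000 good tasks, with one bad task after every 100th good one (200 bad).
const durations = [];
for (let i = 1; i <= 20000; i++) {
  durations.push(GOOD_MS);
  if (i % 100 === 0) durations.push(BAD_MS);
}

const uncapped = makespan(durations, CONCURRENCY);
const capped = makespan(
  durations.map((d) => Math.min(d, TIMEOUT_CAP_MS)),
  CONCURRENCY
);

console.log({ uncappedSeconds: uncapped / 1000, cappedSeconds: capped / 1000 });
```

In this model the uncapped run can never finish in under 300 seconds, because a single bad task already takes that long, while the capped run is bounded by the much smaller total amount of work. The exact numbers depend entirely on the assumed durations, so treat the output as an illustration of the effect, not a measurement.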

neume-network / extraction-worker

It's difficult to understand if a very long timeout in `fetch` could stall the extraction worker's progress #23