stacks-network / stacks-core

The Stacks blockchain implementation
https://docs.stacks.co
GNU General Public License v3.0

Improvements to event observer timeout #5288

Open CharlieC3 opened 1 week ago

CharlieC3 commented 1 week ago


Currently the default timeout for an event observer is 1s, but it can be configured to be higher if needed. There is at least one known case where a timeout of roughly 30 seconds or more is needed: processing block 0 of the Nakamoto Testnet when the Stacks Blockchain API is a configured observer. Block 0 is typically very large, and it takes a long time to write all of its data to the API's database. While this is the only example we know of at the moment, that does not mean a timeout of 1s is sufficient for every other scenario.

Increasing the timeout to something like 60s for any API observer syncing from genesis is a way around this issue. However, in the event of transient network issues, a 60s timeout is much longer than operators would prefer to wait when a simple retry is all that's needed to resolve the connection problem. Additionally, a lot of people run APIs; it would be nice if an extra node configuration weren't needed just to sync from genesis.

In a thread regarding this topic, some ideas were suggested to circumvent this problem by making the timeout for event observers dynamic, based on the following:

A) The number of times the timeout has been reached. If a timeout of 1s has been reached, double it on the next attempt and try 2s. Continue doubling until some maximum timeout (like 60s) is reached.

B) The size of the payload. Generally, a large payload will take longer to send, and longer for the receiving service to process and respond to, than a small payload.

Both of these qualities can be used to dynamically determine the appropriate timeout for an observer without being hard-stuck on a timeout that's too long or a timeout that's too short for every event.
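A minimal sketch of option A in Rust (names and the 1s/60s values are illustrative, not actual stacks-core code): double the timeout after each failed delivery attempt, capped at a maximum.

```rust
use std::time::Duration;

/// Hypothetical helper: given the timeout used on the failed attempt,
/// return the timeout for the next retry, doubling up to `max`.
fn next_timeout(current: Duration, max: Duration) -> Duration {
    std::cmp::min(current * 2, max)
}

fn main() {
    let max = Duration::from_secs(60);
    let mut timeout = Duration::from_secs(1);
    let mut schedule = Vec::new();
    // Simulate eight consecutive failed attempts.
    for _ in 0..8 {
        schedule.push(timeout.as_secs());
        timeout = next_timeout(timeout, max);
    }
    // Backoff sequence: 1, 2, 4, 8, 16, 32, 60, 60
    println!("{:?}", schedule);
}
```

Per obycode's note below, the cap bounds each individual attempt's timeout; the node would keep retrying indefinitely.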

jcnelson commented 1 week ago

Can we just set the timeout to the age of the universe or something else unreasonably large? Like, the node should stall forever until the data gets pushed to and acknowledged by the observer.

CharlieC3 commented 1 week ago

@jcnelson Brice and I recall there have been times where a retry is what helped the stacks node recover when communicating with an observer. If the timeout was unreasonably large, the retry would've never occurred and may have resulted in a stalled node.

From @obycode

Yeah, now I remember that there was a problem we were seeing where a retry did help. That's actually what forced us to switch over to this HTTP implementation in the first place, because the one we were using didn't allow us to timeout during the connect, so it just got stuck there forever.

jcnelson commented 1 week ago

Ah, yes, forgot that.

I'm fine with option A, as long as we have a cap on the timeout.

obycode commented 1 week ago

Yeah, just to clarify, the node will still stall forever. The timeout just defines how long it will wait before retrying. It will keep retrying indefinitely.

CharlieC3 commented 1 week ago

I think both A and B would be needed for a comprehensive solution that covers the edge cases we're already aware of. Rafael and @zone117x report there could be issues with the API if a large payload is sent and the timeout + retry occurs too quickly, before the API has a chance to respond; this may cause database corruption, which would in turn stall the stacks node.

Initially, it might seem like setting a higher timeout for API observers, like 60s, is a good idea. However, by doing so, you lose the benefit of retrying the connection quickly during transient network problems. Most payloads to the API are processed very quickly; it's only in extremely rare cases that a payload comes through large enough to require a higher timeout. Ideally, a downstream observer like this wouldn't be locked to a high timeout for every payload. So if payload size were taken into account (point B), the first request attempt would size the timeout appropriately when the payload exceeds a certain threshold (e.g. 10MB), avoiding the problem where requests are retried too quickly. This would allow an observer to keep benefitting from quick timeouts/retries on all other payloads.
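Point B could be sketched like this (a hypothetical illustration; the 10MB threshold and 1s/60s timeouts come from the numbers floated in this thread, not from actual stacks-core configuration):

```rust
use std::time::Duration;

/// Illustrative threshold above which a payload is considered "large".
const LARGE_PAYLOAD_BYTES: usize = 10 * 1024 * 1024; // 10MB

/// Hypothetical helper: choose the first-attempt timeout based on
/// payload size, so large payloads (e.g. Nakamoto block 0) start with a
/// generous timeout while everything else keeps quick timeouts/retries.
fn initial_timeout(payload_len: usize) -> Duration {
    if payload_len >= LARGE_PAYLOAD_BYTES {
        Duration::from_secs(60) // big payload: give the observer time to process
    } else {
        Duration::from_secs(1) // normal payload: fail fast and retry
    }
}

fn main() {
    println!("small: {:?}", initial_timeout(4 * 1024));
    println!("large: {:?}", initial_timeout(50 * 1024 * 1024));
}
```

From there, the exponential backoff of option A would apply on each subsequent retry, starting from whichever initial timeout was chosen.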