systeminit / si

The System Initiative software
https://systeminit.com
Apache License 2.0

feat: nack veritech function run failures #4606

Closed sprutton1 closed 3 days ago

sprutton1 commented 4 days ago

This allows Veritech to nack function run requests that fail before the function actually executes, for example when the cyclone pool is starved or the message fails to deserialize. On nack, the message is put back into JetStream, allowing a Veritech instance (the same one or a different one) to pick it up and try again. This should help cover pool starvation scenarios, giving us a chance to retry failures and let the pool refill. We will retry messages up to five times before giving up.
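To illustrate the intended flow, here is a minimal, self-contained sketch using an in-memory queue as a stand-in for JetStream. It is not the code in this PR: `Queue`, `Message`, `PreExecError`, `Outcome`, `drain`, and `MAX_DELIVERIES` are all hypothetical names, and it assumes "up to five times" means five total delivery attempts.

```rust
use std::collections::VecDeque;

/// Assumed cap: five total delivery attempts per message (hypothetical constant).
const MAX_DELIVERIES: u32 = 5;

/// Failures that happen before the function itself runs (hypothetical type).
#[derive(Debug, PartialEq)]
enum PreExecError {
    PoolStarved,
    Deserialize,
}

/// Final disposition of a message (hypothetical type).
#[derive(Debug, PartialEq)]
enum Outcome {
    Acked,
    GaveUp,
}

struct Message {
    payload: String,
    deliveries: u32,
}

/// In-memory stand-in for the work queue; not a real JetStream client.
struct Queue {
    pending: VecDeque<Message>,
}

impl Queue {
    fn new() -> Self {
        Queue { pending: VecDeque::new() }
    }

    fn publish(&mut self, payload: &str) {
        self.pending.push_back(Message { payload: payload.to_string(), deliveries: 0 });
    }

    /// Deliver the next message, counting this delivery attempt.
    fn next(&mut self) -> Option<Message> {
        let mut msg = self.pending.pop_front()?;
        msg.deliveries += 1;
        Some(msg)
    }

    /// Nack: put the message back so any worker can pick it up again.
    fn nack(&mut self, msg: Message) {
        self.pending.push_back(msg);
    }
}

/// Process messages until the queue is empty, nacking pre-execution failures
/// until the delivery cap is reached. Returns (payload, deliveries, outcome).
fn drain<F>(queue: &mut Queue, mut try_run: F) -> Vec<(String, u32, Outcome)>
where
    F: FnMut(&Message) -> Result<(), PreExecError>,
{
    let mut results = Vec::new();
    while let Some(msg) = queue.next() {
        match try_run(&msg) {
            Ok(()) => results.push((msg.payload, msg.deliveries, Outcome::Acked)),
            Err(_) if msg.deliveries < MAX_DELIVERIES => queue.nack(msg),
            Err(_) => results.push((msg.payload, msg.deliveries, Outcome::GaveUp)),
        }
    }
    results
}
```

In this sketch, a message whose handler fails twice and then succeeds is acked on its third delivery, while one that always fails is dropped after the fifth. That drop is the point where a dead letter queue could take over.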

In the future, we should consider a dead letter queue so we can determine what to do with these when they completely fail out.

This PR also reduces the execution pool timeout to 2 minutes, on the assumption that if we can't serve a pool get request within that window, another Veritech instance might be able to. I also added a timeout to the pool healthcheck to catch cases where we might hang on startup.
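As a rough illustration of a bounded pool get, the sketch below uses a standard-library channel as a toy pool. `Pool`, `PoolTimeout`, and `POOL_GET_TIMEOUT` are hypothetical names, and the real execution pool in Veritech is not implemented this way; only the 2-minute figure comes from the description above.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::time::Duration;

/// The 2-minute window from the description (hypothetical constant name).
const POOL_GET_TIMEOUT: Duration = Duration::from_secs(120);

/// Returned when no instance became available in time (hypothetical type).
#[derive(Debug, PartialEq)]
struct PoolTimeout;

/// Toy pool: idle instances wait in a channel until someone takes one.
struct Pool<T> {
    idle_tx: Sender<T>,
    idle_rx: Receiver<T>,
}

impl<T> Pool<T> {
    fn new() -> Self {
        let (idle_tx, idle_rx) = mpsc::channel();
        Pool { idle_tx, idle_rx }
    }

    /// Return an instance to the pool.
    fn put(&self, item: T) {
        // The pool owns the receiver, so this send cannot fail while the pool is alive.
        let _ = self.idle_tx.send(item);
    }

    /// Wait at most `timeout` for an idle instance. On timeout the caller
    /// can nack the request so another worker may serve it.
    fn get(&self, timeout: Duration) -> Result<T, PoolTimeout> {
        self.idle_rx.recv_timeout(timeout).map_err(|_| PoolTimeout)
    }
}
```

A caller would use `pool.get(POOL_GET_TIMEOUT)` and treat `Err(PoolTimeout)` as a pre-execution failure that is eligible for a nack.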

For @fnichol and @nickgerace: I made a light attempt at refactoring this to remove the code duplication, but wrestling with the types made it a fairly large effort. I think it would be better served by a refactor-only PR, so we aren't also contending with behavior changes while trying to get it right.

sprutton1 commented 4 days ago

/try

sprutton1 commented 4 days ago

/try