Closed embano1 closed 2 years ago
/set title: [Feature request] Dead letter queue for NATS
NATS does not provide a DLQ. I spent some time looking into building a DLQ when building colorisebot, but it's complicated. If the upstream API is failing due to rate-limiting, then retrying N times without an appropriate back-off is counter-productive.
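To illustrate the point about back-off, here is a minimal sketch of a capped exponential back-off schedule with optional jitter. The function name `backoff_delays` and its parameters are hypothetical and are not part of OpenFaaS or NATS; it only computes how long to wait before each retry.

```python
import random


def backoff_delays(max_retries, base=1.0, cap=60.0, jitter=True, rng=None):
    """Return the delay in seconds to wait before each retry attempt.

    The un-jittered delay doubles per attempt (base * 2**attempt) and is
    capped at `cap`. With jitter enabled, each delay is drawn uniformly
    from [0, capped delay] so that many clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    # Hypothetical values: 5 retries, starting at 1s, never more than 10s.
    print(backoff_delays(5, base=1.0, cap=10.0, jitter=False))
```

A rate-limited upstream would ideally also be given whatever wait time it asks for (for example via a Retry-After header), which this sketch does not handle.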
My actions before raising this issue
Using async invocation it seems there's no way to tell whether the invocation eventually succeeded. Failure could be caused by API issues, functions being deleted/not accepting connections (SIGTERM), event payload issues causing exceptions or simple app logic bugs within the function.
For async invocation this is usually handled with a dead letter queue (DLQ). I could not find any mention of DLQ support in OpenFaaS/NATS (STAN). How is this dealt with today? Is it a concern at all? Does STAN automatically redrive failed invocations? If so, how many times until it gives up?
Expected Behaviour
Failure during async function invocation should be trackable, if possible using DLQ where events can be inspected and potentially redriven.
Current Behaviour
Tested async invocation via faas-cli and a connector using connector-sdk where the subscribed function does not exist (anymore). There was no error reported leaving the caller believing that the invocation would eventually succeed (even though 202 technically does not give a guarantee, so introspection capabilities would be generally useful in a 202 setup).
A work around seems to be to provide callbacks where the error status can be introspected. Not sure if this is always possible (CLI) or desired.
For details, see: https://github.com/openfaas/faas/issues/1298
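A rough sketch of the callback workaround, assuming the gateway exposes async invocation under `/async-function/<name>` and honours an `X-Callback-Url` request header. The gateway address, function name, and callback URL below are placeholders. The `X-Function-Status` header read in `invocation_failed` is my assumption about what the callback carries and should be checked against the gateway in use. Whether a callback fires at all for a function that no longer exists is exactly the open question of this issue.

```python
import json
import urllib.request


def build_async_request(gateway, function, payload, callback_url):
    """Build (but do not send) an async invocation that asks for a callback."""
    return urllib.request.Request(
        url=f"{gateway.rstrip('/')}/async-function/{function}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the gateway posts the result to this URL when done.
            "X-Callback-Url": callback_url,
        },
        method="POST",
    )


def invocation_failed(callback_headers):
    """Decide from the callback's headers whether the invocation failed.

    Assumption: the function's HTTP status arrives in X-Function-Status.
    A missing or non-2xx value is treated as a failure.
    """
    status = callback_headers.get("X-Function-Status", "")
    return not (status.isdigit() and 200 <= int(status) < 300)


if __name__ == "__main__":
    # Placeholder values; nothing is sent over the network here.
    req = build_async_request(
        "http://127.0.0.1:8080", "colorise", {"url": "x"}, "http://cb.example/hook"
    )
    print(req.get_method(), req.full_url)
```

The receiver behind the callback URL would call `invocation_failed` on the incoming request headers and record or alert on failures. This only helps callers that can host such an endpoint, which is not the case for the CLI.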
Possible Solution
Implement a DLQ capability. Are there already metrics exposed for failed async function invocations?
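To make the request concrete, here is a minimal in-memory sketch of the behaviour I have in mind: messages that still fail after a fixed number of attempts are parked with their last error, so they can be inspected and redriven later. All names (`WorkQueue`, `DeadLetter`, `redrive`) are hypothetical. This does not reflect how OpenFaaS or NATS Streaming work internally, and it omits back-off and persistence.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class DeadLetter:
    payload: Any
    attempts: int
    last_error: str


@dataclass
class WorkQueue:
    handler: Callable[[Any], None]
    max_attempts: int = 3
    pending: deque = field(default_factory=deque)
    dead_letters: list = field(default_factory=list)

    def publish(self, payload):
        self.pending.append(payload)

    def drain(self):
        """Process all pending messages; park those that keep failing."""
        while self.pending:
            payload = self.pending.popleft()
            last_error = ""
            for _ in range(self.max_attempts):
                try:
                    self.handler(payload)
                    break  # success, move on to the next message
                except Exception as exc:
                    last_error = repr(exc)
            else:
                # Every attempt failed: keep the event instead of dropping it.
                self.dead_letters.append(
                    DeadLetter(payload, self.max_attempts, last_error)
                )

    def redrive(self):
        """Move dead letters back to pending; return how many were moved."""
        parked, self.dead_letters = self.dead_letters, []
        for letter in parked:
            self.pending.append(letter.payload)
        return len(parked)
```

With something like this, a malformed payload or a deleted function would leave a visible entry in `dead_letters` instead of disappearing silently, and a counter over that list could double as the failure metric asked about above.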
Steps to Reproduce (for bugs)
Simply call faas-cli -a (or curl) on a non-existing function.
Context
I sense potential consistency issues (no error reported while the function was not executed at all) leading to hard-to-debug issues. Also, malformed payloads and application logic bugs could be hidden by the current implementation (if my understanding of the issue is correct and complete).