[Feature request] Dead letter queue for NATS

embano1 commented 4 years ago

My actions before raising this issue

[x] Followed the troubleshooting guide
[x] Read/searched the docs
[x] Searched past issues

With async invocation, there seems to be no way to tell whether the invocation eventually succeeded. Failure could be caused by API issues, functions being deleted or not accepting connections (SIGTERM), event payload issues causing exceptions, or simple application logic bugs within the function.

For async invocation this is usually handled with a dead letter queue (DLQ). I could not find any mention of DLQ support in OpenFaaS/NATS (STAN). How is this dealt with today? Is it a concern at all? Does STAN automatically redrive failed invocations? If so, how many times does it retry before giving up?

Expected Behaviour

Failures during async function invocation should be trackable, ideally through a DLQ where events can be inspected and potentially redriven.

Current Behaviour

Tested async invocation via faas-cli and via a connector built with connector-sdk, where the subscribed function does not exist (anymore). No error was reported, leaving the caller to believe that the invocation would eventually succeed. (A 202 technically gives no guarantee, so introspection capabilities would be generally useful in a 202 setup.)

A workaround seems to be to provide callbacks where the error status can be inspected. I am not sure whether this is always possible (CLI) or desired.

For details, see: https://github.com/openfaas/faas/issues/1298
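To make the callback workaround concrete, here is a minimal Python sketch. It assumes the gateway's `/async-function/<name>` route and the `X-Callback-Url` request header, and it assumes the callback request carries an `X-Function-Status` header with the function's HTTP status; check the header name against your queue-worker version. The helper names `build_async_request` and `invocation_failed`, the gateway address, and the callback URL are hypothetical.

```python
import urllib.request


def build_async_request(gateway, function, payload, callback_url):
    """Build (but do not send) an async invocation that asks for a callback."""
    url = f"{gateway.rstrip('/')}/async-function/{function}"
    # Assumption: the result is POSTed to the URL given in X-Callback-Url.
    return urllib.request.Request(
        url,
        data=payload,
        headers={"X-Callback-Url": callback_url},
        method="POST",
    )


def invocation_failed(callback_headers):
    """Return True unless the callback reports a 2xx function status.

    Assumption: the status arrives in an X-Function-Status header.
    A missing or non-numeric value is treated as a failure.
    """
    status = callback_headers.get("X-Function-Status", "")
    return not (status.isdigit() and 200 <= int(status) < 300)


if __name__ == "__main__":
    req = build_async_request(
        "http://127.0.0.1:8080", "figlet", b"hello", "http://example.com/callback"
    )
    print(req.get_method(), req.full_url)
    # Send with urllib.request.urlopen(req) against a real gateway.
```

The callback receiver would call `invocation_failed` on the incoming headers and store or alert on failures. This only helps callers that can host a callback endpoint, which is the limitation noted above for CLI use.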

Possible Solution

Implement a DLQ capability. Are there already metrics exposed for failed async function invocations?

Steps to Reproduce (for bugs)

Simply call faas-cli invoke -a (or curl) on a nonexistent function.

Context

I sense potential consistency issues (no error is reported even though the function was not executed at all), leading to hard-to-debug problems. Also, malformed payloads and application logic bugs could be hidden by the current implementation (if my understanding of the issue is correct and complete).

alexellis commented 4 years ago

/set title: [Feature request] Dead letter queue for NATS

alexellis commented 4 years ago

NATS does not provide a DLQ. I spent some time looking into building a DLQ when building colorisebot, but it's complicated. If the upstream API is failing due to rate-limiting, then retrying N times without an appropriate back-off is counter-productive.
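As a rough illustration of the back-off point, the following Python sketch retries an invocation with exponentially growing delays and parks the event in a dead-letter list once the attempts are exhausted. It is not how OpenFaaS, NATS, or the repositories linked below work; `deliver_with_backoff`, its parameters, and the in-memory `dead_letters` list are hypothetical stand-ins for a real queue.

```python
import time


def deliver_with_backoff(invoke, event, dead_letters, max_attempts=4,
                         base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call invoke(event); on failure wait base_delay * 2**attempt and retry.

    After max_attempts failures the event is appended to dead_letters
    (a plain list standing in for a DLQ) and None is returned.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return invoke(event)
        except Exception as err:  # sketch only: narrow this in real code
            last_error = err
            if attempt < max_attempts - 1:
                # Exponential back-off, capped so delays stay bounded.
                sleep(min(max_delay, base_delay * 2 ** attempt))
    dead_letters.append(
        {"event": event, "error": repr(last_error), "attempts": max_attempts}
    )
    return None


if __name__ == "__main__":
    def rate_limited(_event):
        raise RuntimeError("429 Too Many Requests")

    dlq = []
    deliver_with_backoff(rate_limited, {"id": 1}, dlq, max_attempts=3, base_delay=0.1)
    print(dlq)
```

Events left in `dead_letters` can then be inspected and redriven manually. A production version would likely also honor any retry hint the upstream API returns and add jitter to the delays.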

https://github.com/alexellis/mailbox

https://github.com/alexellis/rate-limited-mailbox

openfaas / nats-queue-worker