openfaas / nats-queue-worker

Queue-worker for OpenFaaS with NATS Streaming
https://docs.openfaas.com/reference/async/
MIT License
128 stars 59 forks source link

Clarify queue-worker durable queue implementation and delivery semantics #84

Closed embano1 closed 3 years ago

embano1 commented 4 years ago

Before we start working on DLQ (#81) and ACK options (#80) it would be good to get a better understanding of the current code regarding the durable queue subscription.

1) Durable queue seems to be optional (default off) as per environment configuration (see recent fix #76) - why is durable subscription not default/always on? Isn't that the desired behavior in async mode? 2) var durable, even if unset, will be passed to q.conn.QueueSubscribe() with stan.DurableName(q.durable) - what is the STAN behavior if one passes an empty string here? STAN code doesn't check, so I am not 100% sure about the expected behavior (especially in combination with the queue-worker unsubscribe call during shutdown) 3) q.startOption is set to stan.StartWithLastReceived() - related to the unclear (at least to me) behavior described in (2) durable subscription semantics are described as "Note that once a queue group is formed, a member's start position is ignored" (redundant?) 4) stan.AckWait(q.ackWait) (30s) is overwritten by faas-netes [1], we should be consistent here 5) ack, ack-wait and persistency (durability) behavior (in-memory) should be documented so users understand the "at most once delivery" semantics for the async implementation (no retries for sync gateway fn calls/callbacks if specified) - with one exception (see next) 6) if the invoked (sync) function runs longer than ack-wait (30/60s default depending on environment), a message could be delivered multiple times when running multiple queue-workers - receivers should account for idempotency (clarify message handling in docs) 7) returning "202 accepted" by the gateway for async routes underlines the "at most once delivery semantics" (e.g. edge case where message is not persisted and gateway crashes) - would a "201 created" be more aligned with what users expect (especially when considering "at least once" semantics with higher durability?); also see related issue [2] 8) several different timeout handlers are at play (http client for gateway/callback invocations, STAN ack-wait handling) - more consistency or guidelines in docs 9) with the upcoming architectural changes in NATS (JetStream) how does this affect the queue-worker implementation (tracking item to keep this in mind when making changes)

cc/ @bmcstdio

[1] https://github.com/openfaas/faas-netes/blob/41c33f9f7c29e8276bd01387f78d6f0cff847890/yaml/queueworker-dep.yml#L50

[2] https://github.com/openfaas/faas/issues/1298

Expected Behaviour

Current Behaviour

Possible Solution

Steps to Reproduce (for bugs)

1. 2. 3. 4.

Context

Your Environment

alexellis commented 3 years ago

This question came up on Slack recently. @matthiashanel had a rather good answer, but it looked like OpenFaaS can be configured for at least once delivery - you need to configure NATS as per the production guide in the docs.

If you are using the asynchronous invocations available in OpenFaaS then you may want to ensure high-availability or persistence for NATS Streaming. The default configuration uses in-memory storage. Options include clustering NATS and MySQL.

From Matthias:

Hi @Alec Thomas, on the api level streaming does at least once. The streaming server will re deliver messages to you as soon as ackwait is exceeded. (this is why you need to set ackwait to a value that is greater than your expected processing time. ) If your functions do not accumulate state, they can handle re deliveries and do not need to know about a cursor position. Basically async will ack the message once your function returns. But, if i recall correctly, the default setup for the streaming server in openfaas is single server in memory storage (perfectly fine default). Meaning it goes down, you won't be able to publish and depending on the failure will have lost messages. In order to prevent these kind of outages you can configure streaming to run in cluster mode Then you have multiple instances of the streaming server running, they use raft to coordinate. This should give you the availability you seem to be asking about. We also support a FT mode where you can start multiple server and synchronization is done through the file system (NTFS) or a database (in case you have one already. This works more like a cold standby. once the FS or DB lock is acquired, the server recovers state and is available. (recovery is proportional to the number of messages stored) . The last alternative is using a single server, a persistent volume mount but store messages to disk. this way you you won't get HA but at least messages are not lost.

From Lucas:

I am happy to say the default deployment attempts "at least once" delivery but to get true "at least once" the streaming server needs to be setup with HA and persistence This isn't much different from the same caveats that NATS gives (https://docs.nats.io/developing-with-nats-streaming/streaming) about "at least once": it is attempted to the best of its ability but server failure could cause losses, use HA to avoid this.