qiongzhu opened this issue 5 months ago
I can reproduce this, even with 2.11.
If both sides are WQ, that is the way it is designed. Removing a message from one stream currently never affects the other one (the source or the mirror).
As an aside, what is your use case for streaming from a work-queue stream? Work-queue streams exist to provide exactly-once consumption (transmission plus processing) of messages, not really to be sourced from.
Here is the use case: we have a hub-and-spoke topology with multiple leaf nodes connected to a hub cluster, and we want to 'move' messages from the hub to the spokes. So we publish messages in the hub cluster, record them in a work-queue JetStream stream, and then 'source' that stream on the spoke side.
This way, messages that are replicated to a spoke cluster are deleted from the hub cluster, and we get high availability on both the sender and the receiver side even if the leafnode connection is interrupted.
In this issue, we describe an inconsistency we found in work-queue JetStream streams:
Observed behavior
Work-queue JetStream messages are not deleted on non-leader nodes when the stream is used as a mirror source; the stream members hold inconsistent histories and cannot recover from that state.
Expected behavior
Server and client version
nats-server: 2.10.14 and 2.9.25 both have this problem; natscli: 0.1.4
Host environment
Using the official Docker image nats:2.10.14 or nats:2.9.25; the official binary release also has the same problem.
Steps to reproduce
Environment: a local 3-node NATS cluster.
Create a simple config file, nats-account.conf, with the following content:
Run a fully local 3-node cluster with Docker; you can use nats:2.10.14 or nats:2.9.25. Then wait for the cluster to start up, and create a NATS CLI context for easy access.
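The contents of nats-account.conf are not included here. As a hypothetical stand-in for a comparable environment, the sketch below writes one config file per node for a 3-node JetStream cluster and wraps the CLI context step. The node names node1..node3 match the issue; the cluster name local, the ports, the store_dir path, the context name and the function names are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: one config per node for a local 3-node JetStream cluster.
# Cluster name, ports and store_dir are assumptions, not the reporter's config.
write_node_configs() {
  local dir=$1 n
  mkdir -p "$dir"
  for n in 1 2 3; do
    # Every node lists all three routes; a server skips the route to itself.
    cat > "$dir/node$n.conf" <<EOF
server_name: node$n
listen: 0.0.0.0:4222

jetstream {
  store_dir: /data/jetstream
}

cluster {
  name: local
  listen: 0.0.0.0:6222
  routes: [nats-route://node1:6222, nats-route://node2:6222, nats-route://node3:6222]
}
EOF
  done
}

# Create and select a CLI context named "local" (name and URL are assumptions).
save_context() {
  nats context add local --server nats://localhost:4222 --select
}
```

The files could then be mounted into three nats:2.10.14 (or nats:2.9.25) containers on a shared Docker network whose container names resolve as node1..node3; save_context assumes node1's client port is published on localhost:4222.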
Steps to reproduce this problem:
Create a JetStream stream, src, as the mirror source, with work-queue retention, file storage and R=3.
Create a JetStream stream, dst, as the mirror destination.
Use all defaults in the subsequent questions about how to import the source stream.
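The exact stream-creation commands are not shown. A minimal sketch, assuming current natscli flag names, the subject filter src.> (chosen only because messages are later published to src.hello), and a hypothetical create_streams function:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the two streams in this report; the flag values are
# assumptions, since the original commands are not included here.
create_streams() {
  # Source stream: work-queue retention, file storage, 3 replicas.
  nats stream add src --subjects 'src.>' --storage file --replicas 3 \
    --retention work --defaults
  # Destination stream: a mirror of src. --defaults is assumed to accept the
  # default answers to the mirror import questions.
  nats stream add dst --mirror src --defaults
}
```

If --defaults does not cover the mirror questions in your natscli version, drop it from the second command and accept the defaults interactively.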
Now we have the following stream report; notice that the replication report indicates the mirror is working.
Now we send 10 messages to the source stream.
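The 10 messages can be sent with the same loop this report uses for its second batch. The hypothetical send_batch helper below only parameterizes the subject, the count and the delay (defaults: src.hello, 10, 1 second).

```bash
#!/usr/bin/env bash
# Send COUNT messages to SUBJECT, one every DELAY seconds, using the same
# payload format as the loop quoted in this issue ("<idx> | <date>").
send_batch() {
  local subject=${1:-src.hello} count=${2:-10} delay=${3:-1} idx
  for ((idx = 0; idx < count; idx++)); do
    nats req "$subject" "${idx} | $(date)"
    sleep "$delay"
  done
}
```

nats req waits for a reply, and a stream that captures the subject answers with a JetStream publish acknowledgement, so each iteration also confirms that the message was stored.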
Check the stream report: it is OK. The messages were mirrored to the dst stream and removed from src (because src is a work-queue stream):
Check the stream state of src; it is OK. Please notice that the current leader is node1.
Here is the problem: now we request a cluster election, or restart the current leader of stream 'src' to force an election, like this:
Then the stream runs into trouble. For example, when we run 'step down', the immediate output shows:
The messages consumed by the mirror reappear in stream src. This can be verified via nats stream state src or nats stream get src ${idx}; those messages can indeed be accessed.
Luckily, we can issue several more nats stream cluster step-down src commands to re-select the correct leader, node1, which makes the stream correct again.
To make the stream cluster consistent again, the following steps help (with node1 as the leader):
nats stream edit src --replicas=1 -f
nats stream edit src --replicas=3 -f
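The verification and recovery commands quoted above can be wrapped as two hypothetical helpers. inspect_stream prints the stream state and tries to fetch a range of sequence numbers; reset_replicas runs the R=1 then R=3 workaround and, following the report, should only be run while node1 (the replica with the correct state) is the leader. The sequence range passed to inspect_stream is an assumption: pick the sequences of the messages you sent.

```bash
#!/usr/bin/env bash
# Hypothetical helper: show the state of STREAM and try to fetch sequences
# FIRST..LAST; messages that should have been removed will still be fetched.
inspect_stream() {
  local stream=$1 first=$2 last=$3 seq
  nats stream state "$stream"
  for ((seq = first; seq <= last; seq++)); do
    nats stream get "$stream" "$seq" || echo "seq $seq: not found"
  done
}

# Hypothetical helper: the workaround from this issue, shrinking the stream to
# one replica and growing it back to three. Run it only while the correct
# replica leads the stream.
reset_replicas() {
  local stream=$1
  nats stream edit "$stream" --replicas=1 -f &&
    nats stream edit "$stream" --replicas=3 -f
}
```

For example, inspect_stream src 1 10 checks the first batch after a leader change, and reset_replicas src applies the workaround once node1 leads again.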
After that, the stream looks normal, but it is not. We can issue multiple nats stream cluster step-down src commands to select a leader other than node1, for example node2, and then send another 10 messages with:
for idx in {0..9} ; do nats req 'src.hello' "${idx} | $(date)" ; sleep 1 ; done
Using nats stream report, we can see that the 10 new messages are now mirrored to dst. Then, running nats stream state src, we can see that the 10 new messages in the work queue are not deleted by the mirror process.
After that, we can issue multiple nats stream cluster step-down src commands until node1 is selected as leader; its state is correct, even though this node was not the leader at that time.
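Issuing step-down commands until a particular node leads the stream can be scripted. This sketch assumes that nats stream info --json reports the leader as a "leader" field under "cluster", and it parses that field with grep and sed rather than a JSON tool. It gives up after a bounded number of attempts. The function names and the STEP_DOWN_DELAY variable are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the current leader of STREAM, taken from the
# first "leader" field in the JSON stream info (assumed to be cluster.leader).
stream_leader() {
  nats stream info "$1" --json |
    grep -o '"leader": *"[^"]*"' | head -n 1 | sed 's/.*"\([^"]*\)"$/\1/'
}

# Hypothetical helper: step down until WANTED leads STREAM, trying at most
# MAX_TRIES times; returns 0 on success and 1 otherwise.
step_down_until() {
  local stream=$1 wanted=$2 max_tries=${3:-10} i
  for ((i = 0; i < max_tries; i++)); do
    if [[ "$(stream_leader "$stream")" == "$wanted" ]]; then
      return 0
    fi
    nats stream cluster step-down "$stream" >/dev/null
    # Give the new election time to settle before checking again.
    sleep "${STEP_DOWN_DELAY:-2}"
  done
  [[ "$(stream_leader "$stream")" == "$wanted" ]]
}
```

For example, step_down_until src node2 moves leadership away from node1 before sending the second batch, and step_down_until src node1 returns to the replica whose state is correct.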