nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Jetstream streams blocking subscribers #4248

Open Ann-Geo opened 1 year ago

Ann-Geo commented 1 year ago

nats-server version used: 2.9.17
nats.c client library version: 3.6.1

Issue: The JetStream streams seem to block subscribers in the middle of a run, preventing them from receiving any messages from the streams. When this happens, the server log shows a warning on the consumer for the blocked stream: "Consumer ... error on store update from snapshot entry: old update ignored."

Would it be possible to know what is blocking the subscription from the stream?

More info: A three-server JetStream cluster is used for these runs. Server configuration for one of the servers in the cluster:

listen: 0.0.0.0:4222

server_name: n1
jetstream: true

accounts: {
    SYS: {
        users: [
            { user: admin, password: xxxxx }
        ]
    },
}

system_account: SYS

cluster {
  listen: xx.xx.xxx.xx:4248
  name: test-cluster

  routes: [
    nats-route://xx.xx.xxx.xx:4248
    nats-route://xx.xx.xxx.xx:4248
    nats-route://xx.xx.xxx.xx:4248
  ]
}

jetstream: {
    max_memory_store: 1GB
}
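Aside (my observation, not from the thread): `jetstream` is defined twice in the config above, once as `jetstream: true` near the top and again as a `jetstream { ... }` block at the bottom. Merging them into a single block removes any ambiguity about which definition the server honors (a sketch keeping the same values):

```
jetstream: {
    max_memory_store: 1GB
}
```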

Stream configuration:

{
  "config": {
    "name": "test",
    "subjects": [
      "test.files.request"
    ],
    "retention": "workqueue",
    "max_consumers": -1,
    "max_msgs_per_subject": -1,
    "max_msgs": 10240,
    "max_bytes": -1,
    "max_age": 0,
    "max_msg_size": -1,
    "storage": "memory",
    "discard": "new",
    "num_replicas": 3,
    "duplicate_window": 120000000000,
    "sealed": false,
    "deny_delete": true,
    "deny_purge": true,
    "allo_rollup_hdrs": false,
    "allow_direct": true,
    "mirror_direct": false
  }
}
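For context, my own reading of the config above (not a statement from the thread): with `retention: workqueue` and `discard: new`, once unacked messages accumulate up to `max_msgs`, the stream rejects new publishes, which can look like "blocking" from the outside. A rough Python sketch of this discard-new behavior on a bounded store (illustrative only, not NATS code):

```python
# Illustrative sketch (not NATS source code): how a bounded stream with
# discard=new behaves once consumers stop draining it.

class BoundedStream:
    def __init__(self, max_msgs):
        self.max_msgs = max_msgs
        self.msgs = []

    def publish(self, msg):
        # discard=new: reject the *incoming* message when the stream is full
        # (discard=old would instead evict the oldest message).
        if len(self.msgs) >= self.max_msgs:
            return False  # publisher sees an error; stream appears "blocked"
        self.msgs.append(msg)
        return True

    def ack(self):
        # workqueue retention: a message is removed once it is acked
        if self.msgs:
            self.msgs.pop(0)

stream = BoundedStream(max_msgs=3)
print([stream.publish(i) for i in range(5)])  # [True, True, True, False, False]
stream.ack()                                  # a consumer acks one message...
print(stream.publish(99))                     # ...and publishes succeed again: True
```

If consumers stall (for example, during the RAFT resets in the warnings below), this model would predict exactly the accumulation-then-rejection pattern described later in the thread.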

Consumer configuration for the stream:

{
    "durable_name": "test-files-consumer",
    "ack_policy": "explicit",
    "ack_wait": 10000,
    "deliver_policy": "all",
    "filter_subject": "test.files.request",
    "max_ack_pending": 1000,
    "max_deliver": -1,
    "max_waiting": 512,
    "replay_policy": "instant",
    "max_batch": 1,
    "num_replicas": 0
}
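A note on units (my assumption, based on the JetStream API encoding durations as nanosecond integers; worth verifying against the nats.c docs): `duplicate_window` of 120000000000 works out to 120 s, but if `ack_wait` of 10000 is also in nanoseconds it would be only 10 microseconds, which would cause near-instant redeliveries and may be worth double-checking. A quick sanity check:

```python
# Sanity-check the durations in the consumer/stream configs, assuming
# (as in the JetStream API) they are nanosecond integers.
NS_PER_SEC = 1_000_000_000

duplicate_window_ns = 120_000_000_000
ack_wait_ns = 10_000

print(duplicate_window_ns / NS_PER_SEC)  # 120.0 (seconds)
print(ack_wait_ns / NS_PER_SEC)          # 1e-05 (seconds, i.e. 10 microseconds)
```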
Ann-Geo commented 1 year ago

This issue seems to be present even if max_msgs is increased 10x. Over multiple runs, the messages in the streams are not being fully consumed. Unconsumed messages accumulate in the streams until they reach the stream's max_msgs limit, which blocks subscribers from receiving messages. I am also seeing multiple warnings related to the consumer of the blocked stream:

Jetstream consumer .... is not current
RAFT .... Resetting WAL state
Consumer ... error on store update from snapshot entry: old update ignored
RAFT ... 20000 append entries pending

However, I am not sure whether any of these are related to the stream-blocking behaviour I am observing. Sometimes triggering a cluster step-down for the blocked stream with the nats CLI tool clears the blockage, but all messages in the stream are lost. This also does not solve the problem permanently, and the blocking reoccurs in future runs. In addition, once the stream is blocked it is not possible to get the consumer info, either via the js_GetConsumerInfo API (it times out) or via the nats command-line tool (context deadline exceeded error).

Is there any resolution for this issue?

derekcollison commented 1 year ago

Will loop in @levb to take a look since it is the C client.

Ann-Geo commented 1 year ago

This problem appears to occur only when the stream storage is memory. When configured to use the file store, the consumer/stream shows no errors or blocking behaviour.

Any updates on this issue ?

derekcollison commented 1 year ago

Have you tried the latest server version, 2.9.19?