nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0
15.91k stars 1.41k forks source link

NATS complaining > "JetStream failed to store a msg on stream" subscriber getting no message #4192

Closed meetme2meat closed 1 year ago

meetme2meat commented 1 year ago

Our production system has been idle state because the nats message that was sent to nats is no longer getting retrieved.

I certainly have for sure, 829 messages (observe that on our application logs) in one of the topics(i.e content.ingest)

Following how I published

var defaultJsContext nats.JetStreamContext
func Publish(topic string, msg []byte, opts ...nats.PubOpt) (*nats.PubAck, error) {
    zap.L().Debug("--Publish", zap.String("topic", topic), zap.ByteString("message", msg)) // 829 log entry
    return defaultJsContext.Publish(topic, msg, opts...)
}

and nats did not report any error when publishing these messages yet when we looked at

nats stream subjects sugarbox
No subjects found matching >
nats consumer info
? Select a Stream sugarbox
? Select a Consumer ingestConsumer
Information for Consumer sugarbox > ingestConsumer created 2023-05-23T08:50:34Z

Configuration:

                Name: ingestConsumer
    Delivery Subject: content.ingest.GNY17JR07CAWA8X4806BZE
      Filter Subject: content.ingest
      Deliver Policy: All
          Ack Policy: Explicit
            Ack Wait: 5m0s
       Replay Policy: Instant
  Maximum Deliveries: 5
     Max Ack Pending: 65,536
        Flow Control: false

State:

     Last Delivered Message: Consumer sequence: 825 Stream sequence: 871
     Acknowledgment floor: Consumer sequence: 825 Stream sequence: 871
     Outstanding Acks: 0 out of maximum 65,536
     Redelivered Messages: 0
     Unprocessed Messages: 0
     Active Interest: No interest ## this no interest now because I have stop the consumer for now
nats stream ls sugarbox
╭───────────────────────────────────────────────────────────────────────────────────╮
│                                      Streams                                      │
├───────────┬─────────────┬─────────────────────┬──────────┬─────────┬──────────────┤
│ Name      │ Description │ Created             │ Messages │ Size    │ Last Message │
├───────────┼─────────────┼─────────────────────┼──────────┼─────────┼──────────────┤
│ sugarbox  │             │ 2023-05-23 08:50:34 │ 710      │ 2.2 MiB │ 7h41m28s     │
╰───────────┴─────────────┴─────────────────────┴──────────┴─────────┴──────────────╯

Clear there are 710 messages(less than 829 though) but

Using a nats-cli client to subscribe to the given topic yields nothing.

nats subscribe --stream=sugarbox --all content.ingest
14:27:04 Subscribing to JetStream Stream holding messages with subject `content.ingest` starting with the first message received

In fact I subscribed to all the topics(one at a time) with --all flag not one of them returned any message.

We restarted Nats couple of times and we saw every time (hoping it will solve the problem)

ay 24 06:55:11 edge nats-server[500]: [500] 2023/05/24 06:55:11.536532 [INF]
May 24 06:55:11 edge nats-server[500]: [500] 2023/05/24 06:55:11.536534 [INF] ---------------- JETSTREAM ----------------
May 24 06:55:11 edge nats-server[500]: [500] 2023/05/24 06:55:11.536540 [INF]   Max Memory:      5.75 GB
May 24 06:55:11 edge nats-server[500]: [500] 2023/05/24 06:55:11.536542 [INF]   Max Storage:     31.45 GB
May 24 06:55:11 edge nats-server[500]: [500] 2023/05/24 06:55:11.536544 [INF]   Store Directory: "/var/lib/nats-server/jetstream"
May 24 06:55:11 edge nats-server[500]: [500] 2023/05/24 06:55:11.536545 [INF] -------------------------------------------
May 24 06:55:11 edge nats-server[500]: [500] 2023/05/24 06:55:11.541652 [INF]   Starting restore for stream '$G > sugarbox'
May 24 06:55:11 edge nats-server[500]: [500] 2023/05/24 06:55:11.616847 [INF]   Restored 709 messages for stream '$G > sugarbox'
May 24 06:55:11 edge nats-server[500]: [500] 2023/05/24 06:55:11.618532 [INF]   Recovering 23 consumers for stream - '$G > sugarbox'
May 24 06:55:11 edge nats-server[500]: [500] 2023

We also observed these messages (from yesterday's logs)

ream '$G > sugarbox': sequence numbers for cache load did not match, 2 vs 1
May 23 09:09:10 edge nats-server[26060]: [26060] 2023/05/23 09:09:10.735894 [ERR] JetStream failed to store a msg on stream '$G > sugarbox': sequence numbers for cache load did not match, 2 vs 1
May 23 09:09:20 edge nats-server[26060]: [26060] 2023/05/23 09:09:20.936855 [ERR] JetStream failed to store a msg on stream '$G > sugarbox': sequence numbers for cache load did not match, 2 vs 1
May 23 09:09:31 edge nats-server[26060]: [26060] 2023/05/23 09:09:31.135958 [ERR] JetStream failed to store a msg on stream '$G > sugarbox': sequence numbers for cache load did not match, 2 vs 1
May 23 09:09:41 edge nats-server[26060]: [26060] 2023/05/23 09:09:41.322481 [ERR] JetStream failed to store a msg on stream '$G > sugarbox': sequence numbers for cache load did not match, 2 vs 1
May 23 09:09:51 edge nats-server[26060]: [26060] 2023/05/23 09:09:51.516638 [ERR] JetStream failed to store a msg on stream '$G > sugarbox': sequence numbers for cache load did not match, 2 vs 1
May 23 09:11:07 edge nats-server[26060]: [26060] 2023/05/23 09:11:07.964030 [ERR] JetStream failed to store a msg on stream '$G > sugarbox': malformed or corrupt message
May 23 09:11:07 edge nats-server[26060]: [26060] 2023/05/23 09:11:07.995975 [ERR] Received an error looking up message for consumer: malformed or corrupt message
May 23 09:11:31 edge nats-server[26060]: [26060] 2023/05/23 09:11:31.653378 [ERR] Received an error looking up message for consumer: malformed or corrupt message
May 23 09:12:48 edge nats-server[26060]: [26060] 2023/05/23 09:12:48.981705 [ERR] JetStream failed to store a msg on stream '$G > sugarbox': sequence numbers for cache load did not match, 1 vs 13
May 23 09:13:17 edge nats-server[26060]: [26060] 2023/05/23 09:13:17.073034 [ERR] JetStream failed to store a msg on stream '$G > sugarbox': sequence numbers for cache load did not match, 1 vs 13
May 23 09:13:45 edge nats-server[26060]: [26060] 2023/05/23 09:13:45.188708 [ERR] JetStream failed to store a msg on stream '$G > sugarbox': sequence numbers for cache load did not match, 1 vs 13
May 23 09:14:13 edge nats-server[26060]: [26060] 2023/05/23 09:14:13.269518 [ERR] JetStream failed to store a msg on stream '$G > sugarbox': sequence numbers for cache load did not match, 1 vs 13
May 23 10:00:58 edge nats-server[26060]: [26060] 2023/05/23 10:00:58.070405 [ERR] Received an error looking up message for consumer: malformed or corrupt message

Lastly we have retention of 3 days the message was published to nats yesterday we still have time before the retention happens yet we see 5 messages as deleted

nats stream info
? Select a Stream sugarbox
Information for Stream sugarbox created 2023-05-23 08:50:34

             Subjects: asset.available, asset.availableFailed, asset.downloaded, asset.failed, availabilty.checker, content.available, content.disable, content.ingest, license.available, license.failed, meta.available, meta.deleted, meta.failed, reachability.changed
             Replicas: 1
              Storage: File

Options:

            Retention: Limits
     Acknowledgements: true
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: unlimited
        Maximum Bytes: unlimited
          Maximum Age: 3d0h0m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited

State:

             Messages: 710
                Bytes: 2.2 MiB
             FirstSeq: 13 @ 2023-05-23T09:11:31 UTC
              LastSeq: 727 @ 2023-05-24T06:55:52 UTC
     Deleted Messages: 5
     Active Consumers: 20

Any help as to why this happened here we have our production isn't working for the last 24 hours.

Details

The Nats is **Single Server**
SDD is internal, not external.
cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster

Linux c946a9 4.19.0-11-amd64 #1 SMP Debian 4.19.146-1 (2020-09-17) x86_64 GNU/Linux
nats-server: v2.9.15

We started the nats using systemctl unit file that start nats like this

/usr/bin/nats-server -js -sd /var/lib/nats-server/
wallyqs commented 1 year ago

Sorry for the inconvenience in your setup, this issue with the server in standalone mode has been fixed since v2.9.16 here: https://github.com/nats-io/nats-server/issues/2993

ptanh2k commented 10 months ago

Hey @wallyqs, I'm using JetStream with a cluster of 3 nodes and I'm having the same issue, can you detail the fix mentioned above?

I'm currently using nats-server 2.9.19