nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0
15.27k stars 1.37k forks source link

NATS orphened streams #4410

Open kamilhalat opened 11 months ago

kamilhalat commented 11 months ago

Defect

On AKS restarts (not every time) all on startup all streams are getting flagged as orphend and are deleted:

nats-0 logs:

[7] 2023/08/08 05:50:52.363978 [INF] Starting nats-server
[7] 2023/08/08 05:50:52.364008 [INF]   Version:  2.9.20
[7] 2023/08/08 05:50:52.364010 [INF]   Git:      [97dd7cb]
[7] 2023/08/08 05:50:52.364012 [INF]   Cluster:  nats
[7] 2023/08/08 05:50:52.364013 [INF]   Name:     nats-0
[7] 2023/08/08 05:50:52.364017 [INF]   Node:     S1Nunr6R
[7] 2023/08/08 05:50:52.364019 [INF]   ID:       NCXYJAK45SXSLZUJ6NBAPB73W4KDPJRM2KJ6HHZ2SLAG4OW7ILWBPZQ2
[7] 2023/08/08 05:50:52.364048 [INF] Using configuration file: /etc/nats-config/nats.conf
[7] 2023/08/08 05:50:52.364557 [INF] Starting http monitor on 0.0.0.0:8222
[7] 2023/08/08 05:50:52.365150 [INF] Starting JetStream
[7] 2023/08/08 05:50:52.383185 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[7] 2023/08/08 05:50:52.383192 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[7] 2023/08/08 05:50:52.383194 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[7] 2023/08/08 05:50:52.383196 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[7] 2023/08/08 05:50:52.383197 [INF] 
[7] 2023/08/08 05:50:52.383199 [INF]          https://docs.nats.io/jetstream
[7] 2023/08/08 05:50:52.383200 [INF] 
[7] 2023/08/08 05:50:52.383202 [INF] ---------------- JETSTREAM ----------------
[7] 2023/08/08 05:50:52.383209 [INF]   Max Memory:      1.00 GB
[7] 2023/08/08 05:50:52.383211 [INF]   Max Storage:     8.00 GB
[7] 2023/08/08 05:50:52.383213 [INF]   Store Directory: "/data/jetstream"
[7] 2023/08/08 05:50:52.383215 [INF] -------------------------------------------
[7] 2023/08/08 05:50:52.392276 [INF]   Starting restore for stream 'aos > STREAM_5'
[7] 2023/08/08 05:50:52.583128 [INF]   Restored 336,677 messages for stream 'aos > STREAM_5'
[7] 2023/08/08 05:50:52.583774 [INF]   Starting restore for stream 'aos > STREAM_6'
[7] 2023/08/08 05:50:52.584267 [INF]   Restored 0 messages for stream 'aos > STREAM_6'
[7] 2023/08/08 05:50:52.584409 [INF]   Recovering 8 consumers for stream - 'aos > STREAM_5'
[7] 2023/08/08 05:50:52.617535 [INF]   Recovering 1 consumers for stream - 'aos > STREAM_6'
[7] 2023/08/08 05:50:52.622102 [INF] Starting JetStream cluster
[7] 2023/08/08 05:50:52.622188 [INF] Creating JetStream metadata controller
[7] 2023/08/08 05:50:52.632756 [INF] JetStream cluster recovering state
[7] 2023/08/08 05:50:52.636515 [INF] Listening for leafnode connections on 0.0.0.0:7422
[7] 2023/08/08 05:50:52.636898 [INF] Listening for client connections on 0.0.0.0:4222
[7] 2023/08/08 05:50:52.637083 [INF] Server is ready
[7] 2023/08/08 05:50:52.637369 [INF] Cluster name is nats
[7] 2023/08/08 05:50:52.637454 [INF] Listening for route connections on 0.0.0.0:6222
[7] 2023/08/08 05:50:52.638287 [INF] JetStream cluster new stream leader for 'aos > STREAM_6'
[7] 2023/08/08 05:50:52.638360 [INF] JetStream cluster new consumer leader for 'aos > STREAM_6 > DEVELOPMENT-STREAM_6_CONSUMER-STREAM_6-devices'
[7] 2023/08/08 05:50:52.641160 [INF] y.y.y.y:6222 - rid:31 - Route connection created
[7] 2023/08/08 05:50:52.650452 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-2.nats.instance-01.svc.cluster.local": lookup nats-2.nats.instance-01.svc.cluster.local on 172.17.0.4:53: no such host
[7] 2023/08/08 05:50:52.674291 [INF] JetStream cluster new stream leader for 'aos > STREAM_5'
[7] 2023/08/08 05:50:52.674351 [INF] JetStream cluster new consumer leader for 'aos > STREAM_5 > DEVELOPMENT-STREAM_5-orders_priority'
[7] 2023/08/08 05:50:52.674508 [INF] JetStream cluster new consumer leader for 'aos > STREAM_5 > DEVELOPMENT-STREAM_5-orders'
[7] 2023/08/08 05:50:52.716988 [INF] JetStream cluster new consumer leader for 'aos > STREAM_5 > DEVELOPMENT-STREAM_5_STATISTICS_CONSUMER-STREAM_5-DEVELOPMENT-statistics'
[7] 2023/08/08 05:50:52.735198 [INF] JetStream cluster new consumer leader for 'aos > STREAM_5 > DEVELOPMENT-STREAM_5_STATISTICS_CONSUMER_PRIORITY-STREAM_5-DEVELOPMENT-statistics_priority'
[7] 2023/08/08 05:50:52.735334 [INF] JetStream cluster new consumer leader for 'aos > STREAM_5 > DEVELOPMENT-STREAM_5_COUNTERS_CONSUMER-STREAM_5-DEVELOPMENT-counters'
[7] 2023/08/08 05:50:52.736166 [INF] JetStream cluster new consumer leader for 'aos > STREAM_5 > DEVELOPMENT-STREAM_5_COUNTERS_CONSUMER_PRIORITY-STREAM_5-DEVELOPMENT-counters_priority'
[7] 2023/08/08 05:50:52.736305 [INF] JetStream cluster new consumer leader for 'aos > STREAM_5 > DEVELOPMENT-STREAM_5_STATES_CONSUMER-STREAM_5-DEVELOPMENT-states'
[7] 2023/08/08 05:50:52.737659 [WRN] Waiting for routing to be established...
[7] 2023/08/08 05:50:52.781446 [INF] JetStream cluster new consumer leader for 'aos > STREAM_5 > DEVELOPMENT-STREAM_5_STATES_CONSUMER_PRIORITY-STREAM_5-DEVELOPMENT-states_priority'
[7] 2023/08/08 05:50:52.781562 [WRN] JetStream has not established contact with a meta leader
[7] 2023/08/08 05:50:54.687630 [INF] z.z.z.z:6222 - rid:32 - Route connection created
[7] 2023/08/08 05:50:55.223617 [INF] z.z.z.z:55150 - rid:33 - Route connection created
[7] 2023/08/08 05:50:55.223844 [INF] z.z.z.z:55150 - rid:33 - Router connection closed: Duplicate Route
[7] 2023/08/08 05:50:55.475617 [INF] y.y.y.y:41618 - rid:34 - Route connection created
[7] 2023/08/08 05:50:55.475973 [INF] y.y.y.y:41618 - rid:34 - Router connection closed: Duplicate Route
[7] 2023/08/08 05:51:02.097690 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[7] 2023/08/08 05:51:04.079826 [WRN] RAFT [S1Nunr6R - _meta_] Resetting WAL state
[7] 2023/08/08 05:51:04.080232 [INF] JetStream cluster new metadata leader: nats-1/nats

nats-1 logs:

[7] 2023/08/08 05:50:42.214139 [INF] Starting nats-server
[7] 2023/08/08 05:50:42.214194 [INF]   Version:  2.9.20
[7] 2023/08/08 05:50:42.214197 [INF]   Git:      [97dd7cb]
[7] 2023/08/08 05:50:42.214199 [INF]   Cluster:  nats
[7] 2023/08/08 05:50:42.214201 [INF]   Name:     nats-1
[7] 2023/08/08 05:50:42.214204 [INF]   Node:     yrzKKRBu
[7] 2023/08/08 05:50:42.214205 [INF]   ID:       NCU5MCL3LGOGISMGTD3G77KSD4Q72FLFTEF7MA7CC6OQXNMNL4OGU6XD
[7] 2023/08/08 05:50:42.214248 [INF] Using configuration file: /etc/nats-config/nats.conf
[7] 2023/08/08 05:50:42.214696 [INF] Starting http monitor on 0.0.0.0:8222
[7] 2023/08/08 05:50:42.214737 [INF] Starting JetStream
[7] 2023/08/08 05:50:42.224747 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[7] 2023/08/08 05:50:42.224755 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[7] 2023/08/08 05:50:42.224757 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[7] 2023/08/08 05:50:42.224758 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[7] 2023/08/08 05:50:42.224760 [INF] 
[7] 2023/08/08 05:50:42.224761 [INF]          https://docs.nats.io/jetstream
[7] 2023/08/08 05:50:42.224763 [INF] 
[7] 2023/08/08 05:50:42.224764 [INF] ---------------- JETSTREAM ----------------
[7] 2023/08/08 05:50:42.224771 [INF]   Max Memory:      1.00 GB
[7] 2023/08/08 05:50:42.224774 [INF]   Max Storage:     8.00 GB
[7] 2023/08/08 05:50:42.224775 [INF]   Store Directory: "/data/jetstream"
[7] 2023/08/08 05:50:42.224777 [INF] -------------------------------------------
[7] 2023/08/08 05:50:42.236133 [INF]   Starting restore for stream 'aos > STREAM_4'
[7] 2023/08/08 05:50:42.262829 [INF]   Restored 21 messages for stream 'aos > STREAM_4'
[7] 2023/08/08 05:50:42.265431 [INF]   Recovering 1 consumers for stream - 'aos > STREAM_4'
[7] 2023/08/08 05:50:42.286014 [INF] Starting JetStream cluster
[7] 2023/08/08 05:50:42.286025 [INF] Creating JetStream metadata controller
[7] 2023/08/08 05:50:42.300253 [INF] JetStream cluster recovering state
[7] 2023/08/08 05:50:42.300654 [WRN] RAFT [yrzKKRBu - _meta_] Snapshot corrupt, too short
[7] 2023/08/08 05:50:42.300774 [INF] Listening for leafnode connections on 0.0.0.0:7422
[7] 2023/08/08 05:50:42.301058 [WRN] Detected orphaned stream 'aos > STREAM_4', will cleanup
[7] 2023/08/08 05:50:42.301204 [INF] Listening for client connections on 0.0.0.0:4222
[7] 2023/08/08 05:50:42.301308 [INF] Server is ready
[7] 2023/08/08 05:50:42.301464 [INF] Cluster name is nats
[7] 2023/08/08 05:50:42.301499 [INF] Listening for route connections on 0.0.0.0:6222
[7] 2023/08/08 05:50:42.302833 [WRN] JetStream has not established contact with a meta leader
[7] 2023/08/08 05:50:42.311471 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-0.nats.instance-01.svc.cluster.local": lookup nats-0.nats.instance-01.svc.cluster.local on 172.17.0.4:53: no such host
[7] 2023/08/08 05:50:42.312526 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-2.nats.instance-01.svc.cluster.local": lookup nats-2.nats.instance-01.svc.cluster.local on 172.17.0.4:53: no such host
[7] 2023/08/08 05:50:42.402904 [WRN] Waiting for routing to be established...
[7] 2023/08/08 05:50:44.309333 [WRN] Waiting for routing to be established...
[7] 2023/08/08 05:50:46.218036 [WRN] Waiting for routing to be established...
[7] 2023/08/08 05:50:48.231934 [WRN] Waiting for routing to be established...
[7] 2023/08/08 05:50:49.166865 [INF] z.z.z.z:52388 - rid:12 - Route connection created
[7] 2023/08/08 05:50:49.261008 [WRN] RAFT [yrzKKRBu - _meta_] Wrong index, ae is &{leader:yrzKKRBu term:22 commit:0 pterm:0 pindex:0 entries: 1}, index stored was 165, n.pindex is 0, will reset
[7] 2023/08/08 05:50:49.261023 [WRN] RAFT [yrzKKRBu - _meta_] Resetting WAL state
[7] 2023/08/08 05:50:49.261311 [INF] Self is new JetStream cluster metadata leader
[7] 2023/08/08 05:50:49.261380 [INF] JetStream cluster no metadata leader
[7] 2023/08/08 05:50:52.640852 [INF] x.x.x.x:40930 - rid:13 - Route connection created
[7] 2023/08/08 05:50:55.475409 [INF] x.x.x.x:6222 - rid:14 - Route connection created
[7] 2023/08/08 05:50:55.476141 [INF] x.x.x.x:6222 - rid:14 - Router connection closed: Duplicate Route
[7] 2023/08/08 05:50:55.483065 [INF] z.z.z.z:6222 - rid:15 - Route connection created
[7] 2023/08/08 05:50:55.483344 [INF] z.z.z.z:6222 - rid:15 - Router connection closed: Duplicate Route
[7] 2023/08/08 05:50:58.699729 [WRN] Healthcheck failed: "JetStream has not established contact with a meta leader"
[7] 2023/08/08 05:51:04.077488 [INF] Self is new JetStream cluster metadata leader

nats-2 logs:

[8] 2023/08/08 05:50:47.958138 [INF] Starting nats-server
[8] 2023/08/08 05:50:47.958210 [INF]   Version:  2.9.20
[8] 2023/08/08 05:50:47.958224 [INF]   Git:      [97dd7cb]
[8] 2023/08/08 05:50:47.958236 [INF]   Cluster:  nats
[8] 2023/08/08 05:50:47.958339 [INF]   Name:     nats-2
[8] 2023/08/08 05:50:47.958363 [INF]   Node:     cnrtt3eg
[8] 2023/08/08 05:50:47.958435 [INF]   ID:       NDGY2BMX5KUGFDNGO5PJFBBQGFQMRMBCJYXS4VSO2AOLBZIP2WS6FGCO
[8] 2023/08/08 05:50:47.958547 [INF] Using configuration file: /etc/nats-config/nats.conf
[8] 2023/08/08 05:50:47.959200 [INF] Starting http monitor on 0.0.0.0:8222
[8] 2023/08/08 05:50:47.959235 [INF] Starting JetStream
[8] 2023/08/08 05:50:47.978305 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[8] 2023/08/08 05:50:47.978318 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[8] 2023/08/08 05:50:47.978319 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[8] 2023/08/08 05:50:47.978321 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[8] 2023/08/08 05:50:47.978323 [INF] 
[8] 2023/08/08 05:50:47.978325 [INF]          https://docs.nats.io/jetstream
[8] 2023/08/08 05:50:47.978326 [INF] 
[8] 2023/08/08 05:50:47.978328 [INF] ---------------- JETSTREAM ----------------
[8] 2023/08/08 05:50:47.978335 [INF]   Max Memory:      1.00 GB
[8] 2023/08/08 05:50:47.978338 [INF]   Max Storage:     8.00 GB
[8] 2023/08/08 05:50:47.978340 [INF]   Store Directory: "/data/jetstream"
[8] 2023/08/08 05:50:47.978342 [INF] -------------------------------------------
[8] 2023/08/08 05:50:47.990926 [INF]   Starting restore for stream 'aos > STREAM_1'
[8] 2023/08/08 05:50:48.786136 [INF]   Restored 1,963,810 messages for stream 'aos > STREAM_1'
[8] 2023/08/08 05:50:48.791476 [INF]   Starting restore for stream 'aos > STREAM_2'
[8] 2023/08/08 05:50:49.089430 [INF]   Restored 668,757 messages for stream 'aos > STREAM_2'
[8] 2023/08/08 05:50:49.093096 [INF]   Starting restore for stream 'aos > STREAM_3'
[8] 2023/08/08 05:50:49.099551 [INF]   Restored 0 messages for stream 'aos > STREAM_3'
[8] 2023/08/08 05:50:49.101731 [INF]   Recovering 4 consumers for stream - 'aos > STREAM_1'
[8] 2023/08/08 05:50:49.130984 [INF]   Recovering 4 consumers for stream - 'aos > STREAM_2'
[8] 2023/08/08 05:50:49.152719 [INF] Starting JetStream cluster
[8] 2023/08/08 05:50:49.152750 [INF] Creating JetStream metadata controller
[8] 2023/08/08 05:50:49.158817 [INF] JetStream cluster recovering state
[8] 2023/08/08 05:50:49.159252 [WRN] RAFT [cnrtt3eg - _meta_] Snapshot corrupt, too short
[8] 2023/08/08 05:50:49.159402 [INF] Listening for leafnode connections on 0.0.0.0:7422
[8] 2023/08/08 05:50:49.159489 [WRN] Detected orphaned stream 'aos > STREAM_1', will cleanup
[8] 2023/08/08 05:50:49.159822 [INF] Listening for client connections on 0.0.0.0:4222
[8] 2023/08/08 05:50:49.159934 [INF] Server is ready
[8] 2023/08/08 05:50:49.161391 [INF] Cluster name is nats
[8] 2023/08/08 05:50:49.161421 [INF] Listening for route connections on 0.0.0.0:6222
[8] 2023/08/08 05:50:49.164210 [WRN] Detected orphaned stream 'aos > STREAM_2', will cleanup
[8] 2023/08/08 05:50:49.165881 [WRN] Detected orphaned stream 'aos > STREAM_3', will cleanup
[8] 2023/08/08 05:50:49.166380 [WRN] JetStream has not established contact with a meta leader
[8] 2023/08/08 05:50:49.166627 [INF] y.y.y.y:6222 - rid:32 - Route connection created
[8] 2023/08/08 05:50:49.179328 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-0.nats.instance-01.svc.cluster.local": lookup nats-0.nats.instance-01.svc.cluster.local on 172.17.0.4:53: no such host
[8] 2023/08/08 05:50:49.260449 [WRN] Waiting for routing to be established...
[8] 2023/08/08 05:50:49.262042 [INF] JetStream cluster new metadata leader: nats-1/nats
[8] 2023/08/08 05:50:54.687850 [INF] x.x.x.x:37326 - rid:33 - Route connection created
[8] 2023/08/08 05:50:55.223404 [INF] x.x.x.x:6222 - rid:34 - Route connection created
[8] 2023/08/08 05:50:55.223793 [INF] x.x.x.x:6222 - rid:34 - Router connection closed: Duplicate Route
[8] 2023/08/08 05:50:55.483236 [INF] y.y.y.y:33866 - rid:35 - Route connection created
[8] 2023/08/08 05:50:55.483556 [INF] y.y.y.y:33866 - rid:35 - Router connection closed: Duplicate Route
[8] 2023/08/08 05:50:57.692880 [WRN] Healthcheck failed: "JetStream is not current with the meta leader"

IPs x.x.x.x, y.y.y.y and z.z.z.z was change for anonymity

Make sure that these boxes are checked before submitting your issue -- thank you!

sadly still waiting for it to happen on NATS server with trace and debug enabled

Versions of nats-server and affected client libraries used:

Lates tested version: 2.9.20-alpine Fist encounter: 2.9.16-alpine

OS/Container environment:

AKS, helm values:

nats:
  image:
    tag: 2.9.20-alpine

  jetstream:
    enabled: true
    memStorage:
      enabled: true
      size: 1Gi

    fileStorage:
      enabled: true
      size: 8Gi
      storageDirectory: /data/
      storageClassName: default

auth:
  enabled: true
  systemAccount: SYS
  basic:
    accounts: 
      aos:
        imports: []
        exports: []
        jetstream: enabled
        users:
        - user: acc
          pass: sys
      SYS:
        imports: []
        exports: []
        users:
        - user: sys
          pass: sys
cluster:
  enabled: true
  replicas: 3
  noAdvertise: true

leafnodes:
  enabled: true
  port: 7422
  noAdvertise: true

natsbox:
  enabled: true

Steps or code to reproduce the issue:

     At this moment i'm not sure how to reproduce it, it happens only on cluster restart and not every time, once head it happen within two weeks once in three days, wasn't able to reproduce it on clean not used NATS-server.

Expected result:

      Streams don't get flagged as orphend and deleted on cluster restart

Actual result:

     Streams are getting flagged as orphaned and deleted for unexpected reasons on cluster restart
derekcollison commented 11 months ago

Might suggest upgrading to 2.9.21.

kamilhalat commented 10 months ago

Hi, sadly it happened again on version 2.9.21, don't have full log at the moment and i have only 50 lines from every node:

node-0:

[7] 2023/09/14 16:32:49.460496 [INF] Starting nats-server
[7] 2023/09/14 16:32:49.460528 [INF]   Version:  2.9.21
[7] 2023/09/14 16:32:49.460530 [INF]   Git:      [b2e7725]
[7] 2023/09/14 16:32:49.460532 [DBG]   Go build: go1.19.12
[7] 2023/09/14 16:32:49.460533 [INF]   Cluster:  nats
[7] 2023/09/14 16:32:49.460535 [INF]   Name:     nats-0
[7] 2023/09/14 16:32:49.460539 [INF]   Node:     S1Nunr6R
[7] 2023/09/14 16:32:49.460541 [INF]   ID:       NDOBT43MQZYHTSUG7US2XNLTJCNELF3C3YTVWIVQ6G76D6JIQ27JRRO7
[7] 2023/09/14 16:32:49.460873 [INF] Using configuration file: /etc/nats-config/nats.conf
[7] 2023/09/14 16:32:49.461533 [INF] Starting http monitor on 0.0.0.0:8222
[7] 2023/09/14 16:32:49.461611 [INF] Starting JetStream
[7] 2023/09/14 16:32:49.467808 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[7] 2023/09/14 16:32:49.467821 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[7] 2023/09/14 16:32:49.467823 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[7] 2023/09/14 16:32:49.467825 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[7] 2023/09/14 16:32:49.467826 [INF] 
[7] 2023/09/14 16:32:49.467828 [INF]          https://docs.nats.io/jetstream
[7] 2023/09/14 16:32:49.467829 [INF] 
[7] 2023/09/14 16:32:49.467831 [INF] ---------------- JETSTREAM ----------------
[7] 2023/09/14 16:32:49.467915 [INF]   Max Memory:      1.00 GB
[7] 2023/09/14 16:32:49.467924 [INF]   Max Storage:     8.00 GB
[7] 2023/09/14 16:32:49.467926 [INF]   Store Directory: "/data/jetstream"
[7] 2023/09/14 16:32:49.467928 [INF] -------------------------------------------
[7] 2023/09/14 16:32:49.468086 [DBG]   Exports:
[7] 2023/09/14 16:32:49.468093 [DBG]      $JS.API.>
[7] 2023/09/14 16:32:49.468286 [DBG] Enabled JetStream for account "test"
[7] 2023/09/14 16:32:49.468296 [DBG]   Max Memory:      -1 B
[7] 2023/09/14 16:32:49.468298 [DBG]   Max Storage:     -1 B
[7] 2023/09/14 16:32:49.468313 [DBG] Recovering JetStream state for account "test"
[7] 2023/09/14 16:32:49.474190 [INF]   Starting restore for stream 'test > STREAM1'
[7] 2023/09/14 16:32:49.479714 [INF]   Restored 0 messages for stream 'test > STREAM1'
[7] 2023/09/14 16:32:49.480039 [INF]   Starting restore for stream 'test > STREAM2'
[7] 2023/09/14 16:32:49.480473 [INF]   Restored 0 messages for stream 'test > STREAM2'
[7] 2023/09/14 16:32:49.480684 [DBG] JetStream state for account "test" recovered
[7] 2023/09/14 16:32:49.480708 [INF] Starting JetStream cluster
[7] 2023/09/14 16:32:49.480715 [DBG] JetStream cluster checking for stable cluster name and peers
[7] 2023/09/14 16:32:49.480731 [INF] Creating JetStream metadata controller
[7] 2023/09/14 16:32:49.490203 [INF] JetStream cluster recovering state
[7] 2023/09/14 16:32:49.495625 [WRN] RAFT [S1Nunr6R - _meta_] Snapshot corrupt, checksums did not match
[7] 2023/09/14 16:32:49.495691 [DBG] RAFT [S1Nunr6R - _meta_] Started
[7] 2023/09/14 16:32:49.495755 [INF] Listening for leafnode connections on 0.0.0.0:7422
[7] 2023/09/14 16:32:49.495760 [DBG] Get non local IPs for "0.0.0.0"
[7] 2023/09/14 16:32:49.495885 [DBG]   ip=x.x.x.x
[7] 2023/09/14 16:32:49.496029 [INF] Listening for client connections on 0.0.0.0:4222
[7] 2023/09/14 16:32:49.496033 [DBG] Get non local IPs for "0.0.0.0"
[7] 2023/09/14 16:32:49.496129 [DBG]   ip=x.x.x.x
[7] 2023/09/14 16:32:49.496135 [INF] Server is ready
[7] 2023/09/14 16:32:49.496327 [DBG] maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined
[7] 2023/09/14 16:32:49.496418 [DBG] Starting metadata monitor
[7] 2023/09/14 16:32:49.496445 [DBG] Recovered JetStream cluster metadata

node-1:

[7] 2023/09/14 16:32:51.314526 [INF] Starting nats-server
[7] 2023/09/14 16:32:51.314569 [INF]   Version:  2.9.21
[7] 2023/09/14 16:32:51.314571 [INF]   Git:      [b2e7725]
[7] 2023/09/14 16:32:51.314574 [DBG]   Go build: go1.19.12
[7] 2023/09/14 16:32:51.314575 [INF]   Cluster:  nats
[7] 2023/09/14 16:32:51.314577 [INF]   Name:     nats-1
[7] 2023/09/14 16:32:51.314580 [INF]   Node:     yrzKKRBu
[7] 2023/09/14 16:32:51.314582 [INF]   ID:       NA2IKGH4ZTUDY3QT5LHB3ZJ35RZEVR6DIZ4MD5M2XCTCHW6O6ZE6L6G4
[7] 2023/09/14 16:32:51.314618 [INF] Using configuration file: /etc/nats-config/nats.conf
[7] 2023/09/14 16:32:51.315188 [INF] Starting http monitor on 0.0.0.0:8222
[7] 2023/09/14 16:32:51.315218 [INF] Starting JetStream
[7] 2023/09/14 16:32:51.329212 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[7] 2023/09/14 16:32:51.329221 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[7] 2023/09/14 16:32:51.329226 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[7] 2023/09/14 16:32:51.329228 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[7] 2023/09/14 16:32:51.329230 [INF] 
[7] 2023/09/14 16:32:51.329232 [INF]          https://docs.nats.io/jetstream
[7] 2023/09/14 16:32:51.329233 [INF] 
[7] 2023/09/14 16:32:51.329235 [INF] ---------------- JETSTREAM ----------------
[7] 2023/09/14 16:32:51.329242 [INF]   Max Memory:      1.00 GB
[7] 2023/09/14 16:32:51.329245 [INF]   Max Storage:     8.00 GB
[7] 2023/09/14 16:32:51.329247 [INF]   Store Directory: "/data/jetstream"
[7] 2023/09/14 16:32:51.329248 [INF] -------------------------------------------
[7] 2023/09/14 16:32:51.329343 [DBG]   Exports:
[7] 2023/09/14 16:32:51.329346 [DBG]      $JS.API.>
[7] 2023/09/14 16:32:51.339013 [DBG] Enabled JetStream for account "test"
[7] 2023/09/14 16:32:51.339024 [DBG]   Max Memory:      -1 B
[7] 2023/09/14 16:32:51.339027 [DBG]   Max Storage:     -1 B
[7] 2023/09/14 16:32:51.339042 [DBG] Recovering JetStream state for account "test"
[7] 2023/09/14 16:32:51.350189 [INF]   Starting restore for stream 'test > STREAM3'
[7] 2023/09/14 16:32:52.008518 [INF]   Restored 1,000,000 messages for stream 'test > STREAM3'
[7] 2023/09/14 16:32:52.020624 [INF]   Starting restore for stream 'test > STREAM5'
[7] 2023/09/14 16:32:52.356902 [INF]   Restored 367,730 messages for stream 'test > STREAM5'
[7] 2023/09/14 16:32:52.374432 [INF]   Starting restore for stream 'test > STREAM4'
[7] 2023/09/14 16:32:52.546136 [INF]   Restored 158,887 messages for stream 'test > STREAM4'
[7] 2023/09/14 16:32:52.550342 [INF]   Starting restore for stream 'test > PING'
[7] 2023/09/14 16:32:52.554642 [INF]   Restored 0 messages for stream 'test > PING'
[7] 2023/09/14 16:32:52.558290 [INF]   Recovering 4 consumers for stream - 'test > STREAM3'
[7] 2023/09/14 16:32:52.599885 [INF]   Recovering 8 consumers for stream - 'test > STREAM5'
[7] 2023/09/14 16:32:52.642655 [INF]   Recovering 4 consumers for stream - 'test > STREAM4'
[7] 2023/09/14 16:32:52.650982 [INF]   Recovering 1 consumers for stream - 'test > PING'
[7] 2023/09/14 16:32:52.654863 [DBG] JetStream state for account "test" recovered
[7] 2023/09/14 16:32:52.654950 [INF] Starting JetStream cluster
[7] 2023/09/14 16:32:52.654953 [DBG] JetStream cluster checking for stable cluster name and peers
[7] 2023/09/14 16:32:52.654956 [INF] Creating JetStream metadata controller
[7] 2023/09/14 16:32:52.668319 [INF] JetStream cluster recovering state
[7] 2023/09/14 16:32:52.672809 [DBG] RAFT [yrzKKRBu - _meta_] Started
[7] 2023/09/14 16:32:52.673016 [INF] Listening for leafnode connections on 0.0.0.0:7422
[7] 2023/09/14 16:32:52.673051 [DBG] Starting metadata monitor
[7] 2023/09/14 16:32:52.673999 [DBG] Get non local IPs for "0.0.0.0"

node-2:

[7] 2023/09/14 16:32:50.377838 [INF] Starting nats-server
[7] 2023/09/14 16:32:50.377886 [INF]   Version:  2.9.21
[7] 2023/09/14 16:32:50.377899 [INF]   Git:      [b2e7725]
[7] 2023/09/14 16:32:50.377911 [DBG]   Go build: go1.19.12
[7] 2023/09/14 16:32:50.377922 [INF]   Cluster:  nats
[7] 2023/09/14 16:32:50.377934 [INF]   Name:     nats-2
[7] 2023/09/14 16:32:50.377947 [INF]   Node:     cnrtt3eg
[7] 2023/09/14 16:32:50.377958 [INF]   ID:       NDXDU7MK6I7GMZR5GVBM422IXA2CJA4E5MPOBKJOHT3KIUILKPPAVZHM
[7] 2023/09/14 16:32:50.377994 [INF] Using configuration file: /etc/nats-config/nats.conf
[7] 2023/09/14 16:32:50.378512 [INF] Starting http monitor on 0.0.0.0:8222
[7] 2023/09/14 16:32:50.378582 [INF] Starting JetStream
[7] 2023/09/14 16:32:50.391948 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[7] 2023/09/14 16:32:50.391996 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[7] 2023/09/14 16:32:50.392059 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[7] 2023/09/14 16:32:50.392094 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[7] 2023/09/14 16:32:50.392108 [INF] 
[7] 2023/09/14 16:32:50.392122 [INF]          https://docs.nats.io/jetstream
[7] 2023/09/14 16:32:50.392133 [INF] 
[7] 2023/09/14 16:32:50.392147 [INF] ---------------- JETSTREAM ----------------
[7] 2023/09/14 16:32:50.392177 [INF]   Max Memory:      1.00 GB
[7] 2023/09/14 16:32:50.392201 [INF]   Max Storage:     8.00 GB
[7] 2023/09/14 16:32:50.392221 [INF]   Store Directory: "/data/jetstream"
[7] 2023/09/14 16:32:50.392233 [INF] -------------------------------------------
[7] 2023/09/14 16:32:50.392348 [DBG]   Exports:
[7] 2023/09/14 16:32:50.392379 [DBG]      $JS.API.>
[7] 2023/09/14 16:32:50.400506 [DBG] Enabled JetStream for account "test"
[7] 2023/09/14 16:32:50.400554 [DBG]   Max Memory:      -1 B
[7] 2023/09/14 16:32:50.400571 [DBG]   Max Storage:     -1 B
[7] 2023/09/14 16:32:50.400605 [DBG] Recovering JetStream state for account "test"
[7] 2023/09/14 16:32:50.412566 [INF]   Starting restore for stream 'test > STREAM6'
[7] 2023/09/14 16:32:50.551325 [INF]   Restored 21 messages for stream 'test > STREAM6'
[7] 2023/09/14 16:32:50.555737 [INF]   Recovering 1 consumers for stream - 'test > STREAM6'
[7] 2023/09/14 16:32:50.565477 [DBG] JetStream state for account "test" recovered
[7] 2023/09/14 16:32:50.565551 [INF] Starting JetStream cluster
[7] 2023/09/14 16:32:50.565559 [DBG] JetStream cluster checking for stable cluster name and peers
[7] 2023/09/14 16:32:50.565561 [INF] Creating JetStream metadata controller
[7] 2023/09/14 16:32:50.572642 [INF] JetStream cluster recovering state
[7] 2023/09/14 16:32:50.573388 [WRN] RAFT [cnrtt3eg - _meta_] Snapshot corrupt, too short
[7] 2023/09/14 16:32:50.573511 [DBG] RAFT [cnrtt3eg - _meta_] Started
[7] 2023/09/14 16:32:50.573624 [DBG] Starting metadata monitor
[7] 2023/09/14 16:32:50.573694 [DBG] Recovered JetStream cluster metadata
[7] 2023/09/14 16:32:50.573696 [DBG] JetStream cluster checking for orphans
[7] 2023/09/14 16:32:50.573715 [WRN] Detected orphaned stream 'test > STREAM6', will cleanup
[7] 2023/09/14 16:32:50.573631 [INF] Listening for leafnode connections on 0.0.0.0:7422
[7] 2023/09/14 16:32:50.573859 [DBG] Get non local IPs for "0.0.0.0"
[7] 2023/09/14 16:32:50.573985 [DBG] JETSTREAM - JetStream connection closed: Client Closed
[7] 2023/09/14 16:32:50.574062 [DBG]   ip=y.y.y.y
[7] 2023/09/14 16:32:50.574295 [INF] Listening for client connections on 0.0.0.0:4222
[7] 2023/09/14 16:32:50.574304 [DBG] Get non local IPs for "0.0.0.0"
[7] 2023/09/14 16:32:50.574493 [DBG]   ip=y.y.y.y

I tried to reproduce it by sending lots of messages (10 000 000) and restarting AKS at the same time but it didn't happen. Do you have other ideas how to reproduce it?

ColinSullivan1 commented 3 months ago

Hi all, I wanted to add an update to this. With server version 2.10.12, this issue is much harder to reproduce, but is still present.

I've attached server logs with debug tracing.

We are still seeing: [7] [WRN] RAFT [cnrtt3eg - meta] Snapshot corrupt, too short when this happens.

[7] [INF] Creating JetStream metadata controller
[7] [WRN] Filestore [_meta_] Stream state too short (0 bytes)
[7] [INF] JetStream cluster recovering state
[7] [WRN] RAFT [cnrtt3eg - _meta_] Snapshot corrupt, too short
[7] [DBG] RAFT [cnrtt3eg - _meta_] Started
[7] [INF] Listening for leafnode connections on 0.0.0.0:7422
[7] [DBG] Get non local IPs for "0.0.0.0"
[7] [DBG]   ip=172.16.0.43
[7] [INF] Listening for MQTT clients on mqtt://0.0.0.0:1883
[7] [INF] Listening for client connections on 0.0.0.0:4222

In some cases, the corrupt message is seen when no streams configured on the server.

Note, there are a very large number of leaf nodes in this setup.

The backing file system is Azure storage, so I suspect this could be the file system the AKS PFC and setting the sync_interval value in the configuration to force writes using a lower value (1s) or even "always" might remedy this, but comes at a non-trivial performance cost. Any thoughts there would be appreciated.

Would there be any additional information you'd like to help debug this? Happy to hop on a call to discuss further; we can touch base over slack.

logs.zip

2/4/2024: UPDATE: MQTT is enabled so I'm thinking this may be occurring with the internal MQTT streams at times.

derekcollison commented 3 months ago

Do we have a reproducible test case or is this only seen in the user's environment?

ColinSullivan1 commented 3 months ago

This is only seen in the user's environment atm - it takes many iterations of stopping/restarting the server to reproduce.

derekcollison commented 3 months ago

Do they have an option of local storage to the node instead?

ColinSullivan1 commented 3 months ago

I'll check and let you know. The user's policy is to shut down the entire cluster nightly - we can look into local storage and incorporating snapshots/restores into their procedures.

derekcollison commented 3 months ago

ok. A short read is either the storage system did not write properly and truncated or truncated a read..

ColinSullivan1 commented 3 months ago

Unfortunately, using local storage in the container won't work for this case. We did verify the best storage class is being used in AKS. We'll experiment with sync_interval and if worse comes to worse deploy a NATS cluster outside of k8s on VMs. Additional suggestions would be appreciated...