nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Killing the JetStream leader loses messages #3402

Closed: Doslin closed this issue 2 years ago

Doslin commented 2 years ago

Defect


Versions of nats-server and affected client libraries used:

2.8.4 && 2.9.0-RC.2

OS/Container environment:

MacOS 12.3

Steps or code to reproduce the issue:

  1. nats-server -s "nats://172.24.144.238:4001,nats://172.24.144.238:4002,nats://172.24.144.238:4003,nats://172.24.144.238:4004"
  2. nats -s "nats://172.24.144.238:4004" str create cluster_stream_01 --subjects=group01 --storage=file --replicas=2 --retention=limits --discard=old --max-msgs=-1 --max-msgs-per-subject=-1 --max-bytes=-1 --max-age=-1 --max-msg-size=-1 --dupe-window=2m0s --allow-rollup --no-deny-delete --no-deny-purge
  3. nats pub -s "nats://172.24.144.238:4002,nats://172.24.144.238:4003" group01 --count 200000 "Message {{Count}} @ {{Time}}"
  4. Kill the stream leader, nats://172.24.144.238:4004
  5. Wait 3 seconds, then start nats://172.24.144.238:4004 again
  6. nats -s "nats://172.24.144.238:4001,nats://172.24.144.238:4002" str info cluster_stream_01
  7. Output of step 6: Messages: 161,355 Bytes: 9.3 MiB FirstSeq: 1 @ 2022-08-25T02:33:49 UTC LastSeq: 161,355 @ 2022-08-25T02:34:38 UTC

Expected result:

Messages: 200,000

Actual result:

Messages: 161,355

Doslin commented 2 years ago

nats-server.config

172.24.144.238:4004

listen: 172.24.144.238:4004
http_port: 4804
debug: true
trace: true
logfile: /Users/zhilin/code/GoLand/nats-jet/src/nats-server/conf/.zhilin4/region4.sa.log
server_name: 172.24.144.238:4004
jetstream {
    store_dir="/Users/zhilin/code/GoLand/nats-jet/src/nats-server/conf/.zhilin4/"
}
cluster {
    name: zhilin_cluster
    listen: 172.24.144.238:4404
    routes = [
    nats-route://172.24.144.238:4101
    nats-route://172.24.144.238:4202
    ]
}

172.24.144.238:4003

listen: 172.24.144.238:4003
http_port: 4803
debug: true
trace: true
logfile: /Users/zhilin/code/GoLand/nats-jet/src/nats-server/conf/.zhilin3/region3.sa.log
server_name: 172.24.144.238:4003
jetstream {
    store_dir="/Users/zhilin/code/GoLand/nats-jet/src/nats-server/conf/.zhilin3/"
}
cluster {
    name: zhilin_cluster
    listen: 172.24.144.238:4303
    routes = [
    nats-route://172.24.144.238:4101
    nats-route://172.24.144.238:4202
    ]
}

172.24.144.238:4002

172.24.144.238:4001

Doslin commented 2 years ago

nats-server -DV output

172.24.144.238:4003

172.24.144.238:4003 ==> region3.sa.log

172.24.144.238:4001

replica 172.24.144.238:4001 ==> region1.sa.log

172.24.144.238:4004

JetStream Leader log 172.24.144.238:4004 ==> region4.sa.log

kozlovic commented 2 years ago

To me it looks like you are not publishing "JetStream" messages, in the sense that the client does not wait for confirmation that each message was persisted. If you kill the server as soon as the publisher finishes, there may still be in-flight messages that have not been processed yet. You could use nats bench -js instead of a plain nats pub and see whether the outcome is different. Check nats bench --help for more information on the available arguments.
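
The difference between the two publishing styles in the comment above can be sketched with a toy model. This is not NATS code and does not use any NATS client API: `ToyServer`, `fire_and_forget`, and `publish_with_acks` are hypothetical names invented for illustration, and the model assumes a crash simply discards whatever the server has received but not yet persisted. The counts (200,000 published, 161,355 persisted before the kill) are borrowed from the report only to make the shortfall visible; the model does not explain why that particular number was reached.

```python
from collections import deque


class ToyServer:
    """Hypothetical stand-in for a stream leader: buffers first, persists later."""

    def __init__(self):
        self.inflight = deque()  # received but not yet persisted
        self.stored = []         # persisted messages
        self.alive = True

    def receive(self, msg):
        """Accept a message into the in-memory buffer (nothing is persisted yet)."""
        if not self.alive:
            raise ConnectionError("server is down")
        self.inflight.append(msg)

    def process(self, n=None):
        """Persist up to n buffered messages (all if n is None) and return them as acks."""
        acked = []
        while self.inflight and (n is None or len(acked) < n):
            msg = self.inflight.popleft()
            self.stored.append(msg)
            acked.append(msg)
        return acked

    def kill(self):
        """Simulate a crash: everything still in flight is dropped."""
        self.alive = False
        self.inflight.clear()

    def restart(self):
        self.alive = True


def fire_and_forget(server, total, processed_before_kill):
    """Publisher that never looks at acks, so it cannot notice dropped messages."""
    for i in range(total):
        server.receive(i)
    server.process(processed_before_kill)
    server.kill()
    server.restart()
    return len(server.stored)


def publish_with_acks(server, total, processed_before_kill):
    """Publisher that tracks unacknowledged messages and resends them after the crash."""
    pending = set(range(total))
    for i in range(total):
        server.receive(i)
    pending -= set(server.process(processed_before_kill))
    server.kill()
    server.restart()
    for i in sorted(pending):  # resend only what was never acknowledged
        server.receive(i)
    pending -= set(server.process())
    return len(server.stored), len(pending)


if __name__ == "__main__":
    print("fire and forget:", fire_and_forget(ToyServer(), 200_000, 161_355))
    print("with acks (stored, unacked):", publish_with_acks(ToyServer(), 200_000, 161_355))
```

In the toy model the ack-tracking publisher ends with every message stored because it knows exactly which ones to resend. In a real deployment a resend can race with an ack that was lost in transit, which is why a publisher that retries would normally also rely on the stream's duplicate window (the `--dupe-window=2m0s` setting in the report) together with per-message IDs.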