nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Abnormal NATS write load associated with a specific jetstream #5423

Open alekseipalshin opened 2 months ago

alekseipalshin commented 2 months ago

Observed behavior

Hello.

We have detected high disk write load (up to 100 MB/s) on some servers of the NATS cluster.

[screenshot: nats_disk_usage, disk write throughput graph]

Restarting the NATS server helps reduce the disk load. We had no idea what could have caused it, so we analyzed the write activity with strace:

407112 08:02:01 write(90</data/jetstream/MYSTREAM/streams/my-stream_prod/obs/my-stream_user_rating_fetch_did_request_v1_prod/o.dat>, "\26\2\345\255\241 \244\260\206\234\1\360\213\363\177\344\323\253\226\4\242\211\303\1\364\351\230\344\f\335\300\326"..., 58764653 <unfinished ...>
407112 08:02:01 <... write resumed> )   = 58764653
407112 08:02:02 write(43</data/jetstream/MYSTREAM/streams/my-stream_prod/obs/my-stream_user_rating_fetch_did_request_v1_prod/o.dat>, "\26\2\345\255\241 \244\260\206\234\1\363\213\363\177\361\323\253\226\4\243\211\303\1\364\351\230\344\f\253\264\275"..., 58764663 <unfinished ...>
407112 08:02:02 <... write resumed> )   = 58764663
407112 08:02:03 write(43</data/jetstream/MYSTREAM/streams/my-stream_prod/obs/my-stream_user_rating_fetch_did_request_v1_prod/o.dat>, "\26\2\345\255\241 \244\260\206\234\1\364\213\363\177\363\323\253\226\4\241\211\303\1\366\351\230\344\f\316\307\242"..., 58764643 <unfinished ...>
407112 08:02:03 <... write resumed> )   = 58764643
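
A trace like the one above can be captured roughly like this (a sketch; the PID and output file are placeholders, and -y is what resolves file descriptor numbers to the /data/jetstream/... paths shown above):

# Follow all threads of the running nats-server process (PID is an example),
# timestamp each call, resolve fds to file paths, and record only write(2) calls.
strace -f -t -y -e trace=write -p 407112 -o nats-writes.trace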

nats stream info

? Select a Stream my-stream_prod
Information for Stream my-stream_prod created 2023-04-27 14:25:50

             Subjects: my-stream.*.*.prod
             Replicas: 3
              Storage: File

Options:

            Retention: Limits
     Acknowledgements: true
       Discard Policy: Old
     Duplicate Window: 2m0s
           Direct Get: true
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: unlimited
        Maximum Bytes: unlimited
          Maximum Age: 15m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited

Cluster Information:

                 Name: nats
               Leader: nats-cluster-4
              Replica: nats-cluster-1, current, seen 0.00s ago
              Replica: nats-cluster-2, outdated, seen 0.00s ago, 1 operation behind

State:

             Messages: 111,685
                Bytes: 21 MiB
             FirstSeq: 4,136,545,200 @ 2024-05-15T10:51:20 UTC
              LastSeq: 4,136,656,884 @ 2024-05-15T11:06:20 UTC
     Active Consumers: 10
   Number of Subjects: 9

nats consumer info

? Select a Stream my-stream_prod
? Select a Consumer  [Use arrows to move, type to filter]
> 
  ...
  my-stream_user_rating_fetch_did_request_v1_prod
  ...
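
The same consumer can also be inspected non-interactively; a sketch using the stream and consumer names from the strace paths above:

# Show the state of the suspect consumer (delivered / ack floor counters,
# outstanding acks, redeliveries) as JSON.
nats consumer info my-stream_prod my-stream_user_rating_fetch_did_request_v1_prod --json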

Expected behavior

Disk write throughput stays at ~1 MB/s, as it does after a restart.

Server and client version

nats-server:2.9.16

github.com/nats-io/nats.go v1.31.0

Host environment

Kubernetes v1.26.12-eks

Installed from helm chart:

apiVersion: v2
appVersion: 2.8.4
description: A Helm chart for the NATS.io High Speed Cloud Native Distributed Communications
  Technology.
home: http://github.com/nats-io/k8s
icon: https://nats.io/img/nats-icon-color.png
keywords:
- nats
- messaging
- cncf
maintainers:
- email: wally@nats.io
  name: Waldemar Quevedo
  url: https://github.com/wallyqs
- email: colin@nats.io
  name: Colin Sullivan
  url: https://github.com/ColinSullivan1
- email: jaime@nats.io
  name: Jaime Piña
  url: https://github.com/nsurfer
- email: caleb@nats.io
  name: Caleb Lloyd
  url: https://github.com/caleblloyd
name: nats
version: 0.17.5
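
Note that the chart's appVersion (2.8.4) differs from the nats-server version actually running (2.9.16), presumably via an image override in our values. To confirm which image the pods run (the "nats" namespace is an assumption):

# Print each NATS pod together with the container image it runs.
kubectl -n nats get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'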

Steps to reproduce

No response

Jarema commented 2 months ago

This is a pretty old NATS server version. I would start by upgrading to the latest 2.10.14 to see if that solves the problem. We put a lot of effort into JetStream between those two versions.
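
Roughly, via the same Helm chart (a sketch; the release name and namespace are assumptions, and the values key that pins the nats-server image tag depends on the chart version, so check its values.yaml):

# Refresh the NATS Helm repo and upgrade the release in place.
helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm repo update
helm -n nats upgrade nats nats/nats --reuse-values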

alekseipalshin commented 2 months ago

Thank you, I'll try. When we updated from 2.8.4 to 2.9.16, performance became much better.
Still, the behavior seems strange: we have many other streams and there are no problems with any of them.
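
For comparison, the per-stream and per-consumer state can be pulled with the nats CLI (a sketch):

# Overview of all streams, then per-consumer state (ack floor, unprocessed,
# redelivered) for the affected stream.
nats stream report
nats consumer report my-stream_prod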