nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Abnormal NATS write load associated with a specific JetStream [v2.9.16, v2.10.16] #5423

Open alekseipalshin opened 6 months ago

alekseipalshin commented 6 months ago

Observed behavior

Hello.

We have detected high disk write load (up to 100 MB/s) on some servers of the NATS cluster.

nats_disk_usage

Restarting the NATS server reduces the disk load. We had no idea what could have caused it, so we analyzed the process with strace:

407112 08:02:01 write(90</data/jetstream/MYSTREAM/streams/my-stream_prod/obs/my-stream_user_rating_fetch_did_request_v1_prod/o.dat>, "\26\2\345\255\241 \244\260\206\234\1\360\213\363\177\344\323\253\226\4\242\211\303\1\364\351\230\344\f\335\300\326"..., 58764653 <unfinished ...>
407112 08:02:01 <... write resumed> )   = 58764653
407112 08:02:02 write(43</data/jetstream/MYSTREAM/streams/my-stream_prod/obs/my-stream_user_rating_fetch_did_request_v1_prod/o.dat>, "\26\2\345\255\241 \244\260\206\234\1\363\213\363\177\361\323\253\226\4\243\211\303\1\364\351\230\344\f\253\264\275"..., 58764663 <unfinished ...>
407112 08:02:02 <... write resumed> )   = 58764663
407112 08:02:03 write(43</data/jetstream/MYSTREAM/streams/my-stream_prod/obs/my-stream_user_rating_fetch_did_request_v1_prod/o.dat>, "\26\2\345\255\241 \244\260\206\234\1\364\213\363\177\363\323\253\226\4\241\211\303\1\366\351\230\344\f\316\307\242"..., 58764643 <unfinished ...>
407112 08:02:03 <... write resumed> )   = 58764643
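A trace of this shape can be captured by attaching strace to the nats-server process; the flags and PID lookup below are illustrative rather than the exact command we ran:

# -f follows threads, -t adds timestamps, -y resolves fd numbers to file paths
strace -f -t -y -e trace=write -p "$(pidof nats-server)" -o nats-writes.trace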

nats stream info

? Select a Stream my-stream_prod
Information for Stream my-stream_prod created 2023-04-27 14:25:50

             Subjects: my-stream.*.*.prod
             Replicas: 3
              Storage: File

Options:

            Retention: Limits
     Acknowledgements: true
       Discard Policy: Old
     Duplicate Window: 2m0s
           Direct Get: true
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: unlimited
        Maximum Bytes: unlimited
          Maximum Age: 15m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited

Cluster Information:

                 Name: nats
               Leader: nats-cluster-4
              Replica: nats-cluster-1, current, seen 0.00s ago
              Replica: nats-cluster-2, outdated, seen 0.00s ago, 1 operation behind

State:

             Messages: 111,685
                Bytes: 21 MiB
             FirstSeq: 4,136,545,200 @ 2024-05-15T10:51:20 UTC
              LastSeq: 4,136,656,884 @ 2024-05-15T11:06:20 UTC
     Active Consumers: 10
   Number of Subjects: 9

nats consumer info

? Select a Stream my-stream_prod
? Select a Consumer  [Use arrows to move, type to filter]
> 
  ...
  my-stream_user_rating_fetch_did_request_v1_prod
  ...
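The same details can be pulled non-interactively, using the stream and consumer names from the output above:

nats stream info my-stream_prod
nats consumer info my-stream_prod my-stream_user_rating_fetch_did_request_v1_prod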

Expected behavior

Disk write throughput stays at ~1 MB/s, as it does right after a restart.

Server and client version

nats-server:2.9.16

github.com/nats-io/nats.go v1.31.0

Host environment

Kubernetes v1.26.12-eks

Installed from helm chart:

apiVersion: v2
appVersion: 2.8.4
description: A Helm chart for the NATS.io High Speed Cloud Native Distributed Communications
  Technology.
home: http://github.com/nats-io/k8s
icon: https://nats.io/img/nats-icon-color.png
keywords:
- nats
- messaging
- cncf
maintainers:
- email: wally@nats.io
  name: Waldemar Quevedo
  url: https://github.com/wallyqs
- email: colin@nats.io
  name: Colin Sullivan
  url: https://github.com/ColinSullivan1
- email: jaime@nats.io
  name: Jaime Piña
  url: https://github.com/nsurfer
- email: caleb@nats.io
  name: Caleb Lloyd
  url: https://github.com/caleblloyd
name: nats
version: 0.17.5
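For reference, the deployed release, chart version, and effective values can be checked with helm; the release name and namespace below are placeholders:

helm list -n nats
helm get values nats -n nats
kubectl get pods -n nats -o jsonpath='{.items[*].spec.containers[*].image}'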

Steps to reproduce

No response

Jarema commented 6 months ago

This is a pretty old NATS server version. I would start with upgrading to the latest 2.10.14 to see if that solves the problem. We put a lot of effort into JetStream between those two versions.
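Once the upgrade has rolled out, it is worth confirming that every node is actually running the new binary. The pod name and namespace below are placeholders:

kubectl exec -n nats nats-cluster-0 -- nats-server --version
# the JetStream report needs a system-account context for the nats CLI
nats server report jetstream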

alekseipalshin commented 6 months ago

Thank you, I'll try. When we updated the version from 2.8.4 to 2.9.16, the performance became much better.
The behavior still seems strange, though: we have many other streams and none of them have this problem.

MilkyWay-core commented 1 month ago

Hi! I'm part of the team. We upgraded the cluster to version 2.10.16, but we are still seeing the problem.

neilalexander commented 1 month ago

Please can you provide some context? At a minimum: updated stream info and consumer info, updated statistics/graphs, and any other relevant configuration changes.

MilkyWay-core commented 1 month ago

We have detected high disk write load on some servers of the NATS cluster. Restarting helps resolve the issue.

The cluster has 3 nodes running v2.10.16. It hosts 8 JetStream streams, and all of them use the same options.

stream info

          Replicas: 3
          Storage: File

Options:

         Retention: Limits
   Acknowledgments: true
    Discard Policy: Old
  Duplicate Window: 2m0s
        Direct Get: true
 Allows Msg Delete: true
      Allows Purge: true
    Allows Rollups: false

Limits:

      Maximum Messages: unlimited
   Maximum Per Subject: unlimited
         Maximum Bytes: unlimited
           Maximum Age: 15m0s
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Cluster Information:

              Name: nats-v2-10
            Leader: nats-v2-10-1
           Replica: nats-v2-10-0, current, seen 57ms ago
           Replica: nats-v2-10-2, current, seen 56ms ago

State:

          Messages: 23,913
             Bytes: 11 MiB
    First Sequence: 239,457,256 @ 2024-10-18 09:37:24 UTC
     Last Sequence: 239,481,168 @ 2024-10-18 09:52:24 UTC
  Active Consumers: 11
Number of Subjects: 10

consumer info

Configuration:

                Name: bid_did_active_v1_ru_prod
           Pull Mode: true
      Filter Subject: bids.bid_did_active.v1.ru.prod
      Deliver Policy: All
          Ack Policy: Explicit
            Ack Wait: 6.00s
       Replay Policy: Instant
  Maximum Deliveries: 5
   Max Waiting Pulls: 512

Cluster Information:

                Name: nats-v2-10
              Leader: nats-v2-10-1
             Replica: nats-v2-10-0, current, seen 23ms ago
             Replica: nats-v2-10-2, current, seen 23ms ago

State:

  Last Delivered Message: Consumer sequence: 44,196,707 Stream sequence: 239,491,181 Last delivery: 38ms ago
    Acknowledgment Floor: Consumer sequence: 44,196,707 Stream sequence: 239,491,181 Last Ack: 24ms ago
        Outstanding Acks: 0
    Redelivered Messages: 0
    Unprocessed Messages: 0
           Waiting Pulls: 2 of maximum 512
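Since all eight streams use the same options, the report commands are a quick way to check whether the write load follows one stream or consumer in particular. The stream name is not shown above, so it stays a placeholder here:

nats stream report
nats consumer report <stream-name>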

metrics

Before the restart > 100 MB/s, after it < 2 MB/s

image

Memory

image

Load profile doesn't change

image
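To correlate these graphs with the nats-server process itself, the per-process write rate can also be sampled on the node. This assumes the sysstat tools are available there, and the PID lookup is illustrative:

pidstat -d -p "$(pidof nats-server)" 1   # kB written per second by the process
cat /proc/"$(pidof nats-server)"/io      # cumulative write_bytes counter
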
MilkyWay-core commented 3 weeks ago

After 12 days, we see disk writes slowly increasing day by day.

image

Load profile

image

MilkyWay-core commented 3 weeks ago

@neilalexander do you need more information?