nats-io / nats-streaming-server

NATS Streaming System Server
https://nats.io
Apache License 2.0

A single client causing a timeout on all publishers for different channels on the cluster. #970

Open renanberto opened 4 years ago

renanberto commented 4 years ago

Hello guys.

I have a scenario where all my publications time out under a high cluster load. I would understand this if the cluster were hitting some hardware limit, but there is no such limitation and no shortage of resources (today we have 3 instances in AWS, each with 32 GB of memory, 8 cores, and 1.5 TB of SSD disk).

Again, it is normal for the CLIENT generating the load to time out given the amount of work it is trying to do at once, but the cluster should not suffer these consequences and affect all publishers while it still has resources to spare.

My test scenario (which also happens in production) is basically this:

Cluster Configuration:

# Nats Server Local

streaming {
  cluster_id: local-nats
  store: file
  dir: /nats/data

  store_limits {
      max_channels: 0
      max_msgs: 0
      max_bytes: 0
      max_subs: 0
      max_inactivity: "240h"
      max_age: "120h"
  }

  file_options {
    file_descriptors_limit: 1000000
  }
}

The application runs in Docker, with this compose:

nats-local:
  image: nats-streaming:0.16.2  
  container_name: "nats-local"
  hostname: "nats"
  command: "-c /nats/nats-local.conf -m 8222"
  ports:
    - 4222:4222
    - 6222:6222
    - 8222:8222
  volumes:
    - /home/renan/projects/local/nats-local:/nats:rw

We use two applications for the test. The first one generates the load with this number of clients/channels:

package main

import (
    "fmt"
    "log"

    "sync"

    nats "github.com/nats-io/nats.go"
    stan "github.com/nats-io/stan.go"
)

const (
    tenant         = "goa-timeout-tester-1%d"
    serverURL      = "nats://ANOTHER_SERVER(NOT-LOCAL-HOST):4222"
    clusterID      = "local-nats"
    channel        = "goa.timeout.tester.publisher-1%d.%d"
    publisherCount = 35
    clientCount    = 100
    messageCount   = 1000
)

func main() {
    var wg sync.WaitGroup
    wg.Add(clientCount)

    for client := 0; client < clientCount; client++ {
        go func(client int) {
            defer wg.Done()

            // Nats connection
            natsConnection, err := nats.Connect(serverURL, nats.MaxReconnects(-1))
            if err != nil {
                panic(err)
            }

            // Streaming connection
            stanConnection, err := stan.Connect(clusterID, fmt.Sprintf(tenant, client), stan.NatsConn(natsConnection), stan.Pings(10, 100))
            if err != nil {
                panic(err)
            }

            wg.Add(publisherCount)
            for p := 0; p < publisherCount; p++ {
                go publisher(client, p, stanConnection, &wg)
            }

        }(client)
    }

    wg.Wait()
}

func publisher(client, publisher int, stanConnection stan.Conn, wg *sync.WaitGroup) {
    defer wg.Done()
    log.Printf("publication started on channel: %s\n", fmt.Sprintf(channel, client, publisher))

    for messages := 0; messages < messageCount; messages++ {
        uid, err := stanConnection.PublishAsync(fmt.Sprintf(channel, client, publisher), []byte(`Message of publisher`), func(ackedNuid string, err error) {
            if err != nil {
                log.Printf("Warning: error publishing msg id %s: %v\n", ackedNuid, err.Error())
            }
        })

        if err != nil {
            log.Println(uid, err)
        }
    }

    log.Printf("publication finished on channel: %s\n", fmt.Sprintf(channel, client, publisher))

}

The other application is IDENTICAL, except that the number of messages sent is much larger (just to trigger the error):

const (
    tenant         = "goa-timeout-tester-1%d"
    serverURL      = "nats://ANOTHER_SERVER(NOT-LOCAL-HOST):4222"
    clusterID      = "local-nats"
    channel        = "goa.timeout.tester.publisher-1%d.%d"
    publisherCount = 35
    clientCount    = 1
    messageCount   = 100000
)

The error seems to be generic (screenshot attached: Screenshot_20191114_122537).

Versions:

nats-streaming-server: 0.16.2
nats.go: master
stan.go: master

I believe it is normal for the first application to get timeouts because of the volume of publishes it is doing; however, the second application also gets timeouts even though the cluster is healthy, which does not seem normal to me.

I am available to clarify anything! Thanks in advance.

kozlovic commented 4 years ago

I am sorry, I have not had the time to reproduce yet. Maybe tomorrow. What I don't understand is how you determine that the cluster is healthy. If an application gets publish ack timeouts, and assuming the publishes were not lost, then ALL publishers will be affected, because an ack timeout usually means the server/cluster is not able to handle the flow of incoming messages. What I mean is that the leader accepts messages from all publishers; if there is too much to do, why would one application be spared while another is affected? The server "listens" to _STAN.<cluster name>.pub.> and messages for every channel are sent there. Note that having separate internal subscriptions is not possible, but even if it were, in the end all data still has to be replicated, and if that is the bottleneck then separate subscriptions would not change anything.

renanberto commented 4 years ago

Ivan, thank you very much for your reply.

I consider a server healthy when there is still hardware headroom, that is, when its limits have not been reached (in this scenario it was not even close). Note that in my case there are 3 m4.2xlarge servers and we do not exceed 50% of available memory or 60% of available CPU. With the scenario I sent you, you will notice that the NATS server (wherever it runs) still has resources to spare when publications start to fail.
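
For reference, a minimal sketch of how the streaming server itself can be checked beyond host metrics, through the monitoring port enabled above with -m 8222 (this assumes the server is reachable on localhost and uses the /streaming/serverz monitoring route):

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

func main() {
    // 8222 is the monitoring port from "-m 8222" in the compose file above;
    // adjust the host to wherever the server actually runs.
    resp, err := http.Get("http://localhost:8222/streaming/serverz")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Decode into a generic map so the sketch does not depend on exact field names.
    var status map[string]interface{}
    if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
        log.Fatal(err)
    }
    for k, v := range status {
        fmt.Printf("%s: %v\n", k, v)
    }
}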

kozlovic commented 4 years ago

We need to find out whether the Streaming server is really falling behind when processing this influx of messages. My point was that it does not matter that application A is the one sending lots of data and you don't want application B to be impacted: they are both handled by the same internal subscription. So the question is how fast the streaming server can handle the influx of messages. If I may use a (probably bad) analogy: it does not matter how many lanes a highway has, you will not go faster than your car can go. The hardware may have plenty of resources left, but it could be that the streaming cluster cannot handle more. Again, we need to make sure that these pub ack timeouts really happen because the server falls behind, and not because of an issue in the client or between the client and server.
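
One rough way to separate the two is to run a second, mostly idle connection next to the load generator and time a synchronous Publish: it blocks until the pub ack comes back (or the ack wait expires), so the measured round trip reflects how far behind the server is rather than client-side buffering of async publishes. A minimal sketch, where the URL, client ID and channel name are placeholders:

package main

import (
    "log"
    "time"

    stan "github.com/nats-io/stan.go"
)

func main() {
    // Separate, mostly idle connection whose only job is to measure pub ack latency.
    sc, err := stan.Connect("local-nats", "ack-latency-probe",
        stan.NatsURL("nats://localhost:4222"))
    if err != nil {
        log.Fatal(err)
    }
    defer sc.Close()

    for i := 0; i < 10; i++ {
        start := time.Now()
        // Publish is synchronous: it returns once the pub ack is received
        // (or the ack wait expires), so the elapsed time is the ack round trip.
        if err := sc.Publish("ack.latency.probe", []byte("ping")); err != nil {
            log.Printf("publish error: %v", err)
        } else {
            log.Printf("pub ack round trip: %v", time.Since(start))
        }
        time.Sleep(time.Second)
    }
}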

renanberto commented 4 years ago

Ivan, thanks again for the reply.

I totally understand what you are telling me, and that has basically been my question from the start: is there any limit in NATS Streaming even when there is hardware left to use?

And if so, where does it come from?

kozlovic commented 4 years ago

@renanberto Sorry, I have been very busy.

As stated before, make sure that the issue is not in the client: use master, including this fix: https://github.com/nats-io/stan.go/commit/cd988d9d12268734690970a1899c2a3329a0f811. If that is fine and there is no network issue between the client and server that could explain the pub ack timeouts, we also need to make sure that you are not running the server with debug/trace logging, which would affect performance.

If all of that checks out and there are still pub ack timeouts, it simply means that the influx is too much for the NATS Streaming server. As I tried to explain, all incoming messages for all clients/channels are processed from a single internal NATS subscription. The same goes for RAFT: replication is mainly handled in a single thread, so CPU usage on a big machine may be reported as low while that one thread is pegging a CPU. Or it could simply be that the replication of messages (the leader has to wait for the follower nodes to ack) imposes a maximum processing rate even though you are neither CPU nor network bound, which means that if too many messages come in, they start to back up and eventually cause the client to time out. You may want to reduce the maximum number of unacknowledged published messages per connection (MaxPubAcksInflight in stan.go) to avoid flooding the server.
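
For that last point, something along these lines (a sketch only; the URL, client ID and numbers below are placeholders, not recommendations):

package main

import (
    "log"
    "time"

    nats "github.com/nats-io/nats.go"
    stan "github.com/nats-io/stan.go"
)

func main() {
    // Same NATS connection setup as in the reproduction above.
    nc, err := nats.Connect("nats://localhost:4222", nats.MaxReconnects(-1))
    if err != nil {
        log.Fatal(err)
    }

    // Streaming connection with two back-pressure options added; the numbers
    // are examples only and should be tuned to the actual flow.
    sc, err := stan.Connect("local-nats", "goa-timeout-tester-throttled",
        stan.NatsConn(nc),
        stan.Pings(10, 100),
        stan.MaxPubAcksInflight(1024),   // cap unacknowledged async publishes per connection
        stan.PubAckWait(30*time.Second)) // allow more time for the server to ack before timing out
    if err != nil {
        log.Fatal(err)
    }
    defer sc.Close()

    log.Println("connected; publish as in the reproduction above")
}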

kozlovic commented 4 years ago

@renanberto Can you build the server from source and use that in your environment? Since you have experienced slow RAFT log deletion, I am wondering if re-enabling an option would help in your situation. The line in question is this: https://github.com/nats-io/nats-streaming-server/blob/master/server/raft_log.go#L69

There was an issue in the library we use that caused me to disable that option, but now that it is fixed, and we vendor the release that contains the fix, I may re-enable it. It would be great if you could confirm whether this helps in your environment.

renanberto commented 4 years ago

@kozlovic, sorry, I have not had time to test yet. I will get back to you with an answer this week.