nats-io / nats-streaming-server

NATS Streaming System Server
https://nats.io
Apache License 2.0

Cluster performance with file store and sync publish #971

Open snoopaku opened 4 years ago

snoopaku commented 4 years ago

Hi! We have a question about nats-streaming cluster performance.

Our requirements:

Our environment:

The question is: what performance (for 1 pub + 1 sub) can we expect with such requirements and environment?

We couldn't really find any benchmark results for a cluster with sync publish, except #520. In that issue all 3 nodes of the cluster are on the same host and the publish is sync (stan.go v0.3.4 at that time used sync publish by default). Those results are far from what we see now; we haven't come anywhere close to such numbers. Are our numbers normal, or are we doing something completely wrong? If we are, can you please advise us where to look for the problem and which parameters are responsible for this? If not, can you please give us a real example of cluster performance with sync publish?

Our node configuration after some experiments, which gives us roughly ~400 msgs/sec:

# NATS specific configuration
host: 10.46.172.12
port: 4222
http_port: 5555
log_file: "/var/lib/nats_streaming/log/nats-server.log"
cluster {
  listen: 10.46.172.12:6222
  routes: ["nats://10.46.172.13:6222", "nats://10.46.172.15:6222"]
}

# NATS Streaming specific configuration
streaming {
  id: test-cluster-1
  store: "file"
  store_limits: {
    max_inactivity: "1m"
    max_msgs: 100000
    max_bytes: 0
    max_age: "0" 
  }
  file_options: {
    slice_max_msgs: 10000
    slice_max_bytes: 0 
    slice_max_age: "0"
    descriptors_limit: 0
  }
  dir: /var/lib/nats_streaming/data
  file: {
    sync: false
    buffer_size: 10512MB
  }
  cluster {
    node_id: "a"
    peers: [ "b", "c" ]
    sync: false
    log_cache_size: 10124
    log_path: "/var/lib/nats_streaming/cluster_log"
    log_snapshots: 10
    trailing_logs: 100000
    raft_logging: false
  }
}

How we run benchmarks:

./stan-bench -s "nats://10.46.172.12, nats://10.46.172.13, nats://10.46.172.15" -c test-cluster-1 -np 1 -ns 1 -n 100000 -ms 1024 -io -sync stan-bench

Some info:

kozlovic commented 4 years ago

Giving benchmark numbers is pointless. Here is a good example of why: https://github.com/nats-io/nats-streaming-server/issues/968. You will see in that issue, opened only a few days ago, how the same executable and the same test, on a similar OS/kernel and with hardware that is supposed to be very similar, produce different outcomes.

As stated in that issue, running with -sync will be slow. You are sending 1 message at a time: the server receives it, persists it on disk and replicates it in the 3-node cluster, sends it to the consumer, then sends the ACK back to the publisher so that it can send the next one. You may get more out of it by having the bench use more than 1 connection to produce messages, so that the server can handle multiple published messages at the same time.
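For instance, here is a minimal stan.go sketch (not the bench tool itself) contrasting a synchronous publish loop with an asynchronous one; the cluster ID, server URL and subject are taken from the config and command in this issue, while the client ID and message count are made up for the example:

package main

import (
	"log"
	"sync"

	stan "github.com/nats-io/stan.go"
)

func main() {
	// Cluster ID and URL from the config above; client ID is arbitrary here.
	sc, err := stan.Connect("test-cluster-1", "bench-pub-1",
		stan.NatsURL("nats://10.46.172.12:4222"))
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	msg := make([]byte, 1024)

	// Synchronous publish: each call blocks until the server has persisted,
	// replicated and acknowledged the message, so throughput is bounded by
	// one full round trip per message.
	for i := 0; i < 1000; i++ {
		if err := sc.Publish("stan-bench", msg); err != nil {
			log.Fatal(err)
		}
	}

	// Asynchronous publish: many messages can be in flight at once; the ack
	// handler is invoked as the server acknowledges each one.
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		_, err := sc.PublishAsync("stan-bench", msg, func(guid string, err error) {
			if err != nil {
				log.Printf("ack error for %s: %v", guid, err)
			}
			wg.Done()
		})
		if err != nil {
			log.Fatal(err)
		}
	}
	wg.Wait()
}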

As for the degradation after repeated runs, I wonder if this could have to do with changing these params:

    log_snapshots: 10
    trailing_logs: 100000

You could try without those.

Also, you have set a rather low limit on the number of messages per file slice and a very low channel inactivity limit. Once a file slice is full, a new one is created. Any time you hit a limit, the server has to update the first slice to indicate that a message has been removed. Only when that file slice becomes "empty" can it be removed.
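For illustration only, relaxing those settings could look something like the snippet below (the values are arbitrary examples, not recommendations):

  store_limits: {
    max_inactivity: "24h"     # keep idle channels around much longer
    max_msgs: 100000
  }
  file_options: {
    slice_max_msgs: 1000000   # or 0 to not cap slices by message count
  }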

Also, just to see if that improves perf (it is not an ultimate solution): once you figure out which node is the leader, you could try running stan-bench with only the URL of that server. You can check the logs for "server became leader, performing leader promotion actions" (keep in mind that a server may become leader, then lose leadership, reacquire it later, etc.), or you can check the monitoring endpoint (https://docs.nats.io/nats-streaming-concepts/monitoring/endpoints#serverz). The client library automatically gets the URLs of the rest of the cluster after connecting (which allows it to reconnect), but this minimizes the number of hops the messages have to go through. Again, I don't think that this is the reason for the slow performance, but it is worth a try.
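As a rough illustration (assuming the monitoring port 5555 from the config above; the "role" field name is based on typical serverz output and may vary between versions), something like this could be used to see which node currently reports itself as leader:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Hosts from the config in this issue.
	hosts := []string{"10.46.172.12", "10.46.172.13", "10.46.172.15"}
	for _, h := range hosts {
		resp, err := http.Get(fmt.Sprintf("http://%s:5555/streaming/serverz", h))
		if err != nil {
			fmt.Printf("%s: %v\n", h, err)
			continue
		}
		// Only the "role" field is decoded; in a clustered setup it is
		// expected to report Leader/Follower (assumption, see lead-in).
		var sz struct {
			Role string `json:"role"`
		}
		if err := json.NewDecoder(resp.Body).Decode(&sz); err != nil {
			fmt.Printf("%s: %v\n", h, err)
		} else {
			fmt.Printf("%s: role=%s\n", h, sz.Role)
		}
		resp.Body.Close()
	}
}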