nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io

cluster_size metric doesn't reflect real cluster size. #2657

Closed lesovsky closed 1 year ago

lesovsky commented 3 years ago

Defect

v2.6.2 has a new cluster_size metric in /varz and /jsz endpoints. It seems it should show how many nodes are in the cluster.

After the initial setup, when comparing cluster_size values from all hosts, I found that 1) the leader counts only replicas and doesn't count itself; or 2) sometimes cluster_size on all hosts is less than the total number of hosts (with no errors in the logs at the same time). This is a bit confusing, since it may seem the cluster is degraded. After stopping the leader, a new leader is elected; after starting the old leader again, all cluster_size values show the same numbers.

Versions of nats-server and affected client libraries used: 2.6.2, 2.6.3

OS/Container environment:

Steps or code to reproduce the issue:

Expected result:

All values should be equal after cluster initialization

Actual result:

cluster_size on leader is less than on replicas
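
(For reference, the comparison was done by reading the metric from each node's monitoring endpoint; a minimal sketch, with placeholder hostnames:)

# read the JetStream meta cluster size as seen by each node
for host in nats1 nats2 nats3; do
  curl -s "http://$host:8222/jsz" | jq .meta_cluster.cluster_size
done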

ripienaar commented 3 years ago

From what I understand reading #2579, this is not supposed to be the total cluster size - it's the JetStream meta group information, which is dynamically calculated based on known JS-enabled servers. I am not entirely sure why the leader doesn't count itself - but that's typical for stream/consumer/meta group information: none of them include the leader as part of the cluster info, which I agree is very confusing.

I am not sure what the desired behavior is, but it's on purpose that this number doesn't always match the total cluster size, since we support mixed-mode setups where only some servers in a cluster have JetStream enabled.
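
(As an illustration of such a mixed-mode setup, only some servers enable JetStream in their configs; a minimal sketch, with placeholder names and paths:)

# server A: JetStream enabled, counts toward the meta group
server_name: nats-a
jetstream {
  store_dir: /var/lib/nats-server
}

# server B: core NATS only - no jetstream block, so it does not
# count toward the JetStream meta group
server_name: nats-b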

derekcollison commented 3 years ago

We could adjust to count the leader..

lesovsky commented 3 years ago

From the monitoring (observability) point of view, it is important to have metrics which briefly describes health of cluster. I supposed to use "cluster_size" for this purpose, because there is nothing else. It would be nice if in further versions will be added metrics (exposed by http-endpoints) which describe overall health of the cluster and total number of alive/dead nodes.

I found that such info could be found using natscli (mostly from server sub-command) tool, and it would be nice if the same info could be obtained from http.

ripienaar commented 3 years ago

If the CLI can do it so can http. It gets exactly the same data.

The only difference is the CLI gathers it from the entire fleet and then aggregates the data.

I don’t think we will ever be able to provide all data reliably from every node. That’s not how distributed systems work.
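
(Both paths return the same underlying data; a minimal sketch, assuming a natscli context named "system" pointing at the system account and placeholder hostnames:)

# fleet-wide: the CLI sends a request over the system account and
# prints the answer from every responding server
nats server req jsz --context system | jq .data.meta_cluster.cluster_size

# per node: each monitoring endpoint answers only with that server's own view
curl -s http://nats1:8222/jsz | jq .meta_cluster.cluster_size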

lesovsky commented 3 years ago

The only difference is the CLI gathers it from the entire fleet and then aggregates the data.

Exactly, aggregates. Natscli provides output about all nodes in a table format, and this is impossible to see in the http-endpoint output.

I don’t think we will ever be able to provide all data reliably from every node

Anyway, all nodes work over RAFT, which means a particular node could output its current RAFT state. Let me know if this information already exists in an http endpoint (maybe meta_cluster in /jsz, but I'm not sure).

ripienaar commented 3 years ago

Each node has its view in /jsz yes, but it’s not the entire world in there. There are many layers of raft and any given node only has a view of a subset.

lesovsky commented 3 years ago

We could adjust to count the leader..

It would be great.

lesovsky commented 3 years ago

Each node has its view in /jsz yes, but it’s not the entire world in there.

yes, I agree - it is not the entire world, but when requesting the state it is sufficient to get the current state right now, because with long-term measurements we will see the whole picture of how this "world" changes - whether the number of hosts stays the same most of the time, or whether it changes (and the cluster is unstable). Or maybe expose information about whether the cluster has a leader or not (like it is done in etcd with the etcd_server_has_leader metric).
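
(Something like the etcd metric can already be approximated per node from /jsz, using the meta_cluster.leader field that appears later in this thread; a minimal sketch with a placeholder host:)

# emit 1 if this node currently sees a meta-group leader, 0 otherwise
curl -s http://nats1:8222/jsz | jq 'if .meta_cluster.leader then 1 else 0 end'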

ripienaar commented 3 years ago

You can get info about the core nats cluster in /routez - what is it you are trying to find exactly?
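
(A minimal sketch of reading that core cluster information; num_routes is the count of route connections reported by /routez, and the hostname is a placeholder:)

# number of route connections this node has to the other cluster members
curl -s http://nats1:8222/routez | jq .num_routes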

lesovsky commented 3 years ago

what is it you are trying to find exactly

I am looking for a simple metric which shows the number of nodes in the cluster, including the leader.

ripienaar commented 3 years ago

Specifically for JetStream, or for nats core also? And do you have a super cluster?

lesovsky commented 3 years ago

No, I have no super clusters (no gateways, no leafnodes), just simple setups with 3 or 5 nodes.

Specifically for JetStream, or for nats core also?

For JetStream.

ripienaar commented 3 years ago

I don’t think we have a single number for that. You would need to count the array size or something like that atm.
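
(As an illustration of counting the array size, a sketch that assumes /jsz exposes the meta group peers as a replicas array and adds one for the local server itself:)

# replicas this node knows about in the meta group, plus the node itself
curl -s http://nats1:8222/jsz | jq '(.meta_cluster.replicas | length) + 1'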

ripienaar commented 3 years ago

fwiw, though, in my setup this number seems correct, even on the leader:

[rip@p1-lon]% nats server req jsz --context system.lon |jq .data.meta_cluster.cluster_size
9
9
9
9
9
9
9
9
9

How do you define your routes, do you list all nodes or rely on some dynamic configuration and seeding of routes?

lesovsky commented 3 years ago

fwiw, though, in my setup this number seems correct, even on the leader:

Did you try to check right after the initial setup (as mentioned in the first message)?

I repeated the test again and got the same result (which is why I opened the issue):

$ for i in 1 2 3; do curl -s 192.168.122.1$i:8222/jsz |jq .meta_cluster.leader; done
"nats3"
"nats3"
"nats3"

$ for i in 1 2 3; do curl -s 192.168.122.1$i:8222/jsz |jq .meta_cluster.cluster_size; done
3
3
2

Why 2?

Used config:

# Ansible managed

# HTTP monitoring port
port: 4222
http: 8222
syslog: true
pid_file: /var/lib/nats-server/nats.pid

server_name: nats1
jetstream: true

authorization {
#  default_permissions = {
#    publish = "SANDBOX.*"
#    subscribe = ["PUBLIC.>", "_INBOX.>"]
#  }

  user1 = {
    publish = ">"
    subscribe = ">"
  }

  users = [
    {user: admin, password: "password"}
    {user: user1, password: "password", permissions: $user1 }
  ]
}

accounts: {
    SYS: {
        users: [
            { user: admin, password: password }
        ]
    },
}

system_account: SYS

cluster {
  listen: 0.0.0.0:5222
  name: test-cluster

  # Authorization for route connections plaintext
  authorization {
      user: admin
      password: password
  }

  routes: [
     "nats-route://admin:password@nats2:5222"
     "nats-route://admin:password@nats3:5222"
  ]

}

jetstream: {
    store_dir: /var/lib/nats-server
    max_memory_store: 1GB
    max_file_store: 1GB
}

How do you define your routes, do you list all nodes or rely on some dynamic configuration and seeding of routes?

Please read carefully the first message I wrote; the steps to reproduce are described there.

ripienaar commented 3 years ago

Please read carefully the first message I wrote; the steps to reproduce are described there.

Clearly additional information is required since, as demonstrated, my environment does not exhibit this behavior, so additional questions and explorations are being asked. As you can see, your original message does NOT answer these questions.

lesovsky commented 3 years ago

You can find the answers in the config attached above.

lesovsky commented 3 years ago

I thought maybe the issue was tied to a wrong configuration (explained and fixed here), but unfortunately, after adjusting the config, the behavior has not changed - cluster_size is smaller than the real number of hosts in the cluster.

root@nats3:~# nats server list
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                      Server Overview                                                      │
├───────┬──────────────┬───────────┬─────────┬─────┬───────┬──────┬────────┬─────┬────────┬─────┬──────┬────────┬───────────┤
│ Name  │ Cluster      │ IP        │ Version │ JS  │ Conns │ Subs │ Routes │ GWs │ Mem    │ CPU │ Slow │ Uptime │ RTT       │
├───────┼──────────────┼───────────┼─────────┼─────┼───────┼──────┼────────┼─────┼────────┼─────┼──────┼────────┼───────────┤
│ nats1 │ test-cluster │ 0.0.0.0   │ 2.6.3   │ yes │ 0     │ 132  │ 2      │ 0   │ 19 MiB │ 0.0 │ 0    │ 34m8s  │ 978.959µs │
│ nats2 │ test-cluster │ 0.0.0.0   │ 2.6.3   │ yes │ 0     │ 132  │ 2      │ 0   │ 22 MiB │ 0.0 │ 0    │ 34m8s  │ 961.162µs │
│ nats3 │ test-cluster │ 0.0.0.0   │ 2.6.3   │ yes │ 1     │ 132  │ 2      │ 0   │ 20 MiB │ 0.0 │ 0    │ 34m8s  │ 871.655µs │
├───────┼──────────────┼───────────┼─────────┼─────┼───────┼──────┼────────┼─────┼────────┼─────┼──────┼────────┼───────────┤
│       │ 1 Clusters   │ 3 Servers │         │ 3   │ 1     │ 396  │        │     │ 60 MiB │     │ 0    │        │           │
╰───────┴──────────────┴───────────┴─────────┴─────┴───────┴──────┴────────┴─────┴────────┴─────┴──────┴────────┴───────────╯

╭─────────────────────────────────────────────────────────────────────────────────╮
│                                Cluster Overview                                 │
├──────────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster      │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├──────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ test-cluster │ 3          │ 0                 │ 0                 │ 1           │
├──────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│              │ 3          │ 0                 │ 0                 │ 1           │
╰──────────────┴────────────┴───────────────────┴───────────────────┴─────────────╯
root@nats3:~# curl -s 127.0.0.1:8222/jsz |jq .meta_cluster.cluster_size
2

Let me know if you need extra information or any tests I should run.

ripienaar commented 3 years ago

If you run your servers in debug mode, do you see any logs like Adjusting JetStream cluster etc?
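
(Debug logging can be enabled either with the -DV flags or via the config; a sketch, with a placeholder config path:)

# start the server with debug and trace logging
nats-server -c /etc/nats-server.conf -DV

# or enable it in the config file
debug: true
trace: true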

lesovsky commented 3 years ago

No, I found nothing similar to this. I collected per-server debug logs and put them on Google Drive; maybe you can find something useful there. In that setup, nats1 and nats2 show cluster_size=2, and nats2 is the leader.

ripienaar commented 3 years ago

OK, I do see it's doing some dynamic peer gathering/sizing, which my clusters do not do, and no doubt the bug is in there.

@derekcollison what circumstances would make it log 'initial peers' and then gather peer state from the leader etc? My own clusters do not do this when I start them, so I think this is doing some dynamic sizing? Maybe an accounting bug there. Could it be because the routes do not list all servers?

@lesovsky please add all 3 servers to all route blocks on all servers.

nats1.out:Nov 09 09:16:55 nats1 nats-server[2871]: JetStream cluster checking for stable cluster name and peers
nats1.out:Nov 09 09:16:55 nats1 nats-server[2871]: JetStream cluster initial peers: [RztkeQup]
nats1.out:Nov 09 09:16:55 nats1 nats-server[2871]: RAFT [RztkeQup - _meta_] Update peers from leader to map[RztkeQup:0xc00015fe90 SRLRpmYS:0xc0003bbe60]
nats2.out:Nov 09 09:16:55 nats2 nats-server[2871]: JetStream cluster checking for stable cluster name and peers
nats2.out:Nov 09 09:16:55 nats2 nats-server[2871]: JetStream cluster initial peers: [SRLRpmYS]
nats3.out:Nov 09 09:16:55 nats3 nats-server[2870]: JetStream cluster checking for stable cluster name and peers
nats3.out:Nov 09 09:16:55 nats3 nats-server[2870]: JetStream cluster initial peers: [fvTBnQC7]
nats3.out:Nov 09 09:16:55 nats3 nats-server[2870]: RAFT [fvTBnQC7 - _meta_] Update peers from leader to map[RztkeQup:0xc0001677a0 SRLRpmYS:0xc000167000]

lesovsky commented 3 years ago

please add all 3 servers to all route blocks on all servers.

Defined all hosts in the routes (on all servers) and now cluster_size is the same everywhere and equal to 3. Repeated the test and got the same result.

Hmm... I thought the local server shouldn't be specified in the routes, but it seems that assumption is wrong?

ripienaar commented 3 years ago

It's valid to specify it, it will ignore it and not try to connect to it.

JS will take the signal about the meta group size from the routes though, so that's why it went into trying to figure this out dynamically.

We might be able to handle the case where the server itself isn't in the route list as an implied signal to the JS Raft layer to include itself in the list? wdyt @derekcollison
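
(For reference, the route block that resolved this lists all three servers, including the local one; a sketch based on the config shown earlier, with placeholder credentials:)

routes: [
   "nats-route://admin:password@nats1:5222"
   "nats-route://admin:password@nats2:5222"
   "nats-route://admin:password@nats3:5222"
]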

derekcollison commented 1 year ago

Closing for now, feel free to re-open if needed.