Closed. maxboone closed this 3 weeks ago.
After a restart of the NATS cluster and all publishers & subscribers yesterday, we noticed that there is an increase in missing messages. For example:
~ cat << EOF | python3
import requests

# List the asset-event-publisher connections reported by each server's /connz endpoint
for i in range(0, 5):
    conns = requests.get(f"http://localhost:{8223 + i}/connz").json().get("connections")
    for c in [x for x in conns if "asset-event-publisher" in x.get("name", "")]:
        print(f"nats-{i}: {c.get('name')}")
EOF
nats-2: asset-event-publisher-6c9bccf85c-5lzr4
nats-3: asset-event-publisher-6c9bccf85c-scp8v
nats-3: asset-event-publisher-6c9bccf85c-j9c2j
nats-4: asset-event-publisher-6c9bccf85c-pfhzh
The asset-event publishing application is pushing messages to servers 2, 3, and 4, and the route-check script mentioned earlier returns:
no route from nats-2 to nats-4: asset.*.*.event.*.*.*.*.a.*
no route from nats-2 to nats-4: asset.*.*.event.>
no route from nats-2 to nats-4: asset.*.*.event.>
no route from nats-3 to nats-1: asset.*.*.event.*.*.*.*.a.*
no route from nats-3 to nats-1: asset.*.*.event.*.*.*.*.a.*
no route from nats-3 to nats-1: asset.*.*.event.>
What does nats server ls report? This needs to be bound to the system context.
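For reference, assuming a nats CLI context that is bound to the system account (the context name here is just an example), that would be something like:

nats --context system server ls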
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Server Overview │
├────────┬────────────┬──────┬─────────┬─────┬───────┬───────┬────────┬─────┬─────────┬───────┬───────┬──────┬───────────┬─────┤
│ Name │ Cluster │ Host │ Version │ JS │ Conns │ Subs │ Routes │ GWs │ Mem │ CPU % │ Cores │ Slow │ Uptime │ RTT │
├────────┼────────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼───────────┼─────┤
│ nats-2 │ production │ 0 │ 2.10.18 │ yes │ 13 │ 1,029 │ 14 │ 0 │ 30 MiB │ 4 │ 4 │ 0 │ 12h40m12s │ 2ms │
│ nats-3 │ production │ 0 │ 2.10.18 │ yes │ 15 │ 1,023 │ 15 │ 0 │ 75 MiB │ 17 │ 4 │ 0 │ 12h43m16s │ 2ms │
│ nats-1 │ production │ 0 │ 2.10.18 │ yes │ 1 │ 955 │ 15 │ 0 │ 74 MiB │ 10 │ 4 │ 0 │ 12h18m25s │ 2ms │
│ nats-0 │ production │ 0 │ 2.10.18 │ yes │ 8 │ 1,016 │ 16 │ 0 │ 71 MiB │ 25 │ 4 │ 0 │ 12h35m19s │ 7ms │
│ nats-4 │ production │ 0 │ 2.10.18 │ yes │ 1 │ 979 │ 14 │ 0 │ 33 MiB │ 0 │ 4 │ 0 │ 12h18m3s │ 7ms │
├────────┼────────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼───────────┼─────┤
│ │ 1 │ 5 │ │ 5 │ 38 │ 5,002 │ X │ │ 282 MiB │ │ │ 0 │ │ │
╰────────┴────────────┴──────┴─────────┴─────┴───────┴───────┴────────┴─────┴─────────┴───────┴───────┴──────┴───────────┴─────╯
╭───────────────────────────────────────────────────────────────────────────────╮
│ Cluster Overview │
├────────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ production │ 5 │ 0 │ 0 │ 38 │
├────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ │ 5 │ 0 │ 0 │ 38 │
╰────────────┴────────────┴───────────────────┴───────────────────┴─────────────╯
Thanks, so this clearly indicates the cluster is not properly formed. Note the Routes column: it should be the same number for all servers in a properly formed cluster, which is a full-mesh, one-hop topology. Since we default to 4 routes per server pair, that number should be 16 for all (4 peers × 4 routes each).
So next step would be to review the config files, specifically the cluster section.
If the cluster is not properly formed, that would explain the partial data delivery for subscriptions etc.
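As a cross-check of the Routes column, each server's /routez monitoring endpoint reports the number of routes it currently holds. A minimal sketch in the same style as the connz script above, assuming the same localhost port layout (8223 + i):

import requests

# Print the number of active routes reported by each server's /routez endpoint.
# A fully meshed 5-node cluster with the default 4 routes per peer should show 16.
for i in range(0, 5):
    routez = requests.get(f"http://localhost:{8223 + i}/routez").json()
    print(f"nats-{i}: {routez.get('num_routes')} routes")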
Ah, thank you so much. My hunch, then, is that the cause lies in how we configure the cluster, which is as follows:
cluster {
  name: "production"
  listen: 0.0.0.0:6222

  # Authorization for cluster connections
  authorization {
    user: "nats_cluster"
    password: $CLUSTER_PASSWORD
    timeout: 1
  }

  # Routes are actively solicited and connected to from this server.
  # Other servers can connect to us if they supply the correct credentials
  # in their routes definitions from above.
  routes = [
    nats://nats_cluster:<snip>@nats-headless:6222
  ]
}
Where nats-headless is a Kubernetes service that selects all five NATS pods and returns their IPs. Could that mean that during start-up of a NATS pod where we're seeing a lack of routes, the Kubernetes service does not return the full list of pods? Meaning, we should rather set the routes to [ pod-0, pod-1, ..., pod-4 ] rather than a single hostname that resolves to all the IPs?
I am not too well versed in K8S stuff. Usually we recommend that all configs be explicit about all members (including themselves) in the routes definitions.
Looping in @wallyqs from our side, who is better versed in K8S.
Meaning, we should rather set the routes to [ pod-0, pod-1, ..., pod-4 ] rather than a single hostname that resolves to all the IPs?
Yes, NATS won't take multiple IPs from a single DNS A/AAAA response and connect to all of them; it will just pick one for each route entry and go with that. So you will need to explicitly list each pod.
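For illustration only, assuming the pods belong to a StatefulSet named nats governed by the nats-headless service (so each pod gets a stable DNS name of the form pod.service), the explicit routes would look roughly like this:

routes = [
  nats://nats_cluster:<snip>@nats-0.nats-headless:6222
  nats://nats_cluster:<snip>@nats-1.nats-headless:6222
  nats://nats_cluster:<snip>@nats-2.nats-headless:6222
  nats://nats_cluster:<snip>@nats-3.nats-headless:6222
  nats://nats_cluster:<snip>@nats-4.nats-headless:6222
]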
Well, in that case, my apologies for raising this bug, as this is obviously a configuration mishap on our side.
We've been running the Bitnami chart for NATS for a while (will do a PR there to fix this) and did make our own modifications with respect to unhealthy pods not being resolved by the service.
I guess this is a good moment to do a migration to the NATS k8s charts. Thanks!
You should take a look at our Helm chart for running a NATS system instead of the Bitnami one.
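For reference, the official chart is published from the nats-io/k8s repository; installing it is roughly:

helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm repo update
helm install nats nats/nats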
Observed behavior
Using core NATS, we notice that we are missing part of the messages on our subscriptions. Our hunch is that this happens when messages are published to one of the five nodes in the cluster while the subscription is connected to a node that those messages never reach.
As most messages do go through, we checked whether there are any missing routes within the cluster; the route check returned the "no route from ..." lines shown above.
We are mainly wondering if this points towards routes not existing, or whether it is possible, without a defect, for these subjects not to be routed between these servers (i.e. if nats-4 has no one publishing messages on that subject).
Expected behavior
Messages are received throughout the cluster, and there is no split-cluster situation. I'm aware that it is an at-most-once delivery system, but this seems like a defect rather than a couple of messages being missed while NATS figures out the routing.
Server and client version
server: 2.10.18
client (go): v1.37.0
Host environment
Kubernetes cluster on EKS, with dedicated nodes for NATS.
Steps to reproduce
No response
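No reproduction steps were given in the issue; purely as an illustration of the symptom, here is a minimal sketch using the nats-py client (the report used the Go client; Python is shown to match the script above, and the server URLs are placeholders), subscribing on one cluster member and publishing on another:

import asyncio

import nats  # nats-py client, a stand-in for the Go client used in the report


async def main():
    # Placeholder URLs: subscribe on one cluster member, publish on another.
    sub_nc = await nats.connect("nats://nats-4:4222")
    pub_nc = await nats.connect("nats://nats-2:4222")

    received = asyncio.Event()

    async def handler(msg):
        print(f"received on {msg.subject}")
        received.set()

    await sub_nc.subscribe("asset.test.event", cb=handler)
    await sub_nc.flush()      # ensure the server has registered the subscription
    await asyncio.sleep(0.5)  # give interest a moment to propagate across routes

    await pub_nc.publish("asset.test.event", b"ping")
    await pub_nc.flush()

    try:
        await asyncio.wait_for(received.wait(), timeout=2)
    except asyncio.TimeoutError:
        # With healthy routes this should not happen, regardless of which
        # servers the two clients are connected to.
        print("message was not delivered across the cluster")

    await sub_nc.close()
    await pub_nc.close()


asyncio.run(main())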