nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Partial data on subscriptions on some servers in cluster #6051

Closed maxboone closed 3 weeks ago

maxboone commented 3 weeks ago

Observed behavior

Using core NATS, we notice that we are missing part of the messages on our subscriptions. Our hunch is that this happens when a message is published to one of the five nodes in the cluster while the subscription is connected to a node to which that message is not routed.

As most messages do go through, we checked whether any routes are missing within the cluster and see the following:

# /subsz endpoint for each server
for sink, subs in monitoring_subsz.items():
    for sub in subs:
        subject = sub.get("subject")
        if subject.startswith("$") or subject.startswith("_INBOX"):
            continue

        # /routez endpoint for each server
        for source, routes in monitoring_routez.items():
            if source == sink:
                continue
            exists = any(
                subject in route.get("subscriptions_list", [])
                and route.get("remote_name") == sink
                for route in routes
            )
            if not exists:
                print(f"no route from {source} to {sink}\n{sub}\n")

Returning:

no route from nats-4 to nats-3
{'account': '$G', 'subject': 'mutation.update.transaction.*', 'qgroup': 'api_transaction_update_mutation', 'sid': '2', 'msgs': 9, 'cid': 287}
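For anyone wanting to reproduce the check, here is a self-contained version of the script above with the monitoring payloads stubbed out. The sample data below is illustrative, not taken from the real cluster:

```python
# monitoring_subsz maps server name -> subscriptions from that server's
# /subsz endpoint; monitoring_routez maps server name -> routes from /routez.
# The stub data below is made up for illustration only.

def missing_routes(monitoring_subsz, monitoring_routez):
    """Return (source, sink, subject) triples where a subscription on `sink`
    does not appear in the route subscription list of any other server."""
    missing = []
    for sink, subs in monitoring_subsz.items():
        for sub in subs:
            subject = sub.get("subject", "")
            # Skip system and inbox subjects, as in the original script.
            if subject.startswith("$") or subject.startswith("_INBOX"):
                continue
            for source, routes in monitoring_routez.items():
                if source == sink:
                    continue
                exists = any(
                    subject in route.get("subscriptions_list", [])
                    and route.get("remote_name") == sink
                    for route in routes
                )
                if not exists:
                    missing.append((source, sink, subject))
    return missing

subsz = {"nats-3": [{"subject": "mutation.update.transaction.*"}]}
routez = {
    "nats-3": [],
    "nats-4": [{"remote_name": "nats-3", "subscriptions_list": []}],
}
print(missing_routes(subsz, routez))
# -> [('nats-4', 'nats-3', 'mutation.update.transaction.*')]
```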

We are mainly wondering whether this points to routes genuinely missing, or whether it is possible, without any defect, for these subjects not to be routed between these servers (e.g. if no client connected to nats-4 publishes on that subject).

Expected behavior

Messages are received throughout the cluster, and there is no split-brain situation. I'm aware that core NATS is an at-most-once delivery system, but this looks like a defect rather than a couple of messages being dropped while NATS figures out the routing.

Server and client version

server: 2.10.18
client (go): v1.37.0

Host environment

k8s cluster on EKS, specific nodes for NATS.

Steps to reproduce

No response

maxboone commented 3 weeks ago

After a restart of the NATS cluster and all publishers & subscribers yesterday, we noticed that there is an increase in missing messages. For example:

~ cat << EOF | python3
import requests

for i in range(0, 5):
    conns = requests.get(f"http://localhost:{8223 + i}/connz").json().get("connections")
    for c in [x for x in conns if "asset-event-publisher" in x.get("name")]:
        print(f"nats-{i}: {c.get('name')}")
EOF
nats-2: asset-event-publisher-6c9bccf85c-5lzr4
nats-3: asset-event-publisher-6c9bccf85c-scp8v
nats-3: asset-event-publisher-6c9bccf85c-j9c2j
nats-4: asset-event-publisher-6c9bccf85c-pfhzh

The asset-event publishing application is pushing messages to servers 2, 3 and 4, and the script from earlier reports:

no route from nats-2 to nats-4: asset.*.*.event.*.*.*.*.a.*
no route from nats-2 to nats-4: asset.*.*.event.>
no route from nats-2 to nats-4: asset.*.*.event.>
no route from nats-3 to nats-1: asset.*.*.event.*.*.*.*.a.*
no route from nats-3 to nats-1: asset.*.*.event.*.*.*.*.a.*
no route from nats-3 to nats-1: asset.*.*.event.>

derekcollison commented 3 weeks ago

What does nats server ls report? This needs to be bound to the system context.

maxboone commented 3 weeks ago

What does nats server ls report? This needs to be bound to the system context.

╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                      Server Overview                                                         │
├────────┬────────────┬──────┬─────────┬─────┬───────┬───────┬────────┬─────┬─────────┬───────┬───────┬──────┬───────────┬─────┤
│ Name   │ Cluster    │ Host │ Version │ JS  │ Conns │ Subs  │ Routes │ GWs │ Mem     │ CPU % │ Cores │ Slow │ Uptime    │ RTT │
├────────┼────────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼───────────┼─────┤
│ nats-2 │ production │ 0    │ 2.10.18 │ yes │ 13    │ 1,029 │     14 │   0 │ 30 MiB  │ 4     │     4 │    0 │ 12h40m12s │ 2ms │
│ nats-3 │ production │ 0    │ 2.10.18 │ yes │ 15    │ 1,023 │     15 │   0 │ 75 MiB  │ 17    │     4 │    0 │ 12h43m16s │ 2ms │
│ nats-1 │ production │ 0    │ 2.10.18 │ yes │ 1     │ 955   │     15 │   0 │ 74 MiB  │ 10    │     4 │    0 │ 12h18m25s │ 2ms │
│ nats-0 │ production │ 0    │ 2.10.18 │ yes │ 8     │ 1,016 │     16 │   0 │ 71 MiB  │ 25    │     4 │    0 │ 12h35m19s │ 7ms │
│ nats-4 │ production │ 0    │ 2.10.18 │ yes │ 1     │ 979   │     14 │   0 │ 33 MiB  │ 0     │     4 │    0 │ 12h18m3s  │ 7ms │
├────────┼────────────┼──────┼─────────┼─────┼───────┼───────┼────────┼─────┼─────────┼───────┼───────┼──────┼───────────┼─────┤
│        │ 1          │ 5    │         │ 5   │ 38    │ 5,002 │      X │     │ 282 MiB │       │       │    0 │           │     │
╰────────┴────────────┴──────┴─────────┴─────┴───────┴───────┴────────┴─────┴─────────┴───────┴───────┴──────┴───────────┴─────╯

╭───────────────────────────────────────────────────────────────────────────────╮
│                              Cluster Overview                                 │
├────────────┬────────────┬───────────────────┬───────────────────┬─────────────┤
│ Cluster    │ Node Count │ Outgoing Gateways │ Incoming Gateways │ Connections │
├────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│ production │          5 │                 0 │                 0 │          38 │
├────────────┼────────────┼───────────────────┼───────────────────┼─────────────┤
│            │          5 │                 0 │                 0 │          38 │
╰────────────┴────────────┴───────────────────┴───────────────────┴─────────────╯

derekcollison commented 3 weeks ago

Thanks, so this clearly indicates the cluster is not properly formed. Note the Routes column: in a properly formed cluster, which is a one-hop full mesh, that number should be the same for every server. Since we default to 4 routes per server pair, that number should be 16 for all.

So next step would be to review the config files, specifically the cluster section.

If the cluster is not properly formed, that would explain the partial data delivery for subscriptions etc.
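To make the arithmetic concrete, here is a small sketch of that check. The observed counts are copied from the `nats server ls` table above, and `routes_per_pair=4` is the default mentioned in this thread:

```python
# In a one-hop full mesh of n servers, each server peers with the other
# n - 1 servers, so it should report (n - 1) * routes_per_pair routes.

def expected_routes(n_servers, routes_per_pair=4):
    return (n_servers - 1) * routes_per_pair

# Routes column from the `nats server ls` output above.
observed = {"nats-0": 16, "nats-1": 15, "nats-2": 14, "nats-3": 15, "nats-4": 14}

want = expected_routes(5)  # 4 peers * 4 routes = 16
for name, got in sorted(observed.items()):
    if got != want:
        print(f"{name}: {got} routes, expected {want}")
```

Only nats-0 reaches the expected 16 here, which matches the diagnosis of a partially formed mesh.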

maxboone commented 3 weeks ago

So next step would be to review the config files, specifically the cluster section.

If the cluster is not properly formed, that would explain the partial data delivery for subscriptions etc.

Ah, thank you so much. My hunch, then, is that the cause lies in how we configure the cluster:

cluster {
  name: "production"
  listen: 0.0.0.0:6222
  # Authorization for cluster connections
  authorization {
    user: "nats_cluster"
    password: $CLUSTER_PASSWORD
    timeout:  1
  }
  # Routes are actively solicited and connected to from this server.
  # Other servers can connect to us if they supply the correct credentials
  # in their routes definitions from above
  routes = [
    nats://nats_cluster:<snip>@nats-headless:6222
  ]
}

Where nats-headless is a Kubernetes service that selects all five NATS pods and returns their IPs. Could it be that, during start-up of a NATS pod, the Kubernetes service does not return the full list of pods, and that this is why we are seeing missing routes?

Meaning, we should set the routes to [ pod-0, pod-1, ..., pod-4 ] rather than to a single hostname that resolves to all IPs?
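For concreteness, a sketch of what that explicit form could look like, assuming the pods are addressable as `nats-0.nats-headless` through `nats-4.nats-headless` (the usual per-pod DNS names for a StatefulSet behind a headless service; adjust to your actual naming):

```
cluster {
  name: "production"
  listen: 0.0.0.0:6222
  authorization {
    user: "nats_cluster"
    password: $CLUSTER_PASSWORD
    timeout: 1
  }
  # One explicit route per pod. Each server can list itself too;
  # a route to oneself is detected and ignored.
  routes = [
    nats://nats_cluster:<snip>@nats-0.nats-headless:6222
    nats://nats_cluster:<snip>@nats-1.nats-headless:6222
    nats://nats_cluster:<snip>@nats-2.nats-headless:6222
    nats://nats_cluster:<snip>@nats-3.nats-headless:6222
    nats://nats_cluster:<snip>@nats-4.nats-headless:6222
  ]
}
```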

derekcollison commented 3 weeks ago

I am not too well versed in K8S stuff. Usually we recommend all configs be explicit about all members (including themselves) in the routes definitions.

Looping in @wallyqs from our side who is more well versed in K8S.

neilalexander commented 3 weeks ago

Meaning, we should rather set the routes to [ pod-0, pod-1, ..., pod-4 ] rather than a single hostname that resolves all IPs?

Yes, NATS won't take multiple IPs from a single DNS A/AAAA response and connect to all of them, it will just pick one for each line item and go with that. So you will need to explicitly list each pod.
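For illustration, such an explicit list can be generated mechanically, e.g. in a Helm/templating step. The pod and service names below are the ones assumed above, and the credentials are elided as in the posted config:

```python
# Build one explicit route URL per StatefulSet pod instead of relying on
# a single headless-service hostname. Pod naming (nats-0.nats-headless,
# etc.) is an assumption about this deployment, not something NATS mandates.

def route_urls(replicas, service="nats-headless", user="nats_cluster", port=6222):
    return [
        f"nats://{user}:<snip>@nats-{i}.{service}:{port}"
        for i in range(replicas)
    ]

for url in route_urls(5):
    print(url)
```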

maxboone commented 3 weeks ago

Well, in that case, my apologies for raising this bug as this is obviously a configuration mishap on our side.

We've been running the bitnami chart for NATS for a while (will do a PR there to fix this) and did make our own modifications w.r.t. unhealthy pods not being resolved by the service.

I guess this is a good moment to do a migration to the NATS k8s charts. Thanks!

derekcollison commented 3 weeks ago

You should take a look at our helm chart to run a NATS system vs the Bitnami one.

https://github.com/nats-io/k8s/tree/main/helm/charts/nats