nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Feature request: improve user story for NATS route monitoring #998

Open ricbartm opened 5 years ago

ricbartm commented 5 years ago

Feature Requests

I'd like the NATS Server /routez endpoint to also return the number of configured (expected) routes.

Use Case:

We have several NATS clusters (clustered mode) of different sizes. Following the NATS recommendations for Raft consensus, each cluster has an odd number of nodes, usually 3 or 5. Each node has a route to the other 2 or 4 nodes in the cluster.

We are trying to get visibility into nodes disconnecting from the cluster. We have the HTTP monitoring endpoint enabled and we query http://localhost:8222/routez with curl.

This endpoint exposes num_routes, the number of connected routes, as well as routes, a list with the details of each connected route.

What I am missing is a straightforward way to compare the number of routes in the current configuration with the number of active routes. I find this useful for:

Proposed Change:

Return a new key configured_routes (or similar) holding the number of entries in the routes section of the config file. This would not break existing integrations, and it exposes new information. Also, each routes[] element would get a new boolean key is_connected that reflects the route's status. Routes would always be returned, rather than only the active ones.

{
  "server_id": "qXXyvd0Gj6VCRSBsuQ1pQB",
  "now": "2019-05-17T10:51:43.052369911Z",
  "configured_routes": 1,
  "num_routes": 0,
  "routes": [
    {
      "rid": 18121,
      "remote_id": "ntLEv93GBlR8t2WQUsaIjq",
      "did_solicit": true,
      "is_configured": true,
      "is_connected": false,
      "ip": "REDACTED",
      "port": 51460,
      "pending_size": 0,
      "in_msgs": 37256,
      "out_msgs": 37256,
      "in_bytes": 2198248,
      "out_bytes": 6184594,
      "subscriptions": 4
    }
  ]
}
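
As a rough illustration of how a monitoring script might consume this proposed response, here is a minimal Python sketch. It assumes the configured_routes and is_connected keys exist exactly as proposed above, which the server does not return today. The helper name disconnected_routes and the trimmed SAMPLE payload are made up for the example.

```python
import json

# Trimmed copy of the proposed /routez response from this issue.
# "configured_routes" and "is_connected" are proposed keys, not current server output.
SAMPLE = """
{
  "configured_routes": 1,
  "num_routes": 0,
  "routes": [
    {"rid": 18121, "remote_id": "ntLEv93GBlR8t2WQUsaIjq", "is_connected": false}
  ]
}
"""


def disconnected_routes(routez):
    """Return the route entries whose proposed is_connected flag is false.

    A route without the flag is treated as connected, because the current
    endpoint only lists active routes.
    """
    return [r for r in routez.get("routes", []) if not r.get("is_connected", True)]


payload = json.loads(SAMPLE)
down = disconnected_routes(payload)
print("configured:", payload.get("configured_routes"), "active:", payload.get("num_routes"))
for route in down:
    print("route down:", route["remote_id"])
```

With a response shaped like this, an alert could fire whenever the returned list is non-empty, without the monitoring side having to know the cluster size.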

Who Benefits From The Change(s)?

Anybody that wants to have more visibility on the NATS routes.

Alternative Approaches

I'd be happy to hear about other ways to get visibility into sudden route disconnections, which may affect cluster health.

One option is applying a timeshift() function to the metric, but the alarm result then depends heavily on the earlier period we compare against, which affects its reliability. For example, if the route has been disconnected for 90 minutes and I compare the now() value with the value from 60 minutes ago, the two are the same, so the state would be considered "good" and the alarm would auto-resolve. Does that make sense? I think this approach also depends a lot on the monitoring solution used.
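
For comparison, a check against a fixed expected count does not depend on any earlier time window. The sketch below is only an illustration in Python: it reads num_routes from the existing /routez endpoint and compares it with an expected value that the operator has to supply, since the server does not report one. The names fetch_routez, missing_routes and route_alarm are hypothetical.

```python
import json
from urllib.request import urlopen

# Monitoring URL taken from this issue; adjust host and port for your setup.
ROUTEZ_URL = "http://localhost:8222/routez"


def fetch_routez(url=ROUTEZ_URL, timeout=5.0):
    """Fetch and decode the /routez JSON document."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def missing_routes(routez, expected):
    """Number of expected routes that are not currently connected (never negative)."""
    return max(expected - routez.get("num_routes", 0), 0)


def route_alarm(routez, expected):
    """True while fewer routes are connected than the operator expects."""
    return missing_routes(routez, expected) > 0
```

For a 3 node cluster each node would be checked with route_alarm(fetch_routez(), 2). The alarm stays raised for as long as the route is down, so it cannot auto-resolve the way the timeshift() comparison does, but the expected count has to be kept in sync with the config file by hand.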

ripienaar commented 5 years ago

I too would like this extra information.

But to clarify - NATS Server does not care how many nodes you have active, and NATS Servers disconnecting from each other has no bearing on the NATS Streaming Server.

It's the up and actively communicating NATS Streaming Server instances that should be an odd number, and it's THAT cluster size - desired and active - that you should monitor, not the underlying NATS Server connections.

ricbartm commented 5 years ago

@ripienaar thanks for the answer.

Yes, my proposal may have been heavily influenced by the fact that we run NATS Streaming, which contains a copy of NATS Server, so our number of NATS Server instances is exactly the same as our number of NATS Streaming instances. Other folks' mileage may vary, and what you said makes sense to me:

It's the up and actively communicating NATS Streaming Server instances that should be an odd number, and it's THAT cluster size - desired and active - that you should monitor, not the underlying NATS Server connections.

I've created the ticket in this project rather than nats-server-streaming because I understood from the code I saw that the endpoint and the route information belong to NATS Server.

Now that you have cleared up my misunderstanding (server != streaming), I have these questions:

Any additional guidance will be welcome. Thanks!

ripienaar commented 5 years ago

They may contain the NATS instance but they are separate, and a NATS Server disconnecting does not tell you the Streaming Server is disconnected, and vice versa. They are separate things - NATS is the ethernet your Streaming Server sits on.

Imagine you have a database cluster - the nodes communicate via ethernet but are unaware of the ethernet topology (multipathing, bridges, switches etc.) - the DB clustering only cares that the right number of DBs can communicate at a time, whatever happens to be going on with the ethernet.

In much the same way, Streaming sits on top of the communications fabric (NATS Server), and what you have to monitor in your case is whether the Streaming Servers are where you need them to be and whether there is the right number of them. Of course there is no harm in monitoring both - but Streaming has no clue about the underlying network.