nats-io / nats.java

Java client for NATS
Apache License 2.0
563 stars 153 forks source link

Uneven connections rebalance with `forceReconnect` #1184

Closed ajax-koval-d closed 1 month ago

ajax-koval-d commented 1 month ago

Observed behavior

After nats cluster upgrade, we end up with nats connections distributed unevenly between different nodes. To fix this, we want to use forceReconnect.

Problem here is that node resolution in forceReconnect is not fair, which becomes quite evident in cluster with two nats server nodes. Nats cluster upgrade leaves all connection hanging on one node, and then after triggering forceReconnect on each connection, all connection just reconnect on another node, instead of distributing evenly.

As I understand, reason for this is that currently connected server is guaranteed to be at the end of list in a NatsServerPool, meaning, that client will not try to reconnect to the currently connected server.

To fix this I'd propose to shuffle list in NatsServerPool inside of forceReconnect implementation.

Expected behavior

After force reconnect all connections are distributed evenly.

Server and client version

nats-server: v2.10.14 nats.java client: 2.9.1

Host environment

No response

Steps to reproduce

  1. Run nats cluster with 2 servers
  2. Open at least 2 connections to the servers with ignoreDiscoveredServers=true Next, simulating nats server upgrade
  3. Turn off server 1
  4. Wait at least 5 sec
  5. Turn on server 1
  6. Wait until server starts
  7. Turn off server 2
  8. Wait at least 5 sec
  9. Turn on server 2
  10. Wait until servers starts At this point all connections will be on server 1
  11. Trigger forceReconnect on all opened connections

Here's the test case (testRebalanceWithForceReconnect) that demonstrates the issue.

scottf commented 1 month ago

By default it's shuffled and by default on reconnect, the current server is put on the bottom after shuffle.

As it turns out, you can provide your own complete ServerPool implementation and pass it to Options via the serverPool(...) builder method.

If you provide your own implementation, you can extend the NatsServerPool as a starting point. If you make changes that work for you, you are welcome to offer a PR with changes for the system. If you do that, please make sure that the current behavior is the default and you can only modify the behavior in the default implementation with connection Options. The other thing I suppose we could do is provide a full different implementation in our Orbit extension library.

ajax-koval-d commented 1 month ago

@scottf, Thanks for quick response!

I'd like to clarify a bit my idea: what I want is to call Collections.shuffle in a middle of a forceReconnect, right before reconnectImpl, but without putting current server on the bottom, like this. I don't want to change NatsServerPool or reconnectImpl logic in any way, so standard behavior of reconnect, when connection, for example, disconnected due to network issues will be kept intact - currently connected server will be last on the list.

I believe in that case there's no need to introduce different option, because it will only affect forceReconnect call, and to me it seems strange to opt in for unfair distribution here. But if you feel strong about making a separate option, I can provide new implementation of ServerPool and add a new method, shuffle here, which will be NOOP in NatsServerPool.

In any case, I'd be glad to contribute.

By the way, at first I thought about extending NatsServerPool in my project, but unfortunately, listLock is private in here, and getServerList returns a copy, making it essentially impossible to reuse that class.

scottf commented 1 month ago

Ok here is the change to NatsServerPool to make it easier to extend. https://github.com/nats-io/nats.java/pull/1185