nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

[FIXED] LeafNode's queue group load balancing and Sublist.NumInterest #5982

Closed kozlovic closed 1 month ago

kozlovic commented 1 month ago

While writing the test, I needed to make sure that each server in the hub had registered interest for 2 queue subscribers from the same group. I noticed that Sublist.NumInterest() (which I was invoking from Account.Interest()) was returning 1, even after I knew that the propagation should have happened. It turns out that NumInterest() was returning the number of queue groups, not the number of queue subs across all those queue groups.
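
To illustrate the difference between counting queue groups and counting queue subscribers, here is a minimal sketch. The `subscription` and `result` types are simplified stand-ins written for this example (their field names follow the `res.psubs` / `res.qsubs` naming used later in this thread), and `countGroups` / `countSubs` are hypothetical helpers, not the server's actual functions.

```go
package main

import "fmt"

// subscription is a simplified, hypothetical stand-in for a server subscription.
type subscription struct {
	subject string
	queue   string
}

// result is an assumed shape for a sublist match: plain subs in one slice,
// queue subs grouped into one inner slice per queue group.
type result struct {
	psubs []*subscription
	qsubs [][]*subscription
}

// countGroups counts every plain sub once and every queue GROUP once,
// no matter how many members the group has.
func countGroups(r *result) int {
	return len(r.psubs) + len(r.qsubs)
}

// countSubs counts every plain sub and every individual queue subscriber.
func countSubs(r *result) int {
	n := len(r.psubs)
	for _, group := range r.qsubs {
		n += len(group)
	}
	return n
}

func main() {
	// Two queue subscribers on the same subject in the same queue group.
	r := &result{
		qsubs: [][]*subscription{{
			{subject: "foo", queue: "workers"},
			{subject: "foo", queue: "workers"},
		}},
	}
	fmt.Println("groups:", countGroups(r))
	fmt.Println("subs:", countSubs(r))
}
```

With two members in a single group, the group count and the subscriber count differ, which is the mismatch described above.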

For the leafnode queue balancing issue, the code was favoring local/routed queue subscriptions, so in the described issue, the message would always go from HUB1->HUB2->LEAF2->QSub instead of HUB1->LEAF1->QSub.

We already had another test with a roughly reversed setup: a HUB with LEAF1<->LEAF2 connecting to it, a qsub on both HUB and LEAF1, and requests originating from LEAF2. That test expects all responses to come from LEAF1 (instead of the responder on HUB). To keep both tests working, I went with the following approach:

If the message originates from a client that connects to a server that has a connection from a remote LEAF, then we pick that LEAF the same as if it was a local client or routed server. However, if the client connects to a server that has a leaf connection to another server, then we keep track of the sub but do not send to that one if we have local or routed qsubs.

This makes the 2 tests pass, solving the new test and maintaining the behavior for the old test.
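
The selection rule described above can be sketched roughly as follows. This is an illustration only, not the server's code: the `kind` constants, the `qsub` type and the `pick` function are hypothetical names, and a simple index stands in for whatever distribution the server really uses when choosing among eligible subscribers.

```go
package main

import "fmt"

// kind describes how a queue subscriber is reachable from this server.
// All names here are hypothetical and exist only for this sketch.
type kind int

const (
	localClient   kind = iota // client connected directly to this server
	routed                    // subscriber behind a route to a clustered server
	leafAccepted              // a remote LEAF connected to this server
	leafSolicited             // this server made a leaf connection to another server
)

type qsub struct {
	name string
	kind kind
}

// pick chooses one member of a queue group. Subscribers behind a leaf
// connection that this server made to another server are remembered as a
// fallback and used only when no other subscriber exists. n is assumed to
// be a non-negative counter that rotates through the eligible members.
func pick(group []qsub, n int) (qsub, bool) {
	var preferred, fallback []qsub
	for _, s := range group {
		if s.kind == leafSolicited {
			fallback = append(fallback, s)
		} else {
			preferred = append(preferred, s)
		}
	}
	switch {
	case len(preferred) > 0:
		return preferred[n%len(preferred)], true
	case len(fallback) > 0:
		return fallback[n%len(fallback)], true
	}
	return qsub{}, false
}

func main() {
	// New test, seen from HUB1: HUB2 is routed, LEAF1 connected to HUB1.
	hub1 := []qsub{{"via-HUB2", routed}, {"via-LEAF1", leafAccepted}}
	// Old test, seen from LEAF2: LEAF1 is routed, HUB is reached over
	// the leaf connection LEAF2 made to it.
	leaf2 := []qsub{{"via-HUB", leafSolicited}, {"via-LEAF1", routed}}

	for n := 0; n < 2; n++ {
		a, _ := pick(hub1, n)
		b, _ := pick(leaf2, n)
		fmt.Println(a.name, b.name)
	}
}
```

In the first group both members are eligible, so LEAF1 gets its share of messages. In the second group the responder reached over the outgoing leaf connection is skipped as long as a routed one exists.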

Resolves #5972

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>

kozlovic commented 1 month ago

@neilalexander I believe there was an issue with Sublist.NumInterest for queue subs since it looked like it was simply counting the number of groups, not the total number of queue subscriptions. Let me know if I misunderstood the intent.

@derekcollison Please review the PR description and see if the choice I made is ok.

kozlovic commented 1 month ago

You can review the first commit for the leafnode/sublist issues. The second is simply a bunch of missing "defer nc.Close()" and the like.

neilalexander commented 1 month ago

Something else that's just occurred to me is that NumInterest() was never backported into 2.10.x, so if there's a problem on those versions too (as opposed to just on main), it's probably because Account.Interest() does len(res.psubs) + len(res.qsubs).

@derekcollison Don't know whether we want to cherry-pick NumInterest() into 2.10.x and apply this on top, or if we want to raise a separate PR against the release/v2.10.22 branch to just fix Account.Interest()?

derekcollison commented 1 month ago

@neilalexander let's pull those into 2.10.22 from main once this lands.