Closed velimir closed 4 years ago
Thank you taking the time to investigate a few different options :) My only concern about this is that it would increase memory footprint per federation link.
So in the alternative solution, would we have two extra processes per link?
Yes, the alternative solution would require 2 additional processes: upstream channel for commands and the “cmd” process. This however will even further reduce delay introduced by binding operations (as they will be dispatched completely asynchronously). Construction of the operation itself might be costly. Binding op requires a cluster name. If federation link is running on a node that is not a part of a cluster this results in hostname resolution, which is not instantaneous. Hostname resolution on the other hand can be resolved by simple caching.
Are there any numbers you can share around the latency improvements? I certain don't think that we should be reinventing hostname resolver functionality such as result caching. Two processes per link can introduce a fairly substantial amount of overhead.
Here are some numbers.
Federated exchange between node A1 (upstream) and B1: federated
.
{
"ack-mode": "no-ack",
"bind-nowait": true,
"prefetch-count": 10000,
"trust-user-id": true,
"uri": "amqp://usr:pwd@A1"
}
Producer sends messages to federated exchange on node A1.
Consumer receives messages on B1 that are routed to a queue fedq
node B1.
In the middle of the process a client connects to a node B1 and creates 9000 bindings for exchange federated
to a queue test-q
.
diff --git a/src/rabbit_nodes.erl b/src/rabbit_nodes.erl
index b363ed4f0..6d3d7fb3c 100644
--- a/src/rabbit_nodes.erl
+++ b/src/rabbit_nodes.erl
@@ -73,7 +73,16 @@ is_process_running(Node, Process) ->
cluster_name() ->
rabbit_runtime_parameters:value_global(
- cluster_name, cluster_name_default()).
+ cluster_name, cached(default_cluster_name, fun cluster_name_default/0)).
+
+cached(Key, Fun) ->
+ case get (Key) of
+ undefined ->
+ Val = Fun(),
+ put(Key, Val);
+ Val ->
+ Val
+ end.
cluster_name_default() ->
{ID, _} = parts(node()),
There's still a little blip on the last picture, that can be avoided using the alternative approach that I mentioned.
rabbit_nodes:cluster_name_default/0
profiling results:
{[{{rabbit_nodes,cluster_name_default,0}, 11072,27881.509, 59.245},
{undefined, 0, 0.004, 0.000}],
{ {rabbit_net,hostname,0}, 11072,27881.513, 59.245}, %
[{{inet,gethostbyname,1}, 11072,21261.976, 47.657},
{{inet,gethostname,0}, 11072, 6560.262, 96.580},
{{rabbit_nodes,make,1}, 1, 0.026, 0.002},
{{inet,gethostbyname_tm_native,4}, 0, 0.004, 0.002}]}.
it's actually doesn't matter if node is in the cluster or node, the default cluster name is called regardless.
cluster_name() ->
rabbit_runtime_parameters:value_global(
cluster_name, cluster_name_default()).
@michaelklishin one more thing... the small blip here is small only because the message rate is small and the cost of the channel call is relatively cheap as this was tested on a localhost. A delay of a network round-trip introduced on a process handing thousands of messages per second would be much more noticeable. This makes me think that 2nd option would perform much better for which I’m happy to submit a PR.
Hey @michaelklishin is there anything we can do here to make a progress on this PR?
If a separate process is a no-go from your perspective we can leave it as is. Also to lower an impact for current users, we can change it so that additional channel is created only if "bind-nowait": true
is set for the federation. Otherwise there's no much difference. With nowait
option amqp_channel
returns to the caller as soon as bind message is casted to the rabbit_writer
, therefore the flow of messages will only be interrupted for a time of: <N of bindings> * (<gen_server call> + <exchange process bindings state update>)
, which should be a relatively fast.
Both approaches make sense but I'm concerned that up to two extra processes per link can be problematic since all links currently reside on a single host. Our team will have to discuss this.
The rabbit_nodes:cluster_name_default/0
hostname caching part makes sense but we should have a reasonable expiration period, e.g. 30 to 60 seconds.
@michaelklishin thanks, please let me know if I can be in any help here.
FTR, this has not been forgotten.
Thank you @michaelklishin :)
Backported to v3.8.x
.
Thank you for addressing the comments @michaelklishin! Unfortunately didn't have much time this week.
I backported this to v3.7.x
and there may be a 3.7.26
release that would include it.
Proposed Changes
Using
bind-nowait
and separate channel for bindings operations allows to avoid message forwarding delays on the federation link.During routing changes, exchange federation process is getting blocked waiting upstream to update bindings. This can cause a relatively big latency on the message flow.
Types of Changes
Checklist
CONTRIBUTING.md
documentFurther Comments
Another option to implement this (and I'm leaning towards it) is to have a dedicated process (ex
rabbit_federation_exchange_link_cmd
), that would be responsible for bindings tracking.rabbit_federation_exchange_link
would forward bind/unbind commands to therabbit_federation_exchange_link_cmd
process to avoid blocking itself.What do you think?