Closed megwill4268 closed 1 year ago
Hello,
thanks for your contribution and sorry for the late reply!
You mentioned that you suffered problems with an old version of the plugin. Did you observe similar issues with a new one? The latest versions of the plugin already replicate the cache on multiple nodes hence mitigating the above problem.
Yes, I was able to reproduce the same issue with a manual test on the new version: I brought up a single node, declared the exchange, added two more nodes, and then took down the original node, which was the only one with the cache table declared on it.
My understanding of what I am seeing is that the dedup cache table in Mnesia is only created when the exchange is first declared, so any nodes that join later never get that table replicated to them.
Right, thanks for providing info on how to reproduce the issue.
The plugin clearly does not handle situations in which the cluster size changes or the nodes are entirely replaced. This is becoming a common issue with VM providers such as AWS EC2.
Your code change is legitimate, but the current implementation is problematic: we cannot afford to tax every `route` request for the sake of mitigating a rare issue.
Nevertheless, there is an easy fix for that. We can subscribe to events from the BEAM and react to new nodes joining via `net_kernel:monitor_nodes`.
Whenever we receive a `nodeup` notification, we check each cache's distribution across the cluster. If the cache is not distributed onto at least 2/3 of the nodes, we replicate it accordingly. Your logic does most of the job; we just need to remove the check on the `route` call and add the node monitoring logic.
You can add such logic in the `cache_manager` gen_server, which is where it belongs. Would you mind updating the PR, or would you prefer me to take over and complete it?
Ah, that is a much better solution. I don't mind trying to do it, but I don't know how to get the names of the cache tables to copy over. The only thing I have thought of so far (found by googling) was to use Mnesia.schema to get back the table info, parse it into table names, and then grab all the ones that start with "cache_". Does that sound like the right path, or is there an easier way to grab that info?
No need for that. The `cache_manager` tracks the list of caches in its own internal table: `message_deduplication_caches`. You can simply iterate over the content of that table.
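Iterating over that table could look something like the sketch below. The Mnesia calls (`:mnesia.dirty_all_keys/1`) are standard, but the assumption that the table keys are the cache names, and the replication helper, are illustrative.

```elixir
# Sketch: read all cache names from the manager's internal Mnesia table
# (the table name comes from the discussion above), then replicate each
# one as needed. Assumes the keys of the table are the cache names.
caches = :mnesia.dirty_all_keys(:message_deduplication_caches)

Enum.each(caches, fn cache ->
  # Hypothetical helper: check the cache's distribution and add
  # table copies on under-replicated nodes.
  replicate_if_needed(cache)
end)
```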
I didn't test it yet. Thank you for explaining so much.
On Sat, May 28, 2022, 10:06 AM Matteo Cafasso @.***> wrote:
@.**** requested changes on this pull request.
Added a few comments with explanations. I hope they help.
I guess you did not manage to test the logic, right?
So I made the changes you suggested and started testing, but it isn't working. I'm kind of stumped on what to do.
What kind of errors are you experiencing?
@megwill4268, do you plan to continue on this or shall I take over?
Hi Matteo,
I don't have time to work on this right now so you should probably take over.
I will merge this and continue from there. Thanks for your contribution!
A new release supporting cluster growth is now available: 0.6.0.
Please read the changelog carefully for the supported versions.
Currently, it only supports the addition of nodes. The reason is that we cannot infer whether a node is down for maintenance or due to a crash, or because it has been removed from the cluster. In other words, we have no guarantee that a node which is down will never come back up again.
Later on, we could implement a periodic check verifying that the cluster size is still the expected one. For now, I would like to observe how this version behaves for a while.
Thanks for your contribution!
We have been running into issues with cache tables not existing. It's the result of a combination of:
This created an interesting situation where the RabbitMQ broker was up and working, but messages were failing to go through the exchange (stack trace below).
This PR implements distribution of cache tables across nodes at routing time (the same nodes that would be returned by `cache_create`, minus those that already hold a copy of the cache table). This doesn't fix the above issue once it has happened, but it should reduce the probability of the occurrence. I am not a big fan of the path I took, though, because it performs the check on cache table distribution on every `route` call to the exchange. Being new to Elixir and MQ plugin development, I wasn't sure how to do this in a more elegant/efficient manner. I hoped that by raising this you might have ideas for a better implementation.
P.S. I did go down the route of trying to add a copy of the cache table on `insert` into the cache after the node had died. I found out the hard way that you can't create the table because it is still in the schema, and you can't delete the table because it doesn't exist. Trying to mimic this by stopping Mnesia and running `del_table_copy` doesn't delete the vestigial table either.
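For reference, replicating an existing table to a node is done with Mnesia's `add_table_copy`. A minimal sketch of the call and its outcomes, with the table name and storage type as placeholders:

```elixir
# Sketch: add a replica of an existing Mnesia table on the local node.
# :mnesia.add_table_copy/3 is a standard Mnesia call; :cache_example
# is a placeholder table name.
case :mnesia.add_table_copy(:cache_example, node(), :ram_copies) do
  {:atomic, :ok} ->
    # Replica created successfully.
    :ok

  {:aborted, {:already_exists, _table, _node}} ->
    # This node already holds a copy; nothing to do.
    :ok

  {:aborted, reason} ->
    # E.g. the table exists only in the schema because its last
    # replica went down with a dead node, as described above.
    {:error, reason}
end
```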