fish68 opened this issue 2 years ago
I think the culprit might lie in the cache being full. Currently, the `x-cache-size` parameter controls the maximum number of entries within the cache. When the cache overflows, an item is removed to make space for the new one. Ideally, the new item should replace the oldest item (LRU caching), but currently this is not the case. This is a limitation of the caching layer I am planning to fix in the future.

So what might be happening is that you have a full cache, a new item comes in and replaces a very recent one, resulting in what you observe.

To reduce the occurrence of such issues I would recommend increasing the `x-cache-size` parameter and seeing if it helps. Right now you have a value of 10000, and if you publish ~9000 unique messages you are pretty close to overflow. If there were old messages already in the cache, this might explain the issue.
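A minimal sketch of such a declaration, assuming the pika Python client and a locally running broker (connection details and the 10x value are illustrative; note that RabbitMQ refuses to redeclare an existing exchange with different arguments, so the exchange would need to be deleted or renamed first):

```python
import pika

# Connect to a local broker; adjust host and credentials as needed.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Declare the deduplicating exchange with a 10x larger cache.
# RabbitMQ rejects this call if an exchange with the same name but
# different arguments already exists.
channel.exchange_declare(
    exchange="ex-Gateway-1",  # exchange name from the thread
    exchange_type="x-message-deduplication",
    durable=True,
    arguments={
        "x-cache-size": 100000,  # was 10000
        "x-cache-ttl": 1000,     # milliseconds (see the correction below)
    },
)
connection.close()
```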
Hi, thank you for the very quick reply. Of course I can try to increase the size, but those 9000 messages were published over a period of 2 hours; I guessed the cache would start to remove the oldest and keep the newest messages. All my duplicated messages are published within a few milliseconds of each other.

My fear is that increasing the cache to a 10x value could solve the issue for the moment, but that I could face a similar issue after a few days of processing.
Nevermind what I just said above. I now recall the `x-cache-ttl` is in milliseconds and not seconds. Hence your entries have a 1 second TTL and not ~20 minutes.
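To make the unit mismatch concrete (a sketch, not from the thread):

```python
# x-cache-ttl is interpreted as milliseconds.
ttl_configured = 1000          # the value in use: 1 second
ttl_intended = 20 * 60 * 1000  # ~20 minutes would be 1200000 ms
```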
If I understand correctly, the timestamps above are generated on the publisher side. Am I correct?

If so, are you sure you are not affected by network-congestion-induced delays?
Consider the following sequence of events:

1. `A` creates message `A1` and timestamps it.
2. `A` publishes the message immediately.
3. `B` creates message `B1` and timestamps it.
4. `B` tries to publish the message but the network is congested and there is a delay.
5. The broker receives `A1`, forwards it and adds it to the cache.
6. The `A1` cache entry expires.
7. `B1` arrives; there is no cache entry for it. The message goes through.

One way you could try debugging that is by using the timestamp exchange to get the message timestamp when it reaches the broker. In this way you could spot if these are the cases which affect you. Obviously, there might still be delays between the time the broker receives the message and the moment it actually compares it with the duplicates cache. This might be due to a very busy broker.
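A hedged consumer-side sketch of such a check with pika, assuming the publisher puts its own epoch-millisecond timestamp in a hypothetical `published_at` header (header name and queue name are illustrative):

```python
import time
import pika

def on_message(channel, method, properties, body):
    # Publisher-side timestamp, assumed to be an epoch-ms header.
    published_ms = (properties.headers or {}).get("published_at")
    if published_ms is not None:
        delay_ms = int(time.time() * 1000) - published_ms
        # Transit delays approaching the 1000 ms x-cache-ttl would
        # explain duplicates slipping through.
        print(f"transit delay: {delay_ms} ms")
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_consume(queue="msgs", on_message_callback=on_message)
channel.start_consuming()
```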
I don't think it's that; I'm logging on the consumer side as well. The messages are published at the following timestamps:

2022-02-08 14:28:37.550
2022-02-08 14:28:37.587

and received at:

2022-02-08 14:28:37.633
2022-02-08 14:28:37.700

so less than the 1000 millisecond TTL elapses between publish and consume.
I fear that the problem could be related to the LB routing the client to a different server of the cluster. Is there any way to log that?
The cache is shared across nodes and it's transaction-based; it's not local per node.

If it were your LB sending the client to different nodes, you would notice a much higher amount of duplicates, I guess?
Ok, then I have no clue... according to the documentation and to what you wrote me above it should be working... and it's actually fine most of the time.
I will try to reproduce locally and perform some large-scale testing. I never observed such behaviour before in my local tests.
Unfortunately, I am quite swamped at the moment. I hope I get some spare time at the beginning of March for this.
Are you sure the client is not overriding the TTL and the cache is not overflowing? You can check the cache size by querying the exchange info.
Example:

```
$ curl -s -u guest:guest http://localhost:15672/api/exchanges | jq '.[] | select(.name=="exchange-name")'
{
  "arguments": {
    "x-cache-size": 1000,
    "x-cache-ttl": 100000
  },
  "cache_info": {
    "bytes": 3224,
    "entries": 9,
    "size": 1000
  }
  ...
}
```
Ok, thank you... I will try to query the exchange to double-check those params.

I have checked the params and they look correct:
```
{
  "arguments": {
    "x-cache-size": 10000,
    "x-cache-ttl": "1000"
  },
  "auto_delete": false,
  "cache_info": {
    "bytes": 2472,
    "entries": 0,
    "size": 10000
  },
  "durable": true,
  "internal": false,
  "message_stats": {
    "publish_in": 4956,
    "publish_in_details": {
      "rate": 0.0
    },
    "publish_out": 2500,
    "publish_out_details": {
      "rate": 0.0
    }
  },
  "name": "ex-Gateway-1",
  "type": "x-message-deduplication",
  "user_who_performed_action": "myuser",
  "vhost": "SAC"
}
```
I will recheck them while stressing the system to see if the number of elements in the cache gets too high
It would be great to get those numbers especially close to a moment in which we see duplicates occurring.
On my side I'll try spinning up a `docker-compose` environment and bombarding the system, but I can only start from March.
Yes, I think I can do it... next time I start a massive test I will keep monitoring those values.
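A hedged sketch of such monitoring, assuming Python with the requests library and the management API shown earlier (vhost, exchange name and credentials follow the exchange info above):

```python
import time
import requests

# Poll the management API and log the dedup cache usage over time.
URL = "http://localhost:15672/api/exchanges/SAC/ex-Gateway-1"

while True:
    info = requests.get(URL, auth=("guest", "guest")).json()
    cache = info.get("cache_info", {})
    print(time.strftime("%H:%M:%S"),
          "entries:", cache.get("entries"),
          "bytes:", cache.get("bytes"))
    time.sleep(1)
```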
I'm facing this issue too.

I have an exchange named `deduplicator`. I've set the relevant arguments such as `x-cache-persistence: disk`, and bound it to a queue named `temp` (usually it's just bound to `msgs` but I added `temp` here for the demo).

The `temp` queue received 16 messages having the same deduplication header, as shown via "find in page".
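A hedged sketch reproducing that setup with pika, assuming the plugin's `x-deduplication-header` message header (cache size, routing key and payload are illustrative):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Deduplicating exchange persisting its cache to disk.
channel.exchange_declare(
    exchange="deduplicator",
    exchange_type="x-message-deduplication",
    durable=True,
    arguments={
        "x-cache-size": 1000,          # illustrative
        "x-cache-persistence": "disk",
    },
)
channel.queue_declare(queue="temp", durable=True)
channel.queue_bind(queue="temp", exchange="deduplicator", routing_key="")

# Publish the same deduplication key twice; only the first message
# should reach the queue while the cache entry is alive.
for _ in range(2):
    channel.basic_publish(
        exchange="deduplicator",
        routing_key="",
        body=b'{"demo": true}',
        properties=pika.BasicProperties(
            headers={"x-deduplication-header": "same-key"},
        ),
    )
connection.close()
```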
Hello, I have the following setup:

RabbitMQ 3.8.27 (Erlang 23.2) running on Windows Server 2016. I've recently upgraded to Elixir 1.12.2 and deduplication plugin 0.5.3.

I run a 3-node cluster with the following policy to mirror queues; exchanges are not mirrored:

I have an exchange configured in the following way:

The deduplication works fine most of the time, but I have some messages not correctly deduplicated and consumed twice.

For example, yesterday I had 4 cases out of ~9000 unique messages; in one of the cases the publish timestamps were:

2022-02-08 14:28:37.550
2022-02-08 14:28:37.587

so it should be handled by the configured TTL.

The deduplication key is a string (something like 101.15010) and the average payload is a JSON of about 600 bytes, maximum 1 KB.

Do you have any advice on how this could be solved, or on how to get some internal logs from the plugin? Thank you, Roberto
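A hedged sketch of a publisher matching that description, assuming pika and the plugin's `x-deduplication-header` (exchange name taken from the exchange info above; key and payload are illustrative):

```python
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A ~600-byte JSON payload in practice; shortened here.
payload = json.dumps({"device": "101.15010", "value": 42}).encode()

channel.basic_publish(
    exchange="ex-Gateway-1",
    routing_key="",
    body=payload,
    properties=pika.BasicProperties(
        headers={"x-deduplication-header": "101.15010"},  # string dedup key
        delivery_mode=2,  # persistent; illustrative
    ),
)
connection.close()
```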