fish68 opened this issue 2 years ago
I think the culprit might lie in the cache being full. Currently, the `x-cache-size` parameter controls the maximum number of entries within the cache. When the cache overflows, an item is removed to make space for the new one. Ideally, the new item should replace the oldest item (LRU caching), but currently this is not the case. This is a limitation of the caching layer I am planning to fix in the future.

So what might be happening is that you have a full cache, a new item comes in and replaces a very recent one, resulting in what you observe.

To reduce the occurrence of such issues I would recommend increasing the `x-cache-size` parameter and seeing if it helps. Right now you have a value of 10000, and if you publish ~9000 unique messages you are pretty close to overflow. If there were old messages already in the cache, this might explain the issue.
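A minimal sketch of such a declaration, assuming the pika Python client and a locally running broker (connection details and the 10x value are illustrative; note that RabbitMQ refuses to redeclare an existing exchange with different arguments, so the exchange would need to be deleted or renamed first):

```python
import pika

# Connect to a local broker; adjust host and credentials as needed.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Declare the deduplicating exchange with a 10x larger cache.
# RabbitMQ rejects this call if an exchange with the same name but
# different arguments already exists.
channel.exchange_declare(
    exchange="ex-Gateway-1",  # exchange name from the thread
    exchange_type="x-message-deduplication",
    durable=True,
    arguments={
        "x-cache-size": 100000,  # was 10000
        "x-cache-ttl": 1000,     # milliseconds (see the correction below)
    },
)
connection.close()
```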
Hi, thank you for the very quick reply. Of course I can try to increase the size, but those 9000 messages were published over a period of 2 hours; I guessed the cache would start to remove the oldest and keep the newest messages. All my duplicated messages are published within a few milliseconds of each other.

My fear is that increasing the cache to a 10x value could solve the issue for the moment, but that I could face a similar issue after a few days of processing.
Nevermind what I just said above. I now recall the `x-cache-ttl` is in milliseconds and not seconds. Hence your entries have a 1 second TTL and not ~20 minutes.
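To make the unit mismatch concrete (a sketch, not from the thread):

```python
# x-cache-ttl is interpreted as milliseconds.
ttl_configured = 1000          # the value in use: 1 second
ttl_intended = 20 * 60 * 1000  # ~20 minutes would be 1200000 ms
```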
If I understand correctly, the timestamps above are generated on the publisher side. Am I correct?

If so, are you sure you are not affected by network-congestion-induced delays?
Consider the following sequence of events:

1. `A` creates message `A1` and timestamps it.
2. `A` publishes the message immediately.
3. `B` creates message `B1` and timestamps it.
4. `B` tries to publish the message but the network is congested and there is a delay.
5. The broker receives `A1`, forwards it and adds it to the cache.
6. The `A1` cache entry expires.
7. `B1` arrives; there is no cache entry for it. The message goes through.

One way you could try debugging that is by using the timestamp exchange to get the message timestamp when it reaches the broker. In this way you could spot if these are the cases which affect you. Obviously, there might still be delays between the time the broker receives the message and the moment it actually compares it with the duplicates cache. This might be due to a very busy broker.
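A hedged consumer-side sketch of such a check with pika, assuming the publisher puts its own epoch-millisecond timestamp in a hypothetical `published_at` header (header name and queue name are illustrative):

```python
import time
import pika

def on_message(channel, method, properties, body):
    # Publisher-side timestamp, assumed to be an epoch-ms header.
    published_ms = (properties.headers or {}).get("published_at")
    if published_ms is not None:
        delay_ms = int(time.time() * 1000) - published_ms
        # Transit delays approaching the 1000 ms x-cache-ttl would
        # explain duplicates slipping through.
        print(f"transit delay: {delay_ms} ms")
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_consume(queue="msgs", on_message_callback=on_message)
channel.start_consuming()
```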
I don't think it's that; I'm logging on the consumer side as well. The messages are published at the following timestamps:

2022-02-08 14:28:37.550
2022-02-08 14:28:37.587

and received at:

2022-02-08 14:28:37.633
2022-02-08 14:28:37.700

so less than the 1000 millisecond TTL elapses between publish and consume.
I fear that the problem could be related to the LB routing the client to a different server of the cluster. Is there any way to log that?
The cache is shared across nodes and it's transaction-based; it's not local per node.

If it were your LB sending the client to different nodes, you would notice a much higher amount of duplicates, I guess?
Ok, then I have no clue... according to the documentation and to what you wrote me above it should be working... and it's actually fine most of the time.
I will try to reproduce locally and perform some large-scale testing. I never observed such behaviour before in my local tests.
Unfortunately, I am quite swamped at the moment. I hope I get some spare time at the beginning of March for this.
Are you sure the client is not overriding the TTL and the cache is not overflowing? You can check the cache size by querying the exchange info.
Example:

```
$ curl -s -u guest:guest http://localhost:15672/api/exchanges | jq '.[] | select(.name=="exchange-name")'
{
  "arguments": {
    "x-cache-size": 1000,
    "x-cache-ttl": 100000
  },
  "cache_info": {
    "bytes": 3224,
    "entries": 9,
    "size": 1000
  }
  ...
}
```
Ok, thank you... I will try to query the exchange to double-check those params.

I have checked the params and they look correct:
```
{
  "arguments": {
    "x-cache-size": 10000,
    "x-cache-ttl": "1000"
  },
  "auto_delete": false,
  "cache_info": {
    "bytes": 2472,
    "entries": 0,
    "size": 10000
  },
  "durable": true,
  "internal": false,
  "message_stats": {
    "publish_in": 4956,
    "publish_in_details": {
      "rate": 0.0
    },
    "publish_out": 2500,
    "publish_out_details": {
      "rate": 0.0
    }
  },
  "name": "ex-Gateway-1",
  "type": "x-message-deduplication",
  "user_who_performed_action": "myuser",
  "vhost": "SAC"
}
```
I will recheck them while stressing the system to see if the number of elements in the cache gets too high
It would be great to get those numbers especially close to a moment in which we see duplicates occurring.
On my side I'll try spinning up a `docker-compose` environment and bombarding the system, but I can only start from March.
Yes, I think I can do it... next time I start a massive test I will keep monitoring those values.
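A hedged sketch of such monitoring, assuming Python with the requests library and the management API shown earlier (vhost, exchange name and credentials follow the exchange info above):

```python
import time
import requests

# Poll the management API and log the dedup cache usage over time.
URL = "http://localhost:15672/api/exchanges/SAC/ex-Gateway-1"

while True:
    info = requests.get(URL, auth=("guest", "guest")).json()
    cache = info.get("cache_info", {})
    print(time.strftime("%H:%M:%S"),
          "entries:", cache.get("entries"),
          "bytes:", cache.get("bytes"))
    time.sleep(1)
```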
I'm facing this issue too.

I have an exchange named `deduplicator`. I've set the relevant arguments such as `x-cache-persistence: disk`, and bound it to a queue named `temp` (usually it's just bound to `msgs` but I added `temp` here for the demo).

The `temp` queue received 16 messages having the same deduplication header, as shown via "find in page".
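A hedged sketch reproducing that setup with pika, assuming the plugin's `x-deduplication-header` message header (cache size, routing key and payload are illustrative):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Deduplicating exchange persisting its cache to disk.
channel.exchange_declare(
    exchange="deduplicator",
    exchange_type="x-message-deduplication",
    durable=True,
    arguments={
        "x-cache-size": 1000,          # illustrative
        "x-cache-persistence": "disk",
    },
)
channel.queue_declare(queue="temp", durable=True)
channel.queue_bind(queue="temp", exchange="deduplicator", routing_key="")

# Publish the same deduplication key twice; only the first message
# should reach the queue while the cache entry is alive.
for _ in range(2):
    channel.basic_publish(
        exchange="deduplicator",
        routing_key="",
        body=b'{"demo": true}',
        properties=pika.BasicProperties(
            headers={"x-deduplication-header": "same-key"},
        ),
    )
connection.close()
```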
Hello, I have the following setup:

RabbitMQ 3.8.27 (Erlang 23.2) running on Windows Server 2016. I've recently upgraded to Elixir 1.12.2 and deduplication plugin 0.5.3.

I run a 3-node cluster with the following policy to mirror queues; exchanges are not mirrored:

I have an exchange configured in the following way:

The deduplication works fine most of the time, but I have some messages not correctly deduplicated and consumed twice.

For example, yesterday I had 4 cases out of ~9000 unique messages; in one of the cases the publish timestamps were:

2022-02-08 14:28:37.550
2022-02-08 14:28:37.587

so it should be handled by the configured TTL.

The deduplication key is a string (something like 101.15010) and the average payload is a JSON of about 600 bytes, maximum 1 KB.

Do you have any advice on how this could be solved, or on how to get some internal logs from the plugin? Thank you, Roberto
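A hedged sketch of a publisher matching that description, assuming pika and the plugin's `x-deduplication-header` (exchange name taken from the exchange info above; key and payload are illustrative):

```python
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A ~600-byte JSON payload in practice; shortened here.
payload = json.dumps({"device": "101.15010", "value": 42}).encode()

channel.basic_publish(
    exchange="ex-Gateway-1",
    routing_key="",
    body=payload,
    properties=pika.BasicProperties(
        headers={"x-deduplication-header": "101.15010"},  # string dedup key
        delivery_mode=2,  # persistent; illustrative
    ),
)
connection.close()
```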