nats-io / nats-streaming-server

NATS Streaming System Server
https://nats.io
Apache License 2.0

Understanding Cluster ID usage in SuperCluster #1174

Closed arpitkh96 closed 3 years ago

arpitkh96 commented 3 years ago

I am trying to implement HA using two NATS clusters connected through gateways. For a given topic, a cluster may have a local subscriber. If no local subscriber is present, the subscriber in the other cluster (same qgroup) should handle the message. I have two NATS (raft) clusters in a supercluster, connected using the public IP of each NATS instance.

Setup 1

   producer   ->   (nats-streaming-cluster-1)  <===>    (nats-streaming-cluster-2)  -> subscriber 

I tried the following cases

  1. Publisher and subscriber using same cluster ID
    The messages are transmitted to the second cluster
  2. Publisher and subscriber using their corresponding local cluster ID
    The messages are not transmitted to the second cluster

Setup 2

   producer   ->   (nats-streaming-cluster-1)  <===>    (nats-streaming-cluster-2)  -> subscriber 2
                             /
                           /
                       Subscriber 1
  1. All Publisher and subscribers using same cluster ID.
    • Case 1: Same qgroup name
      The messages are load balanced between subscribers 1 and 2.
    • Case 2: Different qgroup name
      All the messages are received by both consumers.
  2. Publisher and subscriber using their corresponding local cluster ID
    • Case 1: Same qgroup name
      Messages are not transmitted to subscriber 2 (only subscriber 1 gets them). If I kill subscriber 1 and publish more messages, the new messages do not reach subscriber 2.
    • Case 2: Different qgroup name
      Messages are not transmitted to subscriber 2 (only subscriber 1 gets them).

I am not able to simulate the expected Qgroup behavior in NATS supercluster. What am I missing? I expect messages to be delivered to the local subscriber if it is alive, and otherwise to be proxied to the second subscriber.

Configuration: 3 NATS instances in each cluster, each with its own public IP. Every instance has a statically configured gateway section listing all the public IPs.

kozlovic commented 3 years ago

@arpitkh96 Just to be sure: you are using different cluster IDs for those two clusters, right? If not, there is no need to proceed further, since this is a misconfiguration that will lead to very bad side effects, especially if the peer names are the same in each cluster.

Assuming that you are using different cluster IDs, I am going to try to answer your questions:

Publisher and subscriber using same cluster ID The messages are transmitted to the second cluster

If you mean that you have a producer on "cluster1" side and a consumer on "cluster2" side but both publisher and consumer use the same cluster ID in their Connect() call, then it is expected that the consumer receives the message(s) because they are NATS connected and messages can flow. However, the messages should be persisted in only one cluster, not both; otherwise I think you have a misconfiguration and are likely using the same cluster ID in all 6 servers.

Publisher and subscriber using their corresponding local cluster ID The messages are not transmitted to the second cluster

It just means that, say, the publisher is on the "cluster1" side and connects with clusterID "cluster1", while the consumer is on the "cluster2" side and connects with clusterID "cluster2"; they therefore don't exchange data, since they are different clusters. It does not matter where the publisher and consumer are: they could be located on the same machine and still would not see each other's traffic if the cluster IDs are different (isolation).
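
The isolation described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not the server's code: `ToyBus`, `ToyStreamingCluster`, and `connect` are hypothetical names, and the one detail borrowed from the real client is the assumption that a connect request is sent to a discovery subject that embeds the cluster ID (`_STAN.discover.<clusterID>` with the default prefix).

```python
# Toy model (hypothetical names): one shared NATS "bus" and two streaming
# clusters that each answer connect requests only for their own cluster ID.

DISCOVER_PREFIX = "_STAN.discover"  # assumed default discovery prefix


class ToyBus:
    """Stands in for the NATS layer: maps a subject to one handler."""

    def __init__(self):
        self.handlers = {}

    def subscribe(self, subject, handler):
        self.handlers[subject] = handler

    def request(self, subject, payload):
        handler = self.handlers.get(subject)
        # None means nobody is listening on that subject.
        return handler(payload) if handler else None


class ToyStreamingCluster:
    """Answers connect requests addressed to its own cluster ID only."""

    def __init__(self, bus, cluster_id):
        self.cluster_id = cluster_id
        self.clients = []
        bus.subscribe(f"{DISCOVER_PREFIX}.{cluster_id}", self._on_connect)

    def _on_connect(self, client_id):
        self.clients.append(client_id)
        return self.cluster_id


def connect(bus, cluster_id, client_id):
    """Return the ID of the cluster that accepted the client, or None."""
    return bus.request(f"{DISCOVER_PREFIX}.{cluster_id}", client_id)


bus = ToyBus()
c1 = ToyStreamingCluster(bus, "cluster1")
c2 = ToyStreamingCluster(bus, "cluster2")

# Both clients share the same bus, yet each is served by a different cluster.
served_pub = connect(bus, "cluster1", "publisher")
served_sub = connect(bus, "cluster2", "subscriber")
print(served_pub, served_sub)
```

In this model the publisher and subscriber share the same transport, but because they name different cluster IDs they are registered with different streaming clusters and never share a channel.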

I am not able to simulate the expected Qgroup behavior in NATS supercluster. What am I missing?

That NATS Streaming is not core NATS. The streaming server is not a NATS server. It is a library that uses NATS to implement persistence. See https://docs.nats.io/nats-streaming-concepts/intro and https://docs.nats.io/nats-streaming-concepts/relation-to-nats.

Therefore, the streaming QueueSubscribe() call does not create a core NATS queue subscription; it is still a regular NATS subscription. The Streaming server (a library, really) will pick one of the subscriptions from the group and send the message to its "private" inbox. So you do not get the behavior of core NATS queue groups across a super-cluster.
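
To make the difference concrete, here is a minimal toy comparison of the two delivery strategies. It is a sketch, not either server's implementation: all names are hypothetical, the round-robin choice in the streaming model is an illustrative assumption, and the core NATS model takes the first candidate only to keep the output deterministic.

```python
# Toy comparison (hypothetical names) of how a queue member gets chosen.


def core_nats_queue_pick(members, origin_cluster):
    """Core NATS across gateways: prefer a queue member in the publisher's
    cluster and fall back to a remote member only if no local one exists.
    Taking the first candidate is a simplification for determinism."""
    local = [m for m in members if m["cluster"] == origin_cluster]
    candidates = local or members
    return candidates[0]["name"] if candidates else None


class ToyStreamingQueueGroup:
    """Streaming: the server tracks group members itself and sends each
    message to one member's private inbox, with no notion of locality.
    Round-robin selection is an assumption made for illustration."""

    def __init__(self, members):
        self.members = list(members)
        self._next = 0

    def pick_inbox(self):
        member = self.members[self._next % len(self.members)]
        self._next += 1
        return member["inbox"]


members = [
    {"name": "subscriber1", "cluster": "cluster1", "inbox": "_INBOX.sub1"},
    {"name": "subscriber2", "cluster": "cluster2", "inbox": "_INBOX.sub2"},
]

# Publisher sits in cluster1 and sends four messages.
core_choices = [core_nats_queue_pick(members, "cluster1") for _ in range(4)]

group = ToyStreamingQueueGroup(members)
streaming_choices = [group.pick_inbox() for _ in range(4)]

# Core NATS failover: subscriber1 is gone, so the remote member is used.
failover_choice = core_nats_queue_pick(members[1:], "cluster1")
```

In this model the core NATS strategy keeps choosing subscriber1 while it exists and switches to subscriber2 only once it is gone, whereas the streaming model alternates between the two inboxes regardless of where the members are located.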

All that said, if you are just starting with NATS Streaming, I would recommend that you look at JetStream instead (https://docs.nats.io/whats_new_22#next-generation-streaming) since it will replace NATS Streaming and will address some of the things you are trying to do with NATS Streaming that won't work.

arpitkh96 commented 3 years ago

@kozlovic I have different cluster IDs for each cluster (nats-streaming-cluster-1 and nats-streaming-cluster-2 are the cluster IDs). Sorry for the lack of clarity in the question. I am trying to implement HA in nats-streaming with durable queues and offsets (not for core NATS).

I am getting the same behavior as you described in the following case. Only one cluster shows the topic descriptor on the monitoring endpoint.

If you mean that you have a producer on "cluster1" side and a consumer on "cluster2" side but both publisher and consumer use the same cluster ID in their Connect() call, then it is expected that the consumer receives the message(s) because they are NATS connected and messages can flow.

Coming to your second answer

It just means that, say, the publisher is on the "cluster1" side and connects with clusterID "cluster1", while the consumer is on the "cluster2" side and connects with clusterID "cluster2"; they therefore don't exchange data, since they are different clusters. It does not matter where the publisher and consumer are: they could be located on the same machine and still would not see each other's traffic if the cluster IDs are different (isolation).

My expectation was that, in my setup, if the publisher in cluster-1 sends a message (using the STAN client), it should be delivered to a local consumer if present. Otherwise the message should be delivered to a consumer in cluster-2. If the clusters are isolated, how can I get this expected behavior?

Therefore, the streaming QueueSubscribe() call does not create a core NATS queue subscription; it is still a regular NATS subscription. The Streaming server (a library, really) will pick one of the subscriptions from the group and send the message to its "private" inbox. So you do not get the behavior of core NATS queue groups across a super-cluster.

Not sure if I understand this right. I am not expecting the messages to be load balanced between consumer-1 and consumer-2 (in different clusters). I am expecting failover to work if consumer-1 (local) goes down.

PS: I will explore JetStream next.

kozlovic commented 3 years ago

My expectation was that, in my setup, if the publisher in cluster-1 sends a message (using the STAN client), it should be delivered to a local consumer if present. Otherwise the message should be delivered to a consumer in cluster-2. If the clusters are isolated, how can I get this expected behavior?

No, because this is specific to core NATS queue subscriptions. Hence my comment that in Streaming, QueueSubscribe() still creates a regular subscription on some inbox. The Streaming server knows the members of the queue group, picks a consumer, and sends the message to that consumer's inbox.

So there is no way to support the "queue group super-cluster failover behavior".

Not sure if I understand this right. I am not expecting the messages to be load balanced between consumer-1 and consumer-2 (in different clusters). I am expecting failover to work if consumer-1 (local) goes down.

I understand what you were saying, but what I am saying is that even if you have several queue members of the same group (for the same cluster ID) spread across different clusters in a super cluster, you will not benefit from the core NATS queue subscription failover behavior with super clusters because internally streaming queue subscriptions are regular subscriptions.

arpitkh96 commented 3 years ago

Thanks for the clarification, @kozlovic. That was really helpful. Closing the issue.