@demelvin the default connectTimeout for protocol exchanges is set to 2000 ms; wondering if it needs to be higher on your setup. This setting controls the protocol exchanges with the nats-streaming-server.
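For reference, this is roughly how that option can be raised on the Node.js side; a minimal sketch assuming the node-nats-streaming client, with the cluster id, client id, and URL as placeholders:

const stan = require('node-nats-streaming');

// Raise the streaming protocol timeout from the 2000 ms default; the value is
// in milliseconds and covers the protocol exchanges with the streaming server.
const sc = stan.connect('test-cluster', 'timeout-test-client', {
  url: 'nats://stan:4222',
  connectTimeout: 10 * 1000,
});

sc.on('connect', () => console.log('streaming connection established'));
sc.on('error', (err) => console.error('connect failed:', err));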
Delete leader pod, client re-connects = expected/perfect
Could you clarify if by that you mean that the NATS connection was reconnected, or the streaming one? As you know, there is no TCP connection between a client and the streaming server, so a streaming connection is more like a "session". So when NATS can reconnect to a NATS Server (even if embedded in Streaming server), this does not mean that the streaming connection between the streaming server and client is actually restored.
Have you checked the server log of the non-deleted pods to see if one of the standbys is properly becoming the active server? You could also run the servers with -SD and get a bit more streaming debugging to see what happens when you restart your Node.js server and it fails to connect (assuming that it reaches any of the existing Streaming pods).
@aricart Thanks for the suggestion, I'll try bumping up the connectTimeout setting on the client and see if that resolves this.
@kozlovic
Could you clarify if by that you mean that the NATS connection was reconnected
The connection I was referring to was specifically the Node.js client connection. After deleting the leader pod the client event "reconnect" fires and the connection to the new leader is successful. I was tailing the logs of each pod while testing but the logs were a bit convoluted because I was running the container with the -DV flag. I'll try running with the -SD flag as you suggested. Hopefully that will reveal an anomaly within my k8s and/or client config.
Unfortunately I'm dealing with production issues this morning. I should be able to test this out and report back my findings later today.
Thanks again for your help guys. It's very much appreciated.
@demelvin just to reiterate, reconnect is just the nats connection; the timeout on the protocol exchange indicates that the nats-streaming-server was not available at the time of the request (even if the nats-server is).
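To make that distinction concrete, here is a minimal, hypothetical sketch of the relevant stan.js event handlers; the cluster id, client id, and URL are placeholders, and the connection_lost event assumes client/server ping support is available in the versions used:

const stan = require('node-nats-streaming');

const sc = stan.connect('test-cluster', 'event-demo-client', { url: 'nats://stan:4222' });

sc.on('reconnect', () => {
  // The low-level NATS TCP connection came back; the streaming "session" on
  // the server may or may not still be valid.
  console.log('nats connection reconnected');
});

sc.on('connection_lost', (err) => {
  // The streaming server stopped responding: the streaming connection is gone
  // even though the NATS connection itself may be fine.
  console.error('streaming connection lost:', err);
});

sc.on('close', () => console.log('connection closed'));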
@aricart I only included the re-connect in my initial comment because I thought it might help; however, the main issue is that when the server is started/restarted it simply will not connect to NATS, and the node process must be restarted a few times before it is successful. My apologies for any confusion there.
I played around with the connectTimeout setting as well as enabled the -SD option on the Stan k8s deployment.
connectTimeout
The connectTimeout setting unfortunately did not change anything. I tried a range with this setting, initially setting it to 5 seconds (5000 millis) and bumping it up all the way to 1 minute. Although the client did wait for the specified period, the timeout eventually happened.
-SD Observations
When connecting I'm using a uuid as the client id, so it should be unique each time the node client attempts to connect. I was not able to see any indication of this client attempting to connect to NATS when the timeout occurs, which is strange. Once the client successfully connects I can see it in the logs:
e.g.
[1] 2020/06/30 03:32:06.582179 [DBG] STREAM: [Client:4289c391-ef23-4dac-84ec-0ca4acb77c47] Connected (Inbox=_INBOX.JR2M92VGT2LOMKX2BV53Y4)
What's even more strange is that successful/unsuccessful connection attempts seem to be completely random. Sometimes the stan.js client connects immediately, sometimes it takes 3-4 attempts 🤷‍♂️
I'm seeing the same type of behavior within the cluster as well as on my local machine, which makes me think that perhaps something is up with the ingress or the k8s service itself. I'm going to go over the setup documentation one more time to see if there is something I missed.
Workaround
This is the workaround I've currently implemented to get around this. It's not ideal but seems to work: if the close event fires, rinse and repeat the above until a successful connection is made (see the sketch below).

Thanks again for your help with this. I'll post again here should I have anything useful to add.
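A rough sketch of that workaround, retrying instead of exiting the process; the cluster id, URL, and delay are placeholders, and the random client id stands in for the uuid mentioned elsewhere in the thread:

const stan = require('node-nats-streaming');

function connectWithRetry(retryDelayMs = 5000) {
  // A throwaway unique client id, standing in for the uuid used in the thread.
  const clientId = 'client-' + Math.random().toString(36).slice(2);

  const sc = stan.connect('test-cluster', clientId, {
    url: 'nats://stan:4222',
    waitOnFirstConnect: true,
  });

  sc.on('connect', () => console.log('connected as', clientId));
  sc.on('error', (err) => console.error('connection error:', err.message));

  // "Rinse and repeat": when the streaming connection closes, try again
  // after a short delay instead of exiting the process.
  sc.on('close', () => setTimeout(() => connectWithRetry(retryDelayMs), retryDelayMs));
}

connectWithRetry();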
When connecting I'm using a uuid
The client UUID can be (and in some cases should be) stable across process invocations. NATS doesn't care about the client id (it does report a client name if you set it on the nats connection options). You might consider specifying the same UUID as the stan client ID and as the client name.
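A hypothetical sketch of that suggestion, reusing one identifier as both the stan client id and the NATS connection name option; the id value and URL are placeholders:

const stan = require('node-nats-streaming');

// One stable identifier, used both as the streaming client id and as the
// NATS connection name that the server reports for this connection.
const clientId = 'orders-service-1';

const sc = stan.connect('test-cluster', clientId, {
  url: 'nats://stan:4222',
  name: clientId,
});

sc.on('connect', () => console.log('connected as', clientId));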
Just to make sure, I did a small gist that sets up 2 nats-streaming-servers in cluster mode using FT. You can control-C the servers one at a time, both, etc., and see what the client is doing. The client publishes a message every second, so the expected behaviour is that you see a message printed:
connected to nats://localhost:4222
[13]: 1
[14]: 2
[15]: 3
[16]: 4
[17]: 5
[18]: 6
[19]: 7
disconnected
reconnected to nats://localhost:2224
[20]: 8
[21]: 9
[22]: 10
[23]: 11
[24]: 12
[25]: 13
...
[41]: 29
[42]: 30
[43]: 31
[44]: 32
[45]: 33
[46]: 34
[47]: 35
[48]: 36
[49]: 37
^C
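The gist itself is not reproduced here, but a rough approximation of that client looks like the following; the URLs, subject, and printed format are assumptions and will not match the gist's output exactly:

const stan = require('node-nats-streaming');

const sc = stan.connect('test-cluster', 'ft-probe', {
  servers: ['nats://localhost:4222', 'nats://localhost:2224'],
  maxReconnectAttempts: -1, // keep retrying while servers are killed/restarted
});

let counter = 0;

sc.on('connect', () => {
  console.log('connected');
  // Publish a message every second; each ack (or error) is printed so the
  // effect of killing/restarting the FT servers is visible.
  setInterval(() => {
    const n = ++counter;
    sc.publish('ft.check', String(n), (err, guid) => {
      if (err) return console.error(`publish ${n} failed: ${err.message}`);
      console.log(`[${guid}]: ${n}`);
    });
  }, 1000);
});

sc.on('disconnect', () => console.log('disconnected'));
sc.on('reconnect', () => console.log('reconnected'));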
@aricart Thank you for providing that gist. Running this locally I can see the desired FT behavior.
After looking over the NATS streaming FT documentation and making some changes to my k8s manifest the client is connecting just fine now. The close event does still fire every once in a while when I test pod deletions (e.g. stan-0, stan-1, stan-2) but ultimately this is a k8s misconfiguration and not a stan.js client issue.
In case this might help somebody else: I added the following to my container args within the k8s deployment manifest. After doing this the client is able to connect successfully on server startup/restart without issue.
- '--ft_group'
- 'ft'
- '--cluster'
- 'nats://0.0.0.0:6222'
- '--cluster_node_id'
- '$(POD_NAME)'
- '--routes'
- 'nats://stan:6222'
- '--cluster_peers'
- 'stan-0, stan-1, stan-2'
Closing issue. Thanks again for all your help.
@demelvin Following the link that you originally posted, it does already configure the ft_group_name, so you should not have to pass additional args? (https://docs.nats.io/nats-on-kubernetes/stan-ft-k8s-aws#setting-up-the-nats-streaming-cluster) Maybe @wallyqs can comment on that.
Definitely, you should NOT pass --cluster_peers, simply because this is for Streaming clustering mode, which is the opposite of running in FT mode (it won't hurt specifying it and it will be ignored, but it could be misleading).
@kozlovic Thanks for the suggestion. I edited my original comment, removing the cluster_peers argument.
I found that when not providing the --cluster, --routes, and --cluster_node_id options in the k8s config I was basically back at square one, with the stan.js node client connecting only sometimes.
I'll play around with it some more. Thanks again.
I am happy that it works for you and could leave it at that, but again, if you followed https://docs.nats.io/nats-on-kubernetes/stan-ft-k8s-aws, I see that it already has the cluster/routes configured:
cluster {
port: 6222
routes [
nats://stan:6222
]
cluster_advertise: $CLUSTER_ADVERTISE
connect_retries: 10
}
So you should not have to pass it through command line.
And again, --cluster_node_id is for NATS Streaming Cluster mode, which is different from FT mode, and should not be used in that case.
Maybe if you were to provide the whole config, @wallyqs can have a look and see what's wrong.
@kozlovic I'd be happy to provide my config, any help to resolve is welcome here.
Several things to note here:
- I'm not using a ConfigMap but rather just passing the config options as args to the container within the StatefulSet.
Please let me know if you see any improvements or changes that need to be made with this setup. I did not see an option to provide the cluster_advertise or connect_retries to the docker image, perhaps that is the issue.
---
apiVersion: v1
kind: Service
metadata:
name: stan
labels:
app: stan
spec:
selector:
app: stan
clusterIP: None
ports:
- name: client
port: 4222
- name: cluster
port: 6222
- name: monitor
port: 8222
- name: metrics
port: 7777
- name: console
port: 8282
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: stan
labels:
app: stan
spec:
selector:
matchLabels:
app: stan
serviceName: stan
replicas: 1
template:
metadata:
labels:
app: stan
spec:
terminationGracePeriodSeconds: 30
containers:
- name: stan
image: nats-streaming:latest
ports:
# In case of NATS embedded mode expose these ports
- containerPort: 4222
name: client
- containerPort: 6222
name: cluster
- containerPort: 8222
name: monitor
args:
- "--cluster_id"
- "test-cluster"
- "--store"
- "sql"
- "--sql_driver"
- "postgres"
- "--sql_source"
- "<DB_CONNECTION_URL_GOES_HERE>"
- "-p"
- "4222"
- '--max_channels'
- '0'
- '--max_subs'
- '0'
- '--max_msgs'
- '0'
- '--ft_group'
- 'test-cluster'
- '--cluster'
- 'nats://0.0.0.0:6222'
- '--cluster_node_id'
- '$(POD_NAME)'
- '--routes'
- 'nats://stan:6222'
- '-SDV'
.....
const sc = stan.connect(`test-cluster`, `test-client`, {
url: 'nats://stan:4222',
maxReconnectAttempts: 10,
reconnectTimeWait: 5 * 1000,
json: true,
waitOnFirstConnect: true,
});
Hope this reveals something incorrect with the config or setup. Please let me know if something sticks out.
Thanks again everyone.
Yes, some of the configuration cannot be set through the command line, so you would need to use a configuration file. I don't know much about k8s so I would let @wallyqs comment, but I see several things:
You have replicas: 1, so it looks to me that there would be only one instance, so really if that pod dies, there is no server running?

The documentation also defines these environment variables:

# Required to be able to define an environment variable
# that refers to other environment variables. This env var
# is later used as part of the configuration file.
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: CLUSTER_ADVERTISE
value: $(POD_NAME).stan.$(POD_NAMESPACE).svc
...
That's how then in the config you would be able to specify cluster_advertise: $CLUSTER_ADVERTISE and the routes would be pointing to any stan pod.
- '--cluster_node_id'
- '$(POD_NAME)'
This is clustering mode, not FT. I believe that you said if you remove this it "does not work", right?
Thanks for the quick reply.
you have replicas: 1
Sorry, that was a miscopy from my production yaml; replicas: 3 is what I'm running in staging. Nice catch.
I still don't understand why you are required to have this in your config:
'--cluster_node_id'
I'll remove that --cluster_node_id, no worries there. Out of desperation I started adding random cluster args for testing. If this isn't needed then it's officially gone.
cluster_advertise
It sounds like cluster_advertise: $CLUSTER_ADVERTISE is a requirement for FT then? If so, my best guess is that is the issue here. I was trying to avoid creating a ConfigMap resource in the cluster, but if this is the only way then it's 100% worth a shot and also not a big deal to add this.
I'll try it and let you know the results.
You can specify -cluster_advertise from the command line. The idea behind cluster_advertise is that the IP that is gossiped to client connections will be used by the library to auto-reconnect. If the IP is not accessible from outside of the server pods, clients would try to reconnect to an IP that has no meaning outside of the pods. Cluster advertise solves this issue.
Now, if you configure the clients with the list of correct IPs of the streaming active/standby servers, then that would not be needed.
-cluster_advertise from the command line
Interesting, I didn't see that in the docker documentation. Good to know, thank you for that.
Based on your previous comment, if I understand this correctly:

cluster_advertise - Advertise the available nats streaming pods from within the k8s cluster. This will allow other pods using the stan.js client to connect to the Stan leader without the need for a list of Stan pod IPs when configuring the client.

Outside of the cluster, cluster advertise is meaningless to the stan.js client and the Ingress IP address is needed to connect. My question is, when the client is specifying a single IP address, how does the k8s service know which nats streaming pod within the ReplicaSet to establish the client connection with? client -> service -> [stan-0, stan-1, or stan-2]
Interesting, I didn't see that in the docker documentation. Good to know thank you for that.
That's because this is more of a NATS Server configuration. NATS Streaming does not repeat all configuration options that are available to NATS Server. Sorry about that.
cluster_advertise - Advertise the available nats streaming pods from within the k8s cluster. This will allow other pods using the stan.js client to connect to the Stan leader without the need for a list of Stan pod IPs when configuring the client.
Not really. Again, this is low-level NATS Server. What that does is that if a client connects to one server that is clustered (here server means either NATS Server or NATS Streaming Server in your case, since you run the embedded version), that server will return to the client library the list of URLs of the other servers, so that if the client were to be disconnected, it could try to reconnect to any of the URLs. Once you have a cluster of servers, it doesn't really matter which server a client connects to; servers will route the traffic appropriately. For sure, if a client connects directly to the Streaming active server, there is a bit of an advantage in reducing the number of hops the messages have to go through.
@demelvin I am so sorry, I led you the wrong way. What I described is the behavior of "client_advertise", not "cluster_advertise". Cluster advertise is similar but really for server routes. Suppose you have servers S1 and S2 with public IPs of PUB1 and PUB2 respectively, and private IPs PRIV1 and PRIV2. Then S1 connects to S2 and forms a cluster. Now say you want to add a server S3 to the cluster, but this server does not have access to the private IPs of S1 and S2. If S3 connects to, say, PUB2 (the public IP of S2), as part of the gossip protocol S2 will tell S3 to connect to S1 to form a full mesh. But S2 will send to S3 the IP of S1 based on the remote address it got when accepting the connection from S1. If that is the private IP, then when S3 tries to connect to S1, this may fail. The cluster_advertise is a way to provide a desired host:port that servers should gossip to each other. In the above example, if S1 had configured cluster_advertise with PUB1, then S2 would have sent that (and not the remote IP address) to S3, which then would have been able to create the connection.
To sum up, if you are able to configure a full mesh cluster without cluster_advertise, then you don't need to add that. Also, if your clients can resolve the IPs returned by each pod, then you don't need client_advertise either.
A simple way to test would be, from a machine where your app is normally running, to "telnet <host> <port>", say using the url info your client normally uses. You should receive in the telnet session an INFO block that will contain a field called connect_urls. If the IPs listed there are all valid and reachable from the client host, then you have nothing to do.
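The same check can be scripted; a small sketch using Node's net module, with the host a placeholder for whatever URL your client normally uses:

const net = require('net');

const socket = net.connect({ host: '<MY_INGRESS_IP>', port: 4222 });

socket.once('data', (chunk) => {
  // The server greets every connection with a single line: INFO {...json...}\r\n
  const line = chunk.toString().split('\r\n')[0];
  const info = JSON.parse(line.replace(/^INFO\s+/, ''));
  console.log('connect_urls:', info.connect_urls);
  socket.end();
});

socket.on('error', (err) => console.error(err.message));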
@kozlovic No reason to apologize, thank you for the explanation.
A simple way to test would be, from a machine where your app is normally running, to "telnet <host> <port>", say using the url info your client normally uses. You should receive in the telnet session an INFO block that will contain a field called connect_urls. If the IPs listed there are all valid and reachable from the client host, then you have nothing to do.
Thanks for the tip. I'll try and telnet and see if I can get some more information.
I updated my k8s deployment to use the ConfigMap as described in the docs last night and I was seeing the same behavior as previously described. Everything is set up exactly as described in the documentation, with the exception of SQL store vs file store:
NOTE I'm killing my pods and they fail a health check, so they are restarted and will continue to be until they are able to connect to NATS.
Are there any specific stan.js client options I should be providing in order to ensure that the client will connect/re-connect to the leader when using FT mode with NATS? Unfortunately I'm still having connection issues and I'm afraid I've exhausted all my options. I'm wondering if this is just a behavior when running an embedded NATS server in conjunction with NATS Streaming.
NOTE The client I'm testing with is local and I'm connecting to a k8s cluster, which means it does not have access to the internal IPs of the servers running stan. Regardless of that, I'm seeing the same type of behavior on servers (pods) running inside of the cluster, which tells me something is up, as those servers/pods do have access to those IPs.
I'm thinking I might just take a different route and go with an HA setup if I cannot get the client to work with fault tolerance. I want to be mindful of your time, so at this point if you feel it's better that I just move on I completely understand.
Thanks again.
I am a bit confused when you say "Server": do you mean the NATS server or your server app? Again, if it is one of your apps that cannot reconnect, make sure this is not because of the IP it gets. Alternatively, provide the 3 pods' public URLs to your app so they can use those to try to reconnect. Also check your Streaming servers' logs and make sure that:
The HA setup would possibly have the same issues with the IPs from a client perspective. Also, if you go the HA route, make sure that each node has its own SQL DB store, because unlike with FT (where storage is shared), in clustering mode each server needs to have its own storage (RAFT + streaming stores).
Are there any specific stan.js client options I should be providing in order to ensure that the client will connect/re-connect to the leader when using FT mode with NATS? Unfortunately I'm still having connection issues and I'm afraid I've exhausted all my options. I'm wondering if this is just a behavior when running an embedded NATS server in conjunction with NATS Streaming.
If the NATS host:ports are stable on the different servers, there should be nothing to do; when the cluster changes, the client learns about it and retries. Just to be clear, your maxReconnectAttempts option should be the default (-1).
You can also specify all the nats server host:ports on the client (servers property).
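Put together, a minimal sketch of those two suggestions in the client options; the host names are placeholders for however the three pods are reachable from the app:

const stan = require('node-nats-streaming');

const sc = stan.connect('test-cluster', 'test-client', {
  // Explicit list of NATS host:ports, so reconnects do not depend on the
  // URLs gossiped back by the servers.
  servers: [
    'nats://stan-0.stan:4222',
    'nats://stan-1.stan:4222',
    'nats://stan-2.stan:4222',
  ],
  maxReconnectAttempts: -1, // the default: never give up reconnecting
  waitOnFirstConnect: true,
});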
Also - are any of these localhost or 127.0.0.1 hosts?
Now - while it should be perfectly fine for you to run the NATS server from the embedded nats-streaming-server, have you tried separating the two clusters? If a stan server fails, it shouldn't take out the nats connection.
I'm all set. I was finally able to get this to work properly and with fault tolerance mode.
Just to be clear, your maxReconnectAttempts option should be the default (-1).
This helped with the reconnect. No more dropped connections when killing pods.
Lastly, I was doing something really dumb. I was connecting to the server using the http protocol and not the nats protocol. Once I updated the url to nats://<MY_INGRESS_IP>:4222, boom, everything works. It's always the little things that are overlooked and this was it.
Regardless, thank you both again for your help. I learned a lot about NATS and NATS Streaming during this process and I'm looking forward to pushing this out to production.
@demelvin current clients have started to support bare host:port for the URL - this makes sense (with the exception that the url option is a misnomer). This may make it simpler and prevent that type of error.
Hi all,
I'm having an issue with the clients ability to establish an initial connection when running in fault tolerance (FT) mode within a k8s cluster.
Behavior
Result
- error event fires with the following message:
- close event fires
- process.exit() is called, the server restarts over and over again, repeating the above steps until a connection can be established.

Expected Result
Client connects the first time without having to restart multiple times until a connection is established.
Nats Streaming Version
Docker Image: nats-streaming:latest (Embedded Nats Server)

Client Version
latest (v0.3.3-0)
I've been combing through the docs, github issues, as well as experimenting with various settings on both the deployment and the client. Any ideas on how to solve this?
Thank you in advance for your help.