@demelvin the default connectTimeout for protocol exchanges is set to 2000 ms; wondering if it needs to be higher on your setup. This setting controls the protocol exchanges with the nats-streaming-server.
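For reference, this is roughly how that option can be raised on the Node.js side; a minimal sketch assuming the node-nats-streaming client, with the cluster id, client id, and URL as placeholders:

const stan = require('node-nats-streaming');

// Raise the streaming protocol timeout from the 2000 ms default; the value is
// in milliseconds and covers the protocol exchanges with the streaming server.
const sc = stan.connect('test-cluster', 'timeout-test-client', {
  url: 'nats://stan:4222',
  connectTimeout: 10 * 1000,
});

sc.on('connect', () => console.log('streaming connection established'));
sc.on('error', (err) => console.error('connect failed:', err));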
Delete leader pod, client re-connects = expected/perfect
Could you clarify if by that you mean that the NATS connection was reconnected, or the streaming one? As you know, there is no TCP connection between a client and the streaming server, so a streaming connection is more like a "session". So when NATS can reconnect to a NATS Server (even if embedded in Streaming server), this does not mean that the streaming connection between the streaming server and client is actually restored.
Have you checked the server log of the non-deleted pods to see if one of the standbys is properly becoming the active server? You could also run the servers with -SD and get a bit more streaming debugging to see what happens when you restart your Node.js server and it fails to connect (assuming that it reaches any of the existing Streaming pods).
@aricart Thanks for the suggestion, I'll try bumping up the connectTimeout setting on the client and see if that resolves this.
@kozlovic
Could you clarify if by that you mean that the NATS connection was reconnected
The connection I was referring to was specifically the Node.js client connection. After deleting the leader pod the client event "reconnect" fires and the connection to the new leader is successful. I was tailing the logs of each pod while testing but the logs were a bit convoluted because I was running the container with the -DV flag. I'll try running with the -SD flag as you suggested. Hopefully that will reveal an anomaly within my k8s and/or client config.
Unfortunately I'm dealing with production issues this morning. I should be able to test this out and report back my findings later today.
Thanks again for your help guys. It's very much appreciated.
@demelvin just to reiterate, reconnect is just the nats connection; the timeout on the protocol exchange indicates that the nats-streaming-server was not available at the time of the request (even if the nats-server is).
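To make that distinction concrete, here is a minimal, hypothetical sketch of the relevant stan.js event handlers; the cluster id, client id, and URL are placeholders, and the connection_lost event assumes client/server ping support is available in the versions used:

const stan = require('node-nats-streaming');

const sc = stan.connect('test-cluster', 'event-demo-client', { url: 'nats://stan:4222' });

sc.on('reconnect', () => {
  // The low-level NATS TCP connection came back; the streaming "session" on
  // the server may or may not still be valid.
  console.log('nats connection reconnected');
});

sc.on('connection_lost', (err) => {
  // The streaming server stopped responding: the streaming connection is gone
  // even though the NATS connection itself may be fine.
  console.error('streaming connection lost:', err);
});

sc.on('close', () => console.log('connection closed'));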
@aricart I only included the re-connect in my initial comment because I thought it might help; however, the main issue is that when the server is started/restarted it simply will not connect to NATS, and the node process must be restarted a few times before it is successful. My apologies for any confusion there.
I played around with the connectTimeout setting as well as enabled the -SD option on the Stan k8s deployment.
connectTimeout
The connectTimeout setting unfortunately did not change anything. I tried a range with this setting, initially setting it to 5 seconds (5000 millis) and bumping it up all the way to 1 minute. Although the client did wait for the specified period, the timeout eventually happened.
-SD Observations
When connecting I'm using a uuid as the client id, so it should be unique each time the node client attempts to connect. I was not able to see any indication of this client attempting to connect to NATS when the timeout occurs, which is strange. Once the client successfully connects I can see it in the logs:
e.g.
[1] 2020/06/30 03:32:06.582179 [DBG] STREAM: [Client:4289c391-ef23-4dac-84ec-0ca4acb77c47] Connected (Inbox=_INBOX.JR2M92VGT2LOMKX2BV53Y4)
What's even more strange is that successful/unsuccessful connection attempts seem to be completely random. Sometimes the stan.js client connects immediately, sometimes it takes 3-4 attempts 🤷‍♂️
I'm seeing the same type of behavior within the cluster as well as on my local machine, which makes me think that perhaps something is up with the ingress or the k8s service itself. I'm going to go over the setup documentation one more time to see if there is something I missed.
Workaround
This is the workaround I've currently implemented to get around this. It's not ideal but seems to work: if the close event fires, rinse and repeat the above until a successful connection is made (see the sketch below).

Thanks again for your help with this. I'll post again here should I have anything useful to add.
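A rough sketch of that workaround, retrying instead of exiting the process; the cluster id, URL, and delay are placeholders, and the random client id stands in for the uuid mentioned elsewhere in the thread:

const stan = require('node-nats-streaming');

function connectWithRetry(retryDelayMs = 5000) {
  // A throwaway unique client id, standing in for the uuid used in the thread.
  const clientId = 'client-' + Math.random().toString(36).slice(2);

  const sc = stan.connect('test-cluster', clientId, {
    url: 'nats://stan:4222',
    waitOnFirstConnect: true,
  });

  sc.on('connect', () => console.log('connected as', clientId));
  sc.on('error', (err) => console.error('connection error:', err.message));

  // "Rinse and repeat": when the streaming connection closes, try again
  // after a short delay instead of exiting the process.
  sc.on('close', () => setTimeout(() => connectWithRetry(retryDelayMs), retryDelayMs));
}

connectWithRetry();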
When connecting I'm using a uuid
The client UUID can be (and in some cases should be) stable across process invocations. NATS doesn't care about the client id (it does report a client name if you set it on the nats connection options). You might consider specifying the same UUID as the stan client ID and as the client name.
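A hypothetical sketch of that suggestion, reusing one identifier as both the stan client id and the NATS connection name option; the id value and URL are placeholders:

const stan = require('node-nats-streaming');

// One stable identifier, used both as the streaming client id and as the
// NATS connection name that the server reports for this connection.
const clientId = 'orders-service-1';

const sc = stan.connect('test-cluster', clientId, {
  url: 'nats://stan:4222',
  name: clientId,
});

sc.on('connect', () => console.log('connected as', clientId));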
Just to make sure, I did a small gist that sets up 2 nats-streaming-servers in cluster mode using FT. You can control-C the servers one at a time, both, etc., and see what the client is doing. The client publishes a message every second, so the expected behaviour is that you see a message printed:
connected to nats://localhost:4222
[13]: 1
[14]: 2
[15]: 3
[16]: 4
[17]: 5
[18]: 6
[19]: 7
disconnected
reconnected to nats://localhost:2224
[20]: 8
[21]: 9
[22]: 10
[23]: 11
[24]: 12
[25]: 13
...
[41]: 29
[42]: 30
[43]: 31
[44]: 32
[45]: 33
[46]: 34
[47]: 35
[48]: 36
[49]: 37
^C
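The gist itself is not reproduced here, but a rough approximation of that client looks like the following; the URLs, subject, and printed format are assumptions and will not match the gist's output exactly:

const stan = require('node-nats-streaming');

const sc = stan.connect('test-cluster', 'ft-probe', {
  servers: ['nats://localhost:4222', 'nats://localhost:2224'],
  maxReconnectAttempts: -1, // keep retrying while servers are killed/restarted
});

let counter = 0;

sc.on('connect', () => {
  console.log('connected');
  // Publish a message every second; each ack (or error) is printed so the
  // effect of killing/restarting the FT servers is visible.
  setInterval(() => {
    const n = ++counter;
    sc.publish('ft.check', String(n), (err, guid) => {
      if (err) return console.error(`publish ${n} failed: ${err.message}`);
      console.log(`[${guid}]: ${n}`);
    });
  }, 1000);
});

sc.on('disconnect', () => console.log('disconnected'));
sc.on('reconnect', () => console.log('reconnected'));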
@aricart Thank you for providing that gist. Running this locally I can see the desired FT behavior.
After looking over the NATS streaming FT documentation and making some changes to my k8s manifest the client is connecting just fine now. The close event does still fire every once in a while when I test pod deletions (e.g. stan-0, stan-1, stan-2) but ultimately this is a k8s misconfiguration and not a stan.js client issue.
In case this might help somebody else: I added the following to my container args within the k8s deployment manifest. After doing this the client is able to connect successfully on server startup/restart without issue.
- '--ft_group'
- 'ft'
- '--cluster'
- 'nats://0.0.0.0:6222'
- '--cluster_node_id'
- '$(POD_NAME)'
- '--routes'
- 'nats://stan:6222'
- '--cluster_peers'
- 'stan-0, stan-1, stan-2'
Closing issue. Thanks again for all your help.
@demelvin Following the link that you originally posted, it does already configure the ft_group_name, so you should not have to pass additional args? (https://docs.nats.io/nats-on-kubernetes/stan-ft-k8s-aws#setting-up-the-nats-streaming-cluster) Maybe @wallyqs can comment on that.
Definitely, you should NOT pass --cluster_peers, simply because this is for Streaming clustering mode, which is the opposite of running in FT mode (it won't hurt specifying it and it will be ignored, but it could be misleading).
@kozlovic Thanks for the suggestion. I edited my original comment, removing the cluster_peers argument.
I found that when not providing the --cluster, --routes, and --cluster_node_id options in the k8s config I was basically back at square one, with the stan.js node client connecting only sometimes.
I'll play around with it some more. Thanks again.
I am happy that it works for you and could leave it at that, but again, if you followed https://docs.nats.io/nats-on-kubernetes/stan-ft-k8s-aws, I see that it already has the cluster/routes configured:
cluster {
port: 6222
routes [
nats://stan:6222
]
cluster_advertise: $CLUSTER_ADVERTISE
connect_retries: 10
}
So you should not have to pass it through command line.
And again, --cluster_node_id is for NATS Streaming Cluster mode, which is different from FT mode, and should not be used in that case.
Maybe if you were to provide the whole config, @wallyqs can have a look and see what's wrong.
@kozlovic I'd be happy to provide my config, any help to resolve is welcome here.
Several things to note here:
- I'm not using a ConfigMap but rather just passing the config options as args to the container within the StatefulSet.
Please let me know if you see any improvements or changes that need to be made with this setup. I did not see an option to provide the cluster_advertise or connect_retries to the docker image, perhaps that is the issue.
---
apiVersion: v1
kind: Service
metadata:
name: stan
labels:
app: stan
spec:
selector:
app: stan
clusterIP: None
ports:
- name: client
port: 4222
- name: cluster
port: 6222
- name: monitor
port: 8222
- name: metrics
port: 7777
- name: console
port: 8282
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: stan
labels:
app: stan
spec:
selector:
matchLabels:
app: stan
serviceName: stan
replicas: 1
template:
metadata:
labels:
app: stan
spec:
terminationGracePeriodSeconds: 30
containers:
- name: stan
image: nats-streaming:latest
ports:
# In case of NATS embedded mode expose these ports
- containerPort: 4222
name: client
- containerPort: 6222
name: cluster
- containerPort: 8222
name: monitor
args:
- "--cluster_id"
- "test-cluster"
- "--store"
- "sql"
- "--sql_driver"
- "postgres"
- "--sql_source"
- "<DB_CONNECTION_URL_GOES_HERE>"
- "-p"
- "4222"
- '--max_channels'
- '0'
- '--max_subs'
- '0'
- '--max_msgs'
- '0'
- '--ft_group'
- 'test-cluster'
- '--cluster'
- 'nats://0.0.0.0:6222'
- '--cluster_node_id'
- '$(POD_NAME)'
- '--routes'
- 'nats://stan:6222'
- '-SDV'
.....
const sc = stan.connect(`test-cluster`, `test-client`, {
url: 'nats://stan:4222',
maxReconnectAttempts: 10,
reconnectTimeWait: 5 * 1000,
json: true,
waitOnFirstConnect: true,
});
Hope this reveals something incorrect with the config or setup. Please let me know if something sticks out.
Thanks again everyone.
Yes, some of the configuration cannot be set through the command line, so you would need to use a configuration file. I don't know much about k8s so I would let @wallyqs comment, but I see several things:
You have replicas: 1, so it looks to me that there would be only one instance, so really if that pod dies, there is no server running?

The documentation also defines these environment variables:

# Required to be able to define an environment variable
# that refers to other environment variables. This env var
# is later used as part of the configuration file.
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: CLUSTER_ADVERTISE
value: $(POD_NAME).stan.$(POD_NAMESPACE).svc
...
That's how then in the config you would be able to specify cluster_advertise: $CLUSTER_ADVERTISE and the routes would be pointing to any stan pod.
- '--cluster_node_id'
- '$(POD_NAME)'
This is clustering mode, not FT. I believe that you said if you remove this it "does not work", right?
Thanks for the quick reply.
you have replicas: 1
Sorry, that was a miscopy from my production yaml; replicas: 3 is what I'm running in staging. Nice catch.
I still don't understand why you are required to have this in your config:
'--cluster_node_id'
I'll remove that --cluster_node_id, no worries there. Out of desperation I started adding random cluster args for testing. If this isn't needed then it's officially gone.
cluster_advertise
It sounds like cluster_advertise: $CLUSTER_ADVERTISE is a requirement for FT then? If so, my best guess is that is the issue here. I was trying to avoid creating a ConfigMap resource in the cluster, but if this is the only way then it's 100% worth a shot and also not a big deal to add this.
I'll try it and let you know the results.
You can specify -cluster_advertise from the command line. The idea behind cluster_advertise is that the IP that is gossiped to client connections will be used by the library to auto-reconnect. If the IP is not accessible from outside of the server pods, clients would try to reconnect to an IP that has no meaning outside of the pods. Cluster advertise solves this issue.
Now, if you configure the clients with the list of correct IPs of the streaming active/standby servers, then that would not be needed.
-cluster_advertise from the command line
Interesting, I didn't see that in the docker documentation. Good to know, thank you for that.
Based on your previous comment, if I understand this correctly:

cluster_advertise - Advertise the available nats streaming pods from within the k8s cluster. This will allow other pods using the stan.js client to connect to the Stan leader without the need for a list of Stan pod IPs when configuring the client.

Outside of the cluster, cluster advertise is meaningless to the stan.js client and the Ingress IP address is needed to connect. My question is, when the client is specifying a single IP address, how does the k8s service know which nats streaming pod within the ReplicaSet to establish the client connection with? client -> service -> [stan-0, stan-1, or stan-2]
Interesting, I didn't see that in the docker documentation. Good to know thank you for that.
That's because this is more of a NATS Server configuration. NATS Streaming does not repeat all configuration options that are available to NATS Server. Sorry about that.
cluster_advertise - Advertise the available nats streaming pods from within the k8s cluster. This will allow other pods using the stan.js client to connect to the Stan leader without the need for a list of Stan pod IPs when configuring the client.
Not really. Again, this is low-level NATS Server. What that does is that if a client connects to one server that is clustered (here server means either NATS Server or NATS Streaming Server in your case, since you run the embedded version), that server will return to the client library the list of URLs of the other servers, so that if the client were to be disconnected, it could try to reconnect to any of the URLs. Once you have a cluster of servers, it doesn't really matter which server a client connects to; servers will route the traffic appropriately. For sure, if a client connects directly to the Streaming active server, there is a bit of an advantage in reducing the number of hops the messages have to go through.
@demelvin I am so sorry, I led you the wrong way. What I described is the behavior of "client_advertise", not "cluster_advertise". Cluster advertise is similar but really for server routes. Suppose you have servers S1 and S2 with public IPs of PUB1 and PUB2 respectively, and private IPs PRIV1 and PRIV2. Then S1 connects to S2 and forms a cluster. Now say you want to add a server S3 to the cluster, but this server does not have access to the private IPs of S1 and S2. If S3 connects to, say, PUB2 (the public IP of S2), as part of the gossip protocol S2 will tell S3 to connect to S1 to form a full mesh. But S2 will send to S3 the IP of S1 based on the remote address it got when accepting the connection from S1. If that is the private IP, then when S3 tries to connect to S1, this may fail. The cluster_advertise is a way to provide a desired host:port that servers should gossip to each other. In the above example, if S1 had configured cluster_advertise with PUB1, then S2 would have sent that (and not the remote IP address) to S3, which then would have been able to create the connection.
To sum up, if you are able to configure a full mesh cluster without cluster_advertise, then you don't need to add that. Also, if your clients can resolve the IPs returned by each pod, then you don't need client_advertise either.
A simple way to test would be, from a machine where your app is normally running, to "telnet <host> <port>", say using the url info your client normally uses. You should receive in the telnet session an INFO block that will contain a field called connect_urls. If the IPs listed there are all valid and reachable from the client host, then you have nothing to do.
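The same check can be scripted; a small sketch using Node's net module, with the host a placeholder for whatever URL your client normally uses:

const net = require('net');

const socket = net.connect({ host: '<MY_INGRESS_IP>', port: 4222 });

socket.once('data', (chunk) => {
  // The server greets every connection with a single line: INFO {...json...}\r\n
  const line = chunk.toString().split('\r\n')[0];
  const info = JSON.parse(line.replace(/^INFO\s+/, ''));
  console.log('connect_urls:', info.connect_urls);
  socket.end();
});

socket.on('error', (err) => console.error(err.message));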
@kozlovic No reason to apologize, thank you for the explanation.
A simple way to test would be, from a machine where your app is normally running, to "telnet <host> <port>", say using the url info your client normally uses. You should receive in the telnet session an INFO block that will contain a field called connect_urls. If the IPs listed there are all valid and reachable from the client host, then you have nothing to do.
Thanks for the tip. I'll try and telnet and see if I can get some more information.
I updated my k8s deployment to use the ConfigMap as described in the docs last night and I was seeing the same behavior as previously described. Everything is set up exactly as described in the documentation, with the exception of SQL store vs file store:
NOTE I'm killing my pods and they fail a health check, so they are restarted and will continue to be until they are able to connect to NATS.
Are there any specific stan.js client options I should be providing in order to ensure that the client will connect/re-connect to the leader when using FT mode with NATS? Unfortunately I'm still having connection issues and I'm afraid I've exhausted all my options. I'm wondering if this is just a behavior when running an embedded NATS server in conjunction with NATS Streaming.
NOTE The client I'm testing with is local and I'm connecting to a k8s cluster, which means it does not have access to the internal IPs of the servers running stan. Regardless of that, I'm seeing the same type of behavior on servers (pods) running inside of the cluster, which tells me something is up, as those servers/pods do have access to those IPs.
I'm thinking I might just take a different route and go with an HA setup if I cannot get the client to work with fault tolerance. I want to be mindful of your time, so at this point if you feel it's better that I just move on I completely understand.
Thanks again.
I am a bit confused when you say "Server": do you mean the NATS server or your server app? Again, if it is one of your apps that cannot reconnect, make sure this is not because of the IP it gets. Alternatively, provide the 3 pods' public URLs to your app so they can use those to try to reconnect. Also check your Streaming servers' logs and make sure that:
The HA setup would possibly have the same issues with the IPs from a client perspective. Also, if you go the HA route, make sure that each node has its own SQL DB store, because unlike with FT (where storage is shared), in clustering mode each server needs to have its own storage (RAFT + streaming stores).
Are there any specific stan.js client options I should be providing in order to ensure that the client will connect/re-connect to the leader when using FT mode with NATS? Unfortunately I'm still having connection issues and I'm afraid I've exhausted all my options. I'm wondering if this is just a behavior when running an embedded NATS server in conjunction with NATS Streaming.
If the NATS host:ports are stable on the different servers, there should be nothing to do; when the cluster changes, the client learns about it and retries. Just to be clear, your maxReconnectAttempts option should be the default (-1).
You can also specify all the nats server host:ports on the client (servers property).
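Put together, a minimal sketch of those two suggestions in the client options; the host names are placeholders for however the three pods are reachable from the app:

const stan = require('node-nats-streaming');

const sc = stan.connect('test-cluster', 'test-client', {
  // Explicit list of NATS host:ports, so reconnects do not depend on the
  // URLs gossiped back by the servers.
  servers: [
    'nats://stan-0.stan:4222',
    'nats://stan-1.stan:4222',
    'nats://stan-2.stan:4222',
  ],
  maxReconnectAttempts: -1, // the default: never give up reconnecting
  waitOnFirstConnect: true,
});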
Also - are any of these localhost or 127.0.0.1 hosts?
Now - while it should be perfectly fine for you to run the NATS server from the embedded nats-streaming-server, have you tried separating the two clusters? If a stan server fails, it shouldn't take out the nats connection.
I'm all set. I was finally able to get this to work properly and with fault tolerance mode.
Just to be clear, your maxReconnectAttempts option should be the default (-1).
This helped with the reconnect. No more dropped connections when killing pods.
Lastly, I was doing something really dumb. I was connecting to the server using the http protocol and not the nats protocol. Once I updated the url to nats://<MY_INGRESS_IP>:4222, boom, everything works. It's always the little things that are overlooked and this was it.
Regardless, thank you both again for your help. I learned a lot about NATS and NATS Streaming during this process and I'm looking forward to pushing this out to production.
@demelvin current clients have started to support bare host:port for the URL - this makes sense (with the exception that the url option is a misnomer). This may make it simpler and prevent that type of error.
Hi all,
I'm having an issue with the clients ability to establish an initial connection when running in fault tolerance (FT) mode within a k8s cluster.
Behavior
Result
- error event fires with the following message:
- close event fires
- process.exit() is called, the server restarts over and over again, repeating the above steps until a connection can be established.

Expected Result
Client connects the first time without having to restart multiple times until a connection is established.
Nats Streaming Version
Docker Image: nats-streaming:latest (Embedded Nats Server)

Client Version
latest (v0.3.3-0)
I've been combing through the docs, github issues, as well as experimenting with various settings on both the deployment and the client. Any ideas on how to solve this?
Thank you in advance for your help.