nats-io / nats-streaming-operator

NATS Streaming Operator
Apache License 2.0
174 stars 44 forks source link

`STREAM: Failed to start: failed to join Raft group` after updating kubernetes version #44

Open tahsinrahman opened 5 years ago

tahsinrahman commented 5 years ago

we're running nats streaming operator in gke, it was running fine. But after we updated kubernetes version. nats clients are not able to connect to nats server.

$ kubectl get natsclusters.nats.io nats-cluster -o yaml
apiVersion: nats.io/v1alpha2
kind: NatsCluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"nats.io/v1alpha2","kind":"NatsCluster","metadata":{"annotations":{},"name":"nats-cluster","namespace":"default"},"spec":{"size":3}}
  creationTimestamp: "2019-05-09T05:56:01Z"
  generation: 4
  name: nats-cluster
  namespace: default
  resourceVersion: "34972339"
  selfLink: /apis/nats.io/v1alpha2/namespaces/default/natsclusters/nats-cluster
  uid: 1f98955b-721f-11e9-b8dc-42010a8001a9
spec:
  size: 3
status:
  conditions:
  - reason: scaling cluster from 2 to 3 peers
    transitionTime: "2019-06-26T11:04:40Z"
    type: ScalingUp
  - reason: current state matches desired state
    transitionTime: "2019-06-26T11:05:53Z"
    type: Ready
  - reason: scaling cluster from 2 to 3 peers
    transitionTime: "2019-06-26T11:05:54Z"
    type: ScalingUp
  - reason: current state matches desired state
    transitionTime: "2019-06-26T11:06:06Z"
    type: Ready
  - reason: scaling cluster from 2 to 3 peers
    transitionTime: "2019-06-26T19:47:43Z"
    type: ScalingUp
  - reason: current state matches desired state
    transitionTime: "2019-06-26T19:48:12Z"
    type: Ready
  - reason: scaling cluster from 1 to 3 peers
    transitionTime: "2019-06-26T19:51:14Z"
    type: ScalingUp
  - reason: current state matches desired state
    transitionTime: "2019-06-26T19:52:07Z"
    type: Ready
  - reason: scaling cluster from 2 to 3 peers
    transitionTime: "2019-06-26T19:57:13Z"
    type: ScalingUp
  - reason: current state matches desired state
    transitionTime: "2019-06-26T19:57:38Z"
    type: Ready
  currentVersion: 1.4.0
  size: 3
$ kubectl get natsstreamingclusters.streaming.nats.io pharmer-cluster -o yaml
apiVersion: streaming.nats.io/v1alpha1
kind: NatsStreamingCluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"streaming.nats.io/v1alpha1","kind":"NatsStreamingCluster","metadata":{"annotations":{},"name":"pharmer-cluster","namespace":"default"},"spec":{"natsSvc":"nats-cluster","size":3}}
  creationTimestamp: "2019-05-09T05:58:23Z"
  generation: 1
  name: pharmer-cluster
  namespace: default
  resourceVersion: "19344769"
  selfLink: /apis/streaming.nats.io/v1alpha1/namespaces/default/natsstreamingclusters/pharmer-cluster
  uid: 747d5286-721f-11e9-b8dc-42010a8001a9
spec:
  natsSvc: nats-cluster
  size: 3
$ kubectl logs -f pharmer-cluster-1
[1] 2019/06/27 04:43:47.580129 [INF] STREAM: Starting nats-streaming-server[pharmer-cluster] version 0.11.2
[1] 2019/06/27 04:43:47.580191 [INF] STREAM: ServerID: ul5c5zg9XBBSq3PMBuRn4j
[1] 2019/06/27 04:43:47.580196 [INF] STREAM: Go version: go1.11.1
[1] 2019/06/27 04:43:47.604905 [INF] STREAM: Recovering the state...
[1] 2019/06/27 04:43:47.605177 [INF] STREAM: No recovered state
[1] 2019/06/27 04:43:47.605272 [INF] STREAM: Cluster Node ID : "pharmer-cluster-1"
[1] 2019/06/27 04:43:47.605287 [INF] STREAM: Cluster Log Path: pharmer-cluster/"pharmer-cluster-1"
[1] 2019/06/27 04:43:52.714067 [INF] STREAM: Shutting down.
[1] 2019/06/27 04:43:52.714455 [FTL] STREAM: Failed to start: failed to join Raft group pharmer-cluster

From nats client

$ kubectl logs -f pharmer-clsuter-868bd8d465-xwv77
stan: connect request timeout
$ kubectl logs -f nats-cluster-1                                                                                                                                                                             
[1] 2019/06/26 19:51:55.159570 [INF] Starting nats-server version 1.4.0                                                                                                                                             
[1] 2019/06/26 19:51:55.159628 [INF] Git commit [ce2df36]                                                                                                                                                           
[1] 2019/06/26 19:51:55.159856 [INF] Starting http monitor on 0.0.0.0:8222                                                                                                                                          
[1] 2019/06/26 19:51:55.159907 [INF] Listening for client connections on 0.0.0.0:4222                                                                                                                               
[1] 2019/06/26 19:51:55.159915 [INF] Server is ready                                                                                                                                                                
[1] 2019/06/26 19:51:55.160129 [INF] Listening for route connections on 0.0.0.0:6222                                                                                                                                
[1] 2019/06/26 19:51:55.169644 [INF] 10.60.3.20:6222 - rid:1 - Route connection created                                                                              
[1] 2019/06/26 19:51:55.173249 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-1.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:51:55.173271 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:51:56.183583 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:51:56.183612 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-1.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:51:56.565355 [INF] 10.60.3.20:52526 - rid:2 - Route connection created
[1] 2019/06/26 19:51:57.190034 [INF] 10.60.1.16:53246 - rid:4 - Route connection created
[1] 2019/06/26 19:51:57.190107 [INF] 10.60.1.16:6222 - rid:3 - Route connection created
[1] 2019/06/26 19:51:57.195326 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:51:58.207202 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:51:59.227256 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:00.240231 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:01.251089 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:02.262939 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:03.273854 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:03.409542 [INF] 10.60.1.19:51194 - rid:5 - Route connection created
[1] 2019/06/26 19:52:04.291668 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:05.305117 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:06.348944 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-3.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:52:07.360903 [INF] 10.60.1.19:6222 - rid:10 - Route connection created
[1] 2019/06/26 19:55:57.369605 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:55:59.370067 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:01.370584 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:03.371988 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:05.372566 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:07.373118 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:09.373677 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:11.374126 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:13.374572 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:15.375083 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:17.375585 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:19.376117 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:21.376767 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:23.377261 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:25.377780 [ERR] Error trying to connect to route: dial tcp 10.60.3.20:6222: i/o timeout
[1] 2019/06/26 19:56:26.384664 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:27.390153 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:28.397100 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:29.402200 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:30.407839 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:31.417194 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:32.423944 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:33.429647 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:34.436051 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:35.441762 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:36.452196 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
[1] 2019/06/26 19:56:37.458081 [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.default.svc on 10.63.240.10:53: no such host
tahsinrahman commented 5 years ago

If i scale down nats streaming cluster to size 1, the pod pharmer-cluster-1 runs.

$ kubectl logs -f pharmer-cluster-1
[1] 2019/06/27 05:30:26.817748 [INF] STREAM: Starting nats-streaming-server[pharmer-cluster] version 0.12.2
[1] 2019/06/27 05:30:26.817803 [INF] STREAM: ServerID: pleessm07DDglxq2tYFk5b
[1] 2019/06/27 05:30:26.817808 [INF] STREAM: Go version: go1.11.6
[1] 2019/06/27 05:30:26.817811 [INF] STREAM: Git commit: [4489c46]
[1] 2019/06/27 05:30:26.835108 [INF] STREAM: Recovering the state...
[1] 2019/06/27 05:30:26.835336 [INF] STREAM: No recovered state
[1] 2019/06/27 05:30:27.086859 [INF] STREAM: Message store is FILE
[1] 2019/06/27 05:30:27.086882 [INF] STREAM: Store location: store
[1] 2019/06/27 05:30:27.086941 [INF] STREAM: ---------- Store Limits ----------
[1] 2019/06/27 05:30:27.086945 [INF] STREAM: Channels:                  100 *
[1] 2019/06/27 05:30:27.086948 [INF] STREAM: --------- Channels Limits --------
[1] 2019/06/27 05:30:27.086951 [INF] STREAM:   Subscriptions:          1000 *
[1] 2019/06/27 05:30:27.086954 [INF] STREAM:   Messages     :       1000000 *
[1] 2019/06/27 05:30:27.086956 [INF] STREAM:   Bytes        :     976.56 MB *
[1] 2019/06/27 05:30:27.086959 [INF] STREAM:   Age          :     unlimited *
[1] 2019/06/27 05:30:27.086962 [INF] STREAM:   Inactivity   :     unlimited *
[1] 2019/06/27 05:30:27.086964 [INF] STREAM: ----------------------------------
[1] 2019/06/27 05:31:04.613144 [INF] STREAM: Channel "create-cluster" has been created
[1] 2019/06/27 05:31:04.616178 [INF] STREAM: Channel "delete-cluster" has been created
[1] 2019/06/27 05:31:04.618912 [INF] STREAM: Channel "retry-cluster" has been created

but when i scale up to size 2/3, the first pod remains running, but other pods fails

$ kubectl logs -f pharmer-cluster-2
[1] 2019/06/27 05:39:05.591052 [INF] STREAM: Starting nats-streaming-server[pharmer-cluster] version 0.12.2
[1] 2019/06/27 05:39:05.591103 [INF] STREAM: ServerID: EZj3gVhakNMkk7k81uEP9K
[1] 2019/06/27 05:39:05.591107 [INF] STREAM: Go version: go1.11.6
[1] 2019/06/27 05:39:05.591110 [INF] STREAM: Git commit: [4489c46]
[1] 2019/06/27 05:39:05.610442 [INF] STREAM: Recovering the state...
[1] 2019/06/27 05:39:05.610731 [INF] STREAM: No recovered state
[1] 2019/06/27 05:39:05.610840 [INF] STREAM: Cluster Node ID : "pharmer-cluster-2"
[1] 2019/06/27 05:39:05.610846 [INF] STREAM: Cluster Log Path: pharmer-cluster/"pharmer-cluster-2"
[1] 2019/06/27 05:39:10.720020 [INF] STREAM: Shutting down.
[1] 2019/06/27 05:39:10.720534 [FTL] STREAM: Failed to start: failed to join Raft group pharmer-cluster
VFe commented 4 years ago

Having this exact same issue on EKS, except it happens without any upgrade. What kubernetes version did you upgrade to?(I'm on 1.12 experiencing this)

tahsinrahman commented 4 years ago

we upgraded to 1.13.6-gke.13

VFe commented 4 years ago

Been having this issue on 1.11, 1.12, and 1.13 in EKS. Been digging and not really sure what the root cause is.

charles-d-burton commented 4 years ago

I'm having this issue on 1.14 with K3S as well.

orefalo commented 4 years ago

same with k3s v1.17.4+k3s1