nats-io / nats-streaming-operator

NATS Streaming Operator
Apache License 2.0

Nats streaming cluster failing on kubernetes #72

Open vsadriano opened 4 years ago

vsadriano commented 4 years ago

I'm trying to set up a NATS Streaming cluster with three nodes on a local Kubernetes installation, following the operator docs, and I'm getting continuous connection failures from some pods.

As you can see, stan-cluster-poc-1 became the cluster leader.

[1] 2020/07/06 19:25:28.972786 [INF] STREAM: Starting nats-streaming-server[stan-cluster-poc] version 0.18.0
[1] 2020/07/06 19:25:28.972914 [INF] STREAM: ServerID: S2CN7Xkm9RPjaSmk17QhUu
[1] 2020/07/06 19:25:28.972921 [INF] STREAM: Go version: go1.14.4
[1] 2020/07/06 19:25:28.972924 [INF] STREAM: Git commit: [026e3a6]
[1] 2020/07/06 19:25:29.076044 [INF] STREAM: Recovering the state...
[1] 2020/07/06 19:25:29.086609 [INF] STREAM: No recovered state
[1] 2020/07/06 19:25:29.092785 [INF] STREAM: Cluster Node ID : "stan-cluster-poc-1"
[1] 2020/07/06 19:25:29.092886 [INF] STREAM: Cluster Log Path: /persistence/stan/raft/stan-cluster-poc-1
[1] 2020/07/06 19:25:29.152058 [INF] STREAM: raft: initial configuration: index=0 servers=[]
[1] 2020/07/06 19:25:29.153059 [INF] STREAM: raft: entering follower state: follower="Node at stan-cluster-poc."stan-cluster-poc-1".stan-cluster-poc [Follower]" leader=
[1] 2020/07/06 19:25:29.158236 [DBG] STREAM: Bootstrapping Raft group stan-cluster-poc as seed node
[1] 2020/07/06 19:25:29.166935 [DBG] STREAM: Discover subject:           _STAN.discover.stan-cluster-poc
[1] 2020/07/06 19:25:29.166977 [DBG] STREAM: Publish subject:            _STAN.pub.stan-cluster-poc.>
[1] 2020/07/06 19:25:29.166982 [DBG] STREAM: Subscribe subject:          _STAN.sub.stan-cluster-poc
[1] 2020/07/06 19:25:29.166985 [DBG] STREAM: Subscription Close subject: _STAN.subclose.stan-cluster-poc
[1] 2020/07/06 19:25:29.166988 [DBG] STREAM: Unsubscribe subject:        _STAN.unsub.stan-cluster-poc
[1] 2020/07/06 19:25:29.166991 [DBG] STREAM: Close subject:              _STAN.close.stan-cluster-poc
[1] 2020/07/06 19:25:29.170852 [INF] STREAM: Message store is RAFT_FILE
[1] 2020/07/06 19:25:29.171036 [INF] STREAM: Store location: /persistence/stan/stan-cluster-poc-1
[1] 2020/07/06 19:25:29.171376 [INF] STREAM: ---------- Store Limits ----------
[1] 2020/07/06 19:25:29.171513 [INF] STREAM: Channels:                  100 *
[1] 2020/07/06 19:25:29.171580 [INF] STREAM: --------- Channels Limits --------
[1] 2020/07/06 19:25:29.172067 [INF] STREAM:   Subscriptions:          1000 *
[1] 2020/07/06 19:25:29.172346 [INF] STREAM:   Messages     :       1000000 *
[1] 2020/07/06 19:25:29.172706 [INF] STREAM:   Bytes        :     976.56 MB *
[1] 2020/07/06 19:25:29.172915 [INF] STREAM:   Age          :     unlimited *
[1] 2020/07/06 19:25:29.173035 [INF] STREAM:   Inactivity   :     unlimited *
[1] 2020/07/06 19:25:29.173215 [INF] STREAM: ----------------------------------
[1] 2020/07/06 19:25:33.103290 [WRN] STREAM: raft: heartbeat timeout reached, starting election: last-leader=
[1] 2020/07/06 19:25:33.103356 [INF] STREAM: raft: entering candidate state: node="Node at stan-cluster-poc."stan-cluster-poc-1".stan-cluster-poc [Candidate]" term=2
[1] 2020/07/06 19:25:33.121018 [DBG] STREAM: raft: votes: needed=1
[1] 2020/07/06 19:25:33.121087 [DBG] STREAM: raft: vote granted: from="stan-cluster-poc-1" term=2 tally=1
[1] 2020/07/06 19:25:33.121244 [INF] STREAM: raft: election won: tally=1
[1] 2020/07/06 19:25:33.121276 [INF] STREAM: raft: entering leader state: leader="Node at stan-cluster-poc."stan-cluster-poc-1".stan-cluster-poc [Leader]"
[1] 2020/07/06 19:25:33.121463 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2020/07/06 19:25:33.147520 [INF] STREAM: finished leader promotion actions
[1] 2020/07/06 19:25:33.147612 [INF] STREAM: Streaming Server is ready
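
For context on the `votes: needed=1` line in the log above: a Raft candidate wins an election once it holds a majority of the voters in its current configuration. The snippet below is a minimal sketch of that arithmetic only, not code taken from nats-streaming-server; the function name `votes_needed` is hypothetical.

```python
def votes_needed(voters: int) -> int:
    """Return the majority quorum for a Raft group with `voters` voting members."""
    if voters < 1:
        raise ValueError("a Raft group needs at least one voter")
    return voters // 2 + 1


# Show the quorum for small cluster sizes.
for n in (1, 2, 3):
    print(f"{n} voter(s) -> {votes_needed(n)} vote(s) needed")
```

Under this rule, `needed=1` corresponds to a configuration with a single voter, which would be consistent with the node bootstrapping as the seed before the other nodes had joined. A three-voter configuration would need two votes.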

However, it fails to establish a connection with one or more nodes, depending on the number of nodes in the cluster. For example:

[1] 2020/07/06 19:34:45.125722 [WRN] STREAM: raft: failed to contact: server-id="stan-cluster-poc-3" time=1.000833233s
[1] 2020/07/06 19:34:46.071788 [WRN] STREAM: raft: failed to contact: server-id="stan-cluster-poc-3" time=1.946740647s
[1] 2020/07/06 19:34:46.413845 [ERR] STREAM: raft: failed to heartbeat to: peer=stan-cluster-poc."stan-cluster-poc-3".stan-cluster-poc error="nats: timeout"
[1] 2020/07/06 19:34:54.349017 [ERR] STREAM: raft: failed to appendEntries to: peer="{Voter "stan-cluster-poc-3" stan-cluster-poc."stan-cluster-poc-3".stan-cluster-poc}" error="natslog: read timeout"
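
To see which peers the leader cannot reach, and how often, the failure lines above can be tallied per node. This is a rough sketch rather than an official tool: the function name `tally_raft_failures` is hypothetical, and the regular expression assumes node IDs are quoted and end in `-<number>`, as they do in these logs.

```python
import re
from collections import Counter

# Matches a quoted node ID such as "stan-cluster-poc-3".
# Assumption: node IDs end in "-<number>", as in the logs above.
NODE_ID = re.compile(r'"([A-Za-z0-9-]+-\d+)"')


def tally_raft_failures(lines):
    """Count raft 'failed to ...' log lines per quoted node ID."""
    counts = Counter()
    for line in lines:
        if "raft: failed to" not in line:
            continue  # not a raft failure line
        match = NODE_ID.search(line)
        if match:  # skip failure lines that name no node
            counts[match.group(1)] += 1
    return counts


sample = [
    '[1] 2020/07/06 19:34:45.125722 [WRN] STREAM: raft: failed to contact: '
    'server-id="stan-cluster-poc-3" time=1.000833233s',
    '[1] 2020/07/06 19:34:46.413845 [ERR] STREAM: raft: failed to heartbeat to: '
    'peer=stan-cluster-poc."stan-cluster-poc-3".stan-cluster-poc error="nats: timeout"',
    '[1] 2020/07/06 19:25:33.147612 [INF] STREAM: Streaming Server is ready',
]
print(tally_raft_failures(sample))
```

The same function could be fed the leader pod's log output line by line, for example by iterating over `sys.stdin` after piping the output of `kubectl logs` into the script.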

On stan-cluster-poc-2 I received the warning below:

[1] 2020/07/06 19:25:33.240858 [WRN] STREAM: raft: failed to get previous log: previous-index=4 last-index=0 error="log not found"

On stan-cluster-poc-3 I received the warning below:

[1] 2020/07/06 19:25:34.340570 [WRN] STREAM: raft: failed to get previous log: previous-index=5 last-index=0 error="log not found"

Context

OSX version 10.14.6
docker 19.03.8
kubernetes version 1.16.5
persistent volume with storageClassName “local-storage” and ReadWriteOnce mode
nats operator 0.7.2
nats-server version 2.1.7
nats streaming operator 0.3.0-v1alpha1
nats-streaming-server version 0.18.0