nats-io / nats-streaming-operator

NATS Streaming Operator
Apache License 2.0

not stable #4

Closed YoninL closed 5 years ago

YoninL commented 5 years ago

Kubernetes: v1.9.5, nats-operator: latest, nats-streaming-operator: latest

After starting the NATS cluster as described in the README, everything seems fine. But when I start 3 STAN pods with the command below:

echo '
---
apiVersion: "streaming.nats.io/v1alpha1"
kind: "NatsStreamingCluster"
metadata:
  name: "example-stan"
spec:
  size: 3
  natsSvc: "example-nats"
' | kubectl -n nats-io apply -f -
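Before applying the manifest, it may help to confirm that the service named in natsSvc exists and that the NATS pods behind it are running, since the streaming pods connect through that service. A quick check, assuming the nats-io namespace and example-nats names from the manifest above (the nats_cluster label is the one nats-operator applies to its pods; adjust if yours differ):

```shell
# Confirm the NATS service the streaming pods will target exists
kubectl -n nats-io get svc example-nats

# Confirm the NATS cluster pods are Running and Ready
kubectl -n nats-io get pods -l nats_cluster=example-nats
```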

none of the three pods can connect to the NATS cluster; the error message looks like:

[1] 2018/08/02 06:52:48.194822 [INF] STREAM: Starting nats-streaming-server[example-stan] version 0.10.2
[1] 2018/08/02 06:52:48.194920 [INF] STREAM: ServerID: 5bQLEaI0IxgiQ8GbWYkIzr
[1] 2018/08/02 06:52:48.194927 [INF] STREAM: Go version: go1.10.3
[1] 2018/08/02 06:52:50.195774 [INF] STREAM: Shutting down.
[1] 2018/08/02 06:52:50.195843 [FTL] STREAM: Failed to start: nats: no servers available for connection

Sometimes one pod fails with:

[1] 2018/08/02 06:52:55.956946 [INF] STREAM: Starting nats-streaming-server[example-stan] version 0.10.2
[1] 2018/08/02 06:52:55.956982 [INF] STREAM: ServerID: qwzK7kMOw6Dxz91FiutahR
[1] 2018/08/02 06:52:55.956988 [INF] STREAM: Go version: go1.10.3
[1] 2018/08/02 06:52:55.974620 [INF] STREAM: Recovering the state...
[1] 2018/08/02 06:52:55.974827 [INF] STREAM: No recovered state
[1] 2018/08/02 06:52:55.974921 [INF] STREAM: Cluster Node ID : qwzK7kMOw6Dxz91Fiutald
[1] 2018/08/02 06:52:55.974928 [INF] STREAM: Cluster Log Path: example-stan/qwzK7kMOw6Dxz91Fiutald
[1] 2018/08/02 06:53:01.082008 [INF] STREAM: Shutting down.
[1] 2018/08/02 06:53:01.082246 [FTL] STREAM: Failed to start: failed to join Raft group example-stan

If I start the STAN pods one by one, sometimes only the first can connect, sometimes two.
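The "no servers available for connection" error usually means the streaming server could not reach NATS at the configured service. One way to narrow this down is to check, from inside the cluster, whether the service DNS name resolves at all, and to inspect one of the failing pods. A sketch, assuming a pullable busybox image and that the streaming pods are named example-stan-N (the actual pod names may differ):

```shell
# Check that the example-nats service name resolves inside the cluster
kubectl -n nats-io run dns-test --image=busybox --rm -it --restart=Never \
  -- nslookup example-nats

# Inspect one of the failing streaming pods (pod name is an example)
kubectl -n nats-io logs example-stan-1
kubectl -n nats-io describe pod example-stan-1
```

If the nslookup fails, the problem is cluster DNS or the service itself rather than the streaming operator.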

YoninL commented 5 years ago

I noticed that the nats-operator restarted a few times.

nats-operator-bb6856fbd-tm2fl              0/1       CrashLoopBackOff    5          39m

The logs are:

time="2018-08-02T07:06:29Z" level=info msg="nats-operator Version: 0.2.3-v1alpha2+git"
time="2018-08-02T07:06:29Z" level=info msg="Git SHA: d88048a"
time="2018-08-02T07:06:29Z" level=info msg="Go Version: go1.9"
time="2018-08-02T07:06:29Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-08-02T07:06:29Z" level=info msg="Event(v1.ObjectReference{Kind:\"Endpoints\", Namespace:\"nats-io\", Name:\"nats-operator\", UID:\"4677238c-961b-11e8-9e22-0e8d5bae827c\", APIVersion:\"v1\", ResourceVersion:\"47875736\", FieldPath:\"\"}): type: 'Normal' reason: 'LeaderElection' nats-operator-bb6856fbd-tm2fl became leader"
time="2018-08-02T07:06:29Z" level=info msg="finding existing clusters..." pkg=controller
time="2018-08-02T07:06:29Z" level=info msg="starts running from watch version: 47875737" pkg=controller
time="2018-08-02T07:06:29Z" level=info msg="start running..." cluster-name=example-nats pkg=cluster
time="2018-08-02T07:06:29Z" level=info msg="start watching at 47875737" pkg=controller
time="2018-08-02T07:07:45Z" level=info msg="apiserver closed watch stream, retrying after 5s..." pkg=controller
time="2018-08-02T07:07:50Z" level=info msg="start watching at 47875737" pkg=controller
time="2018-08-02T07:09:35Z" level=info msg="apiserver closed watch stream, retrying after 5s..." pkg=controller
time="2018-08-02T07:09:40Z" level=info msg="start watching at 47875737" pkg=controller
time="2018-08-02T07:11:34Z" level=info msg="apiserver closed watch stream, retrying after 5s..." pkg=controller
time="2018-08-02T07:11:39Z" level=info msg="start watching at 47875737" pkg=controller
time="2018-08-02T07:12:42Z" level=info msg="apiserver closed watch stream, retrying after 5s..." pkg=controller
time="2018-08-02T07:12:47Z" level=info msg="start watching at 47875737" pkg=controller
time="2018-08-02T07:13:58Z" level=info msg="apiserver closed watch stream, retrying after 5s..." pkg=controller
time="2018-08-02T07:14:03Z" level=info msg="start watching at 47875737" pkg=controller
time="2018-08-02T07:15:05Z" level=info msg="apiserver closed watch stream, retrying after 5s..." pkg=controller
E0802 07:15:05.734121       1 leaderelection.go:240] error retrieving resource lock nats-io/nats-operator: Get https://10.235.0.1:443/api/v1/namespaces/nats-io/endpoints/nats-operator: unexpected EOF
time="2018-08-02T07:15:10Z" level=info msg="start watching at 47875737" pkg=controller
time="2018-08-02T07:15:10Z" level=info msg="start watching at 47883860" pkg=controller
E0802 07:15:23.001413       1 leaderelection.go:274] Failed to update lock: Put https://10.235.0.1:443/api/v1/namespaces/nats-io/endpoints/nats-operator: read tcp 10.235.67.50:56802->10.235.0.1:443: read: connection reset by peer
time="2018-08-02T07:15:23Z" level=info msg="received signal: broken pipe"
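Since the operator pod is in CrashLoopBackOff, the logs above may only cover the current instance. The --previous flag retrieves output from the last crashed container, which may show what triggered the earlier restarts (pod name taken from the listing above):

```shell
# Show restart count and the last terminated state of the operator pod
kubectl -n nats-io describe pod nats-operator-bb6856fbd-tm2fl

# Fetch logs from the previously crashed container instance
kubectl -n nats-io logs nats-operator-bb6856fbd-tm2fl --previous
```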
wallyqs commented 5 years ago

Thanks. Could you post some of the logs and the kubectl describe output from the nats-streaming-operator, if possible? I'll take a look.

YoninL commented 5 years ago

NATS Streaming Operator logs:

time="2018-08-02T06:33:49Z" level=info msg="Starting NATS Streaming Operator v0.1.0"
time="2018-08-02T06:33:49Z" level=info msg="Go Version: go1.9"
time="2018-08-02T06:33:49Z" level=info msg="operator-sdk Version: 0.0.5+git"
time="2018-08-02T06:33:49Z" level=info msg="Watching streaming.nats.io/v1alpha1, NatsStreamingCluster, nats-io, 5"
time="2018-08-02T06:34:06Z" level=info msg="Missing pods for \"example-stan\" cluster (0/3), creating 3 pods..."
time="2018-08-02T06:34:11Z" level=info msg="Missing pods for \"example-stan\" cluster (0/3), creating 3 pods..."
time="2018-08-02T06:34:16Z" level=info msg="Missing pods for \"example-stan\" cluster (0/3), creating 3 pods..."
time="2018-08-02T06:34:21Z" level=info msg="Missing pods for \"example-stan\" cluster (0/3), creating 3 pods..."
time="2018-08-02T06:34:26Z" level=info msg="Missing pods for \"example-stan\" cluster (0/3), creating 3 pods..."
time="2018-08-02T06:34:31Z" level=info msg="Missing pods for \"example-stan\" cluster (0/3), creating 3 pods..."
time="2018-08-02T06:34:36Z" level=info msg="Missing pods for \"example-stan\" cluster (0/3), creating 3 pods..."
time="2018-08-02T06:34:41Z" level=info msg="Missing pods for \"example-stan\" cluster (0/3), creating 3 pods..."
time="2018-08-02T06:34:46Z" level=info msg="Missing pods for \"example-stan\" cluster (0/3), creating 3 pods..."
time="2018-08-02T06:34:51Z" level=info msg="Missing pods for \"example-stan\" cluster (0/3), creating 3 pods..."
time="2018-08-02T06:34:56Z" level=info msg="Missing pods for \"example-stan\" cluster (0/3), creating 3 pods..."
time="2018-08-02T06:35:01Z" level=info msg="Missing pods for \"example-stan\" cluster (1/3), creating 2 pods..."
time="2018-08-02T06:35:06Z" level=info msg="Missing pods for \"example-stan\" cluster (1/3), creating 2 pods..."
time="2018-08-02T06:38:07Z" level=info msg="Missing pods for \"example-stan\" cluster (0/3), creating 3 pods..."
time="2018-08-02T06:38:11Z" level=info msg="Missing pods for \"example-stan\" cluster (1/3), creating 2 pods..."
time="2018-08-02T06:39:45Z" level=info msg="Missing pods for \"example-stan\" cluster (0/3), creating 3 pods..."
time="2018-08-02T06:39:46Z" level=info msg="Missing pods for \"example-stan\" cluster (1/3), creating 2 pods..."
time="2018-08-02T06:41:41Z" level=info msg="Missing pods for \"example-stan\" cluster (1/3), creating 2 pods..."
time="2018-08-02T06:42:51Z" level=info msg="Missing pods for \"example-stan\" cluster (2/3), creating 1 pods..."
time="2018-08-02T06:48:46Z" level=info msg="Too many pods for \"example-stan\" cluster (3/2), removing 1 pods..."
time="2018-08-02T06:49:12Z" level=info msg="Too many pods for \"example-stan\" cluster (2/1), removing 1 pods..."
time="2018-08-02T06:49:26Z" level=info msg="Missing pods for \"example-stan\" cluster (1/2), creating 1 pods..."
time="2018-08-02T06:50:09Z" level=info msg="Missing pods for \"example-stan\" cluster (2/3), creating 1 pods..."
time="2018-08-02T06:50:26Z" level=info msg="Missing pods for \"example-stan\" cluster (2/3), creating 1 pods..."
time="2018-08-02T06:52:46Z" level=info msg="Missing pods for \"example-stan\" cluster (0/3), creating 3 pods..."
time="2018-08-02T06:52:53Z" level=info msg="Missing pods for \"example-stan\" cluster (2/3), creating 1 pods..."
time="2018-08-02T06:54:01Z" level=info msg="Missing pods for \"example-stan\" cluster (2/3), creating 1 pods..."

kubectl describe result:

Name:           nats-streaming-operator-7698769f5f-psthg
Namespace:      nats-io
Node:           ip-10-250-81-178.ec2.internal/10.250.81.178
Start Time:     Thu, 02 Aug 2018 06:33:47 +0000
Labels:         name=nats-streaming-operator
                pod-template-hash=3254325919
Annotations:    cilium.io/identity=46229
Status:         Running
IP:             10.235.70.74
Controlled By:  ReplicaSet/nats-streaming-operator-7698769f5f
Containers:
  nats-streaming-operator:
    Container ID:   docker://5d0c3ac2b35f23871394a7f3ad41661cce1ce2874d9c877dc218d9f8c6843121
    Image:          synadia/nats-streaming-operator:v0.1.0-v1alpha1
    Image ID:       docker-pullable://synadia/nats-streaming-operator@sha256:c514ca725db9a3f09b8a23b4d5ddcce298d8c81132088be8d1898d216442e93e
    Port:           <none>
    State:          Running
      Started:      Thu, 02 Aug 2018 06:33:49 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      MY_POD_NAMESPACE:  nats-io (v1:metadata.namespace)
      WATCH_NAMESPACE:   nats-io (v1:metadata.namespace)
      MY_POD_NAME:       nats-streaming-operator-7698769f5f-psthg (v1:metadata.name)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from nats-streaming-operator-token-669hd (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  nats-streaming-operator-token-669hd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nats-streaming-operator-token-669hd
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:          <none>