travelaudience / aerospike-operator

Manages Aerospike clusters atop Kubernetes, automating their creation and administration.
Apache License 2.0

Heartbeat connection to wrong Cluster IP when "replicationFactor: 2" #4

Open afshinpaydar opened 5 years ago

afshinpaydar commented 5 years ago

The second Aerospike node does not come up because of a heartbeat connectivity issue:

cat docs/examples/20-aerospike-cluster.yml 
apiVersion: aerospike.travelaudience.com/v1alpha2
kind: AerospikeCluster
metadata:
  name: as-cluster-0
spec:
  version: "4.2.0.10"
  nodeCount: 3
  namespaces:
  - name: as-namespace-0
    replicationFactor: 3
    memorySize: 1G
    defaultTTL: 0s
    storage:
      type: file
      size: 1G
      storageClassName: glusterfs-storage
oc get pod -o wide
NAME                                  READY     STATUS    RESTARTS   AGE       IP             NODE                   NOMINATED NODE
aerospike-operator-55c4c4fc58-vqgks   1/1       Running   0          33s       10.130.1.156   node2.soshyant.local   <none>
as-cluster-0-0                        2/2       Running   0          21s       10.129.1.176   node1.soshyant.local   <none>
as-cluster-0-1                        0/2       Pending   0          3s        <none>         <none>                 <none>
oc logs as-cluster-0-0 -c aerospike-server

Apr 20 2019 05:25:55 GMT: INFO (as): (as.c:372) initializing services...
Apr 20 2019 05:25:55 GMT: INFO (tsvc): (thr_tsvc.c:136) 4 transaction queues: starting 4 threads per queue
Apr 20 2019 05:25:55 GMT: INFO (fabric): (fabric.c:800) updated fabric published address list to {10.129.1.176:3001}
Apr 20 2019 05:25:55 GMT: INFO (partition): (partition_balance.c:196) {as-namespace-0} 4096 partitions: found 4096 absent, 0 stored
Apr 20 2019 05:25:55 GMT: INFO (hb): (hb.c:5490) updated heartbeat published address list to {10.129.1.176:3002}
Apr 20 2019 05:25:55 GMT: INFO (batch): (batch.c:732) starting 4 batch-index-threads
Apr 20 2019 05:25:55 GMT: INFO (batch): (thr_batch.c:373) starting 4 batch-threads
Apr 20 2019 05:25:55 GMT: INFO (fabric): (fabric.c:452) starting 8 fabric send threads
Apr 20 2019 05:25:55 GMT: INFO (fabric): (fabric.c:469) starting 16 fabric rw channel recv threads
Apr 20 2019 05:25:55 GMT: INFO (fabric): (fabric.c:469) starting 4 fabric ctrl channel recv threads
Apr 20 2019 05:25:55 GMT: INFO (fabric): (fabric.c:469) starting 4 fabric bulk channel recv threads
Apr 20 2019 05:25:55 GMT: INFO (fabric): (fabric.c:469) starting 4 fabric meta channel recv threads
Apr 20 2019 05:25:55 GMT: INFO (fabric): (fabric.c:475) starting fabric accept thread
Apr 20 2019 05:25:55 GMT: INFO (hb): (hb.c:7003) initializing mesh heartbeat socket: 0.0.0.0:3002
Apr 20 2019 05:25:55 GMT: INFO (fabric): (socket.c:702) Started fabric endpoint 0.0.0.0:3001
Apr 20 2019 05:25:55 GMT: INFO (hb): (hb.c:7032) mtu of the network is 1450
Apr 20 2019 05:25:55 GMT: INFO (hb): (socket.c:702) Started mesh heartbeat endpoint 0.0.0.0:3002
Apr 20 2019 05:25:55 GMT: INFO (nsup): (thr_nsup.c:1096) starting namespace supervisor threads
Apr 20 2019 05:25:55 GMT: INFO (demarshal): (thr_demarshal.c:886) starting 4 demarshal threads
Apr 20 2019 05:25:55 GMT: INFO (demarshal): (socket.c:702) Started client endpoint 0.0.0.0:3000
Apr 20 2019 05:25:55 GMT: INFO (info-port): (thr_info_port.c:300) starting info port thread
Apr 20 2019 05:25:55 GMT: INFO (info-port): (socket.c:702) Started info endpoint 0.0.0.0:3003
Apr 20 2019 05:25:55 GMT: INFO (as): (as.c:416) service ready: soon there will be cake!

.
.
.

Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:171) NODE-ID ae90b0f2b5b13701 CLUSTER-SIZE 1
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:247)    cluster-clock: skew-ms 0
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:277)    system-memory: free-kbytes 3251196 free-pct 40 heap-kbytes (1093695,1094224,1124352) heap-efficiency-pct 97.3
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:291)    in-progress: tsvc-q 0 info-q 0 nsup-delete-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:313)    fds: proto (0,87,87) heartbeat (0,0,0) fabric (0,0,0)
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:322)    heartbeat-received: self 0 foreign 0
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:353)    fabric-bytes-per-second: bulk (0,0) ctrl (0,0) meta (0,0) rw (0,0)
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:408) {as-namespace-0} objects: all 0 master 0 prole 0 non-replica 0
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:469) {as-namespace-0} migrations: complete
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:497) {as-namespace-0} memory-usage: total-bytes 0 index-bytes 0 sindex-bytes 0 used-pct 0.00
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:536) {as-namespace-0} device-usage: used-bytes 0 avail-pct 99 cache-read-pct 0.00
oc logs as-cluster-0-1 -c aerospike-server

Apr 20 2019 05:26:12 GMT: INFO (fabric): (fabric.c:469) starting 4 fabric ctrl channel recv threads
Apr 20 2019 05:26:12 GMT: INFO (fabric): (fabric.c:469) starting 4 fabric bulk channel recv threads
Apr 20 2019 05:26:12 GMT: INFO (fabric): (fabric.c:469) starting 4 fabric meta channel recv threads
Apr 20 2019 05:26:12 GMT: INFO (fabric): (fabric.c:475) starting fabric accept thread
Apr 20 2019 05:26:12 GMT: INFO (hb): (hb.c:7003) initializing mesh heartbeat socket: 0.0.0.0:3002
Apr 20 2019 05:26:12 GMT: INFO (fabric): (socket.c:702) Started fabric endpoint 0.0.0.0:3001
Apr 20 2019 05:26:12 GMT: INFO (hb): (hb.c:7032) mtu of the network is 1450
Apr 20 2019 05:26:12 GMT: INFO (hb): (socket.c:702) Started mesh heartbeat endpoint 0.0.0.0:3002
Apr 20 2019 05:26:12 GMT: INFO (nsup): (thr_nsup.c:1096) starting namespace supervisor threads
Apr 20 2019 05:26:12 GMT: INFO (demarshal): (thr_demarshal.c:886) starting 4 demarshal threads
Apr 20 2019 05:26:12 GMT: WARNING (socket): (socket.c:746) Timeout while connecting
Apr 20 2019 05:26:12 GMT: WARNING (socket): (socket.c:814) Error while connecting socket to 172.30.46.109:3002
Apr 20 2019 05:26:12 GMT: WARNING (hb): (hb.c:4845) could not create heartbeat connection to node {172.30.46.109:3002}
Apr 20 2019 05:26:12 GMT: INFO (demarshal): (socket.c:702) Started client endpoint 0.0.0.0:3000
Apr 20 2019 05:26:12 GMT: INFO (info-port): (thr_info_port.c:300) starting info port thread
Apr 20 2019 05:26:12 GMT: INFO (info-port): (socket.c:702) Started info endpoint 0.0.0.0:3003
Apr 20 2019 05:26:12 GMT: INFO (as): (as.c:416) service ready: soon there will be cake!
Apr 20 2019 05:26:14 GMT: INFO (clustering): (clustering.c:6345) principal node - forming new cluster with succession list: ad61545e60f70194
Apr 20 2019 05:26:14 GMT: INFO (clustering): (clustering.c:5784) applied new cluster key 96d3085cbe4d
Apr 20 2019 05:26:14 GMT: INFO (clustering): (clustering.c:5786) applied new succession list ad61545e60f70194
Apr 20 2019 05:26:14 GMT: INFO (clustering): (clustering.c:5788) applied cluster size 1
Apr 20 2019 05:26:14 GMT: INFO (exchange): (exchange.c:2159) data exchange started with cluster key 96d3085cbe4d
Apr 20 2019 05:26:14 GMT: INFO (exchange): (exchange.c:2989) received commit command from principal node ad61545e60f70194
Apr 20 2019 05:26:14 GMT: INFO (exchange): (exchange.c:2948) data exchange completed with cluster key 96d3085cbe4d
Apr 20 2019 05:26:14 GMT: INFO (partition): (partition_balance.c:915) {as-namespace-0} replication factor is 1
Apr 20 2019 05:26:14 GMT: INFO (partition): (partition_balance.c:887) {as-namespace-0} rebalanced: expected-migrations (0,0) expected-signals 0 fresh-partitions 4096
Apr 20 2019 05:26:15 GMT: WARNING (socket): (socket.c:755) Error while connecting: 113 (No route to host)
oc get svc -o wide
NAME                                                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE       SELECTOR
aerospike-operator                                       ClusterIP   172.30.46.109   <none>        443/TCP                      55s       app=aerospike-operator
as-cluster-0                                             ClusterIP   None            <none>        3000/TCP,3002/TCP,9145/TCP   43s       app=aerospike,cluster=as-cluster-0
glusterfs-dynamic-bf1e525f-632c-11e9-95f6-005056afc8ad   ClusterIP   172.30.90.194   <none>        1/TCP                        37s       <none>
glusterfs-dynamic-c9be2905-632c-11e9-95f6-005056afc8ad   ClusterIP   172.30.169.99   <none>        1/TCP                        19s       <none>
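
The address in the heartbeat warning can be cross-checked against the service list above. The following is only a minimal sketch, not part of the operator: the helper names (`failed_heartbeat_targets`, `services_by_cluster_ip`, `explain_targets`) are hypothetical, and the inputs are excerpts copied verbatim from the `oc logs` and `oc get svc -o wide` output in this issue. It assumes the whitespace-separated column layout shown in that output.

```python
import re

# Excerpts copied from the outputs in this issue (not the full logs).
SERVER_LOG = """\
Apr 20 2019 05:26:12 GMT: WARNING (socket): (socket.c:814) Error while connecting socket to 172.30.46.109:3002
Apr 20 2019 05:26:12 GMT: WARNING (hb): (hb.c:4845) could not create heartbeat connection to node {172.30.46.109:3002}
"""

SVC_TABLE = """\
NAME                                                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE       SELECTOR
aerospike-operator                                       ClusterIP   172.30.46.109   <none>        443/TCP                      55s       app=aerospike-operator
as-cluster-0                                             ClusterIP   None            <none>        3000/TCP,3002/TCP,9145/TCP   43s       app=aerospike,cluster=as-cluster-0
"""

# Matches the heartbeat warning format seen in the server log above.
HB_FAILURE = re.compile(
    r"could not create heartbeat connection to node \{([\d.]+):(\d+)\}"
)


def failed_heartbeat_targets(log_text):
    """Return the set of (ip, port) pairs the server failed to reach."""
    return {(m.group(1), int(m.group(2))) for m in HB_FAILURE.finditer(log_text)}


def services_by_cluster_ip(svc_table):
    """Map CLUSTER-IP to (service name, PORT(S)) from the service table text."""
    result = {}
    for line in svc_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        name, _type, cluster_ip, _external_ip, ports = fields[:5]
        if cluster_ip != "None":  # headless services have no cluster IP
            result[cluster_ip] = (name, ports)
    return result


def explain_targets(log_text, svc_table):
    """Pair each failed heartbeat target with the service owning that IP, if any."""
    services = services_by_cluster_ip(svc_table)
    findings = []
    for ip, port in sorted(failed_heartbeat_targets(log_text)):
        name, ports = services.get(ip, (None, None))
        findings.append((ip, port, name, ports))
    return findings


if __name__ == "__main__":
    for ip, port, name, ports in explain_targets(SERVER_LOG, SVC_TABLE):
        owner = f"service {name!r} exposing {ports}" if name else "no listed service"
        print(f"{ip}:{port} -> {owner}")
```

With the excerpts above, the sketch reports that 172.30.46.109 is the cluster IP of the `aerospike-operator` service, which exposes only 443/TCP, while `as-cluster-0` is headless and has no cluster IP. That matches the issue title, but it does not explain why the server was given that address as a heartbeat peer.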

pires commented 5 years ago

I honestly can't figure out what may be wrong from the logs above. Unfortunately, I also don't have enough experience with OpenShift to help. I know other people in the community have had issues with it when trying to use this and other operators we maintain.