sorintlab / stolon

PostgreSQL cloud native High Availability and more.
https://talk.stolon.io
Apache License 2.0
4.66k stars 447 forks

kubernetes example: superuser password #125

Closed: pinootto closed this issue 8 years ago

pinootto commented 8 years ago

When I try to connect to the postgres instance to create a password for the stolon superuser:

[stolon@stolon-keeper-rc-hwqxd ~]$ psql -h localhost -p 5432 postgres

I get the following error:

[stolon@stolon-keeper-rc-hou7z ~]$ psql -h localhost -p 5432 postgres
psql: could not connect to server: Connection refused
        Is the server running on host "localhost" (::1) and accepting
        TCP/IP connections on port 5432?
could not connect to server: Connection refused
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 5432?

pinootto commented 8 years ago

It seems that postgres is not running inside the container.

If I manually run /usr/local/bin/run.sh, I get the following output:

[root@stolon-keeper-rc-hou7z bin]# ./run.sh
start
HOSTNAME=stolon-keeper-rc-hou7z
STKEEPER_CLUSTER_NAME=kube-stolon
KUBERNETES_PORT=tcp://10.254.0.1:443
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_HOST=10.254.0.1
KEEPER=true
STKEEPER_STORE_ENDPOINTS=192.168.33.10:2379
LS_COLORS=
NGINX_SERVICE_PORT_8000_TCP_ADDR=10.254.97.124
NGINX_SERVICE_SERVICE_HOST=10.254.97.124
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/usr/local/bin
NGINX_SERVICE_PORT=tcp://10.254.97.124:8000
NGINX_SERVICE_PORT_8000_TCP=tcp://10.254.97.124:8000
NGINX_SERVICE_PORT_8000_TCP_PORT=8000
HOME=/root
SHLVL=2
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_SERVICE_PORT_HTTPS=443
NGINX_SERVICE_SERVICE_PORT=8000
NGINX_SERVICE_PORT_8000_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_ADDR=10.254.0.1
STKEEPER_STORE_BACKEND=etcd
KUBERNETES_PORT_443_TCP=tcp://10.254.0.1:443
POD_IP=172.17.16.13
STKEEPER_DEBUG=true
_=/usr/bin/env
2016-03-24 06:45:18.368835 [keeper.go:898] W | keeper: both --pg-su-username and --pg-su-password needs to be defined to use pg_rewind
2016-03-24 06:45:18.368954 [keeper.go:904] C | keeper: cannot take exclusive lock on data dir "/stolon-data": file already locked

sgotti commented 8 years ago

@pinootto Can you provide the logs for the stolon-keeper pod (kubectl log $podname)?

The lock error means that the stolon keeper is already running (otherwise the pod would exit). What's probably not running is postgres. This can happen for various reasons (errors talking to the etcd cluster, a failed initdb, etc.).
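The "file already locked" behavior can be reproduced generically (a local sketch using util-linux flock(1), not stolon's own locking code): a second non-blocking exclusive lock on an already-held file fails immediately, which is what running run.sh by hand hits while the keeper pod still holds /stolon-data.

```shell
# Generic illustration of an exclusive file lock already being held;
# the lock file here is a temp file, not stolon's data dir.
lockfile=$(mktemp)
flock -n "$lockfile" -c 'sleep 2' &   # first holder: keeps the lock for 2s
sleep 0.5
if flock -n "$lockfile" -c 'true'; then
  echo "lock acquired"
else
  echo "already locked"               # the second attempt fails immediately
fi
wait
```

So the lock error itself is expected when a second keeper process is started against the same data dir; the real question is why postgres is not up inside the already-running keeper.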

pinootto commented 8 years ago

[root@localhost kubernetes]# k log stolon-keeper-rc-5okc8
W0324 05:48:13.308432 7226 cmd.go:200] log is DEPRECATED and will be removed in a future version. Use logs instead.
start
HOSTNAME=stolon-keeper-rc-5okc8
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT=tcp://10.254.0.1:443
STKEEPER_CLUSTER_NAME=kube-stolon
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_HOST=10.254.0.1
STKEEPER_STORE_ENDPOINTS=192.168.33.10:2379
KEEPER=true
NGINX_SERVICE_PORT_8000_TCP_ADDR=10.254.97.124
NGINX_SERVICE_SERVICE_HOST=10.254.97.124
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/
NGINX_SERVICE_PORT=tcp://10.254.97.124:8000
NGINX_SERVICE_PORT_8000_TCP=tcp://10.254.97.124:8000
SHLVL=1
HOME=/root
NGINX_SERVICE_PORT_8000_TCP_PORT=8000
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_SERVICE_PORT_HTTPS=443
NGINX_SERVICE_SERVICE_PORT=8000
NGINX_SERVICE_PORT_8000_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_ADDR=10.254.0.1
KUBERNETES_PORT_443_TCP=tcp://10.254.0.1:443
STKEEPER_STORE_BACKEND=etcd
POD_IP=172.17.16.18
STKEEPER_DEBUG=true
_=/usr/bin/env
2016-03-24 09:47:23.024724 [keeper.go:898] W | keeper: both --pg-su-username and --pg-su-password needs to be defined to use pg_rewind
2016-03-24 09:47:23.025144 [keeper.go:927] I | keeper: generated id: 341474c5
2016-03-24 09:47:23.025300 [keeper.go:934] I | keeper: id: 341474c5
2016-03-24 09:47:23.028038 [keeper.go:361] D | keeper: clusterView: (*cluster.ClusterView){Version:(int)0 Master:(string) KeepersRole:(cluster.KeepersRole)map[] ProxyConf:(*cluster.ProxyConf)<nil> Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> PGReplUser:(*string)<nil> PGReplPassword:(*string)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)<nil> UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)0001-01-01 00:00:00 +0000 UTC}
2016-03-24 09:47:23.028080 [keeper.go:364] D | keeper: clusterConfig: (*cluster.Config){RequestTimeout:(time.Duration)10s SleepInterval:(time.Duration)5s KeeperFailInterval:(time.Duration)20s PGReplUser:(string)repluser PGReplPassword:(string)replpassword MaxStandbysPerSender:(uint)3 SynchronousReplication:(bool)false InitWithMultipleKeepers:(bool)false UsePGRewind:(bool)false PGParameters:(map[string]string)map[]}
2016-03-24 09:47:23.028137 [postgresql.go:171] I | postgresql: Stopping database
2016-03-24 09:47:23.038144 [keeper.go:489] E | keeper: error retrieving cluster view: client: etcd cluster is unavailable or misconfigured
(the same error then repeats every 5 seconds through 09:48:13)

pinootto commented 8 years ago

the etcd is running on the master, which has the following IP addresses:

[root@localhost kubernetes]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:7c:4f:9a brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic eth0
       valid_lft 53145sec preferred_lft 53145sec
    inet6 fe80::5054:ff:fe7c:4f9a/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:a1:83:f4 brd ff:ff:ff:ff:ff:ff
    inet 192.168.33.10/24 brd 192.168.33.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fea1:83f4/64 scope link
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether 02:42:e7:43:06:96 brd ff:ff:ff:ff:ff:ff
    inet 172.17.42.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:e7ff:fe43:696/64 scope link
       valid_lft forever preferred_lft forever

pinootto commented 8 years ago

[root@localhost kubernetes]# cat stolon-keeper.yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: stolon-keeper-rc
spec:
  replicas: 1
  selector:
    name: stolon-keeper
  template:
    metadata:
      labels:
        name: stolon-keeper
        stolon-cluster: "kube-stolon"
        stolon-keeper: "true"
    spec:
      containers:

pinootto commented 8 years ago

from the master:

[root@localhost kubernetes]# k get po -o wide
NAME                       READY     STATUS    RESTARTS   AGE       NODE
stolon-keeper-rc-5okc8     1/1       Running   0          7m        192.168.33.11
stolon-sentinel-rc-92mpw   1/1       Running   0          10m       192.168.33.11

pinootto commented 8 years ago

IP addresses of the node:

[root@localhost kubernetes]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:7c:4f:9a brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic eth0
       valid_lft 76972sec preferred_lft 76972sec
    inet6 fe80::5054:ff:fe7c:4f9a/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:49:36:22 brd ff:ff:ff:ff:ff:ff
    inet 192.168.33.11/24 brd 192.168.33.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe49:3622/64 scope link
       valid_lft forever preferred_lft forever
4: flannel0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1472 qdisc pfifo_fast state UNKNOWN qlen 500
    link/none
    inet 172.17.16.0/16 scope global flannel0
       valid_lft forever preferred_lft forever
5: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1472 qdisc noqueue state UP
    link/ether 02:42:9d:61:6a:37 brd ff:ff:ff:ff:ff:ff
    inet 172.17.16.1/24 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:9dff:fe61:6a37/64 scope link
       valid_lft forever preferred_lft forever
37: vethdd2631b@if36: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1472 qdisc noqueue master docker0 state UP
    link/ether f6:94:ec:44:be:2d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::f494:ecff:fe44:be2d/64 scope link
       valid_lft forever preferred_lft forever
39: veth6c41f5d@if38: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1472 qdisc noqueue master docker0 state UP
    link/ether ca:07:87:c4:75:df brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::c807:87ff:fec4:75df/64 scope link
       valid_lft forever preferred_lft forever

sgotti commented 8 years ago

[keeper.go:489] E | keeper: error retrieving cluster view: client: etcd cluster is unavailable or misconfigured

Looks like the pod cannot connect to etcd at STKEEPER_STORE_ENDPOINTS=192.168.33.10:2379.

You should check that your pod can reach it and that etcd is advertising the correct --advertise-client-urls (telnet is not enough: you can connect, but etcd may then advertise a different URL such as localhost:2379).

The fastest way is to install etcdctl in the pod (or use a dedicated pod) and use it to test etcd connectivity.

pinootto commented 8 years ago

I think that the pod can connect to etcd.

I get the following logs from the sentinel:

[root@localhost kubernetes]# k logs stolon-sentinel-rc-drwvq
start
STSENTINEL_KEEPER_KUBE_LABEL_SELECTOR=stolon-cluster=kube-stolon,stolon-keeper=true
HOSTNAME=stolon-sentinel-rc-drwvq
KUBERNETES_PORT=tcp://10.254.0.1:443
KUBERNETES_PORT_443_TCP_PORT=443
STSENTINEL_DEBUG=true
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_HOST=10.254.0.1
STSENTINEL_STORE_BACKEND=etcd
STSENTINEL_CLUSTER_NAME=kube-stolon
NGINX_SERVICE_PORT_8000_TCP_ADDR=10.254.97.124
NGINX_SERVICE_SERVICE_HOST=10.254.97.124
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
SENTINEL=true
PWD=/
NGINX_SERVICE_PORT=tcp://10.254.97.124:8000
STSENTINEL_STORE_ENDPOINTS=192.168.33.10:2379
NGINX_SERVICE_PORT_8000_TCP=tcp://10.254.97.124:8000
SHLVL=1
HOME=/root
NGINX_SERVICE_PORT_8000_TCP_PORT=8000
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_SERVICE_PORT_HTTPS=443
NGINX_SERVICE_PORT_8000_TCP_PROTO=tcp
NGINX_SERVICE_SERVICE_PORT=8000
KUBERNETES_PORT_443_TCP_ADDR=10.254.0.1
KUBERNETES_PORT_443_TCP=tcp://10.254.0.1:443
POD_IP=172.17.16.8
_=/usr/bin/env
2016-03-28 02:21:11.760155 [sentinel.go:845] I | sentinel: id: 1f823694
2016-03-28 02:21:11.760908 [sentinel.go:85] I | sentinel: Trying to acquire sentinels leadership
2016-03-28 02:21:11.771495 [sentinel.go:742] D | sentinel: keepersState: (cluster.KeepersState)<nil>
2016-03-28 02:21:11.771587 [sentinel.go:743] D | sentinel: clusterView: (*cluster.ClusterView){Version:(int)0 Master:(string) KeepersRole:(cluster.KeepersRole)map[] ProxyConf:(*cluster.ProxyConf)<nil> Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> PGReplUser:(*string)<nil> PGReplPassword:(*string)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)<nil> UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)0001-01-01 00:00:00 +0000 UTC}
2016-03-28 02:21:11.771753 [sentinel.go:194] D | sentinel: sentinelInfo: (*cluster.SentinelInfo){ID:(string)1f823694 ListenAddress:(string)172.17.16.8 Port:(string)6431}
2016-03-28 02:21:11.771976 [sentinel.go:750] E | sentinel: cannot update sentinel info: client: etcd cluster is unavailable or misconfigured
2016-03-28 02:21:11.772149 [sentinel.go:107] E | sentinel: election loop error: client: etcd cluster is unavailable or misconfigured
2016-03-28 02:21:16.773285 [sentinel.go:729] E | sentinel: error retrieving cluster data: client: etcd cluster is unavailable or misconfigured
2016-03-28 02:21:21.772548 [sentinel.go:85] I | sentinel: Trying to acquire sentinels leadership
2016-03-28 02:21:21.773395 [sentinel.go:107] E | sentinel: election loop error: client: etcd cluster is unavailable or misconfigured
2016-03-28 02:21:21.773882 [sentinel.go:729] E | sentinel: error retrieving cluster data: client: etcd cluster is unavailable or misconfigured
2016-03-28 02:21:26.774536 [sentinel.go:729] E | sentinel: error retrieving cluster data: client: etcd cluster is unavailable or misconfigured

Maybe the problem is a misconfiguration of etcd.

Here is my etcd config on 192.168.33.10:

[root@localhost kubernetes]# cat /etc/etcd/etcd.conf
[member]
#ETCD_NAME=default
ETCD_NAME=kube-stolon
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_WAL_DIR=""
ETCD_SNAPSHOT_COUNT="10000"
ETCD_HEARTBEAT_INTERVAL="100"
ETCD_ELECTION_TIMEOUT="1000"
#ETCD_LISTEN_PEER_URLS="http://localhost:2380"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_MAX_SNAPSHOTS="5"
ETCD_MAX_WALS="5"
ETCD_CORS=""
#
[cluster]
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.33.10:2380"
# if you use different ETCD_NAME (e.g. test), set ETCD_INITIAL_CLUSTER value for this name, i.e. "test=http://..."
#ETCD_INITIAL_CLUSTER="default=http://localhost:2380"
ETCD_INITIAL_CLUSTER="kube-stolon=http://0.0.0.0:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.33.10:2379"
ETCD_DISCOVERY=""
ETCD_DISCOVERY_SRV=""
ETCD_DISCOVERY_FALLBACK="proxy"
ETCD_DISCOVERY_PROXY=""
ETCD_STRICT_RECONFIG_CHECK="false"
#
[proxy]
ETCD_PROXY="off"
ETCD_PROXY_FAILURE_WAIT="5000"
ETCD_PROXY_REFRESH_INTERVAL="30000"
ETCD_PROXY_DIAL_TIMEOUT="1000"
ETCD_PROXY_WRITE_TIMEOUT="5000"
ETCD_PROXY_READ_TIMEOUT="0"
#
[security]
ETCD_CERT_FILE=""
ETCD_KEY_FILE=""
ETCD_CLIENT_CERT_AUTH="false"
ETCD_TRUSTED_CA_FILE=""
ETCD_PEER_CERT_FILE=""
ETCD_PEER_KEY_FILE=""
ETCD_PEER_CLIENT_CERT_AUTH="false"
ETCD_PEER_TRUSTED_CA_FILE=""
#
[logging]
ETCD_DEBUG="false"
# examples for -log-package-levels etcdserver=WARNING,security=DEBUG
ETCD_LOG_PACKAGE_LEVELS=""

Can you please have a look at it to see whether something is misconfigured?

Does the cluster name have to be the same in /etc/etcd/etcd.conf and in stolon-sentinel.yaml?

If you need other info to investigate the problem, please let me know.


pinootto commented 8 years ago

Here is some more info:

from the kubernetes master (192.168.33.10):

[root@localhost kubernetes]# k get po -o wide
NAME                       READY     STATUS    RESTARTS   AGE       NODE
stolon-sentinel-rc-drwvq   1/1       Running   0          36m       192.168.33.11

then on the node (192.168.33.11):

[root@localhost bin]# ./stolonctl --cluster-name=kube-stolon --store-backend=etcd --store-endpoints=192.168.33.10:2379 status
=== Active sentinels ===

No active sentinels

cannot get proxies info: client: etcd cluster is unavailable or misconfigured

sgotti commented 8 years ago

Also the sentinel log says that it can't connect to the etcd cluster.

ETCD_ADVERTISE_CLIENT_URLS="http://192.168.33.10:2379"

Your --advertise-client-url looks correct.

Has the cluster name to be the same on /etc/etcd/etcd.conf and stolon-sentinel.yaml?

No, they are completely unrelated. The first is the etcd cluster name; the other is the stolon cluster name, which is used to compute the etcd key paths for the stolon cluster data.
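For example, the per-cluster prefix is visible directly in the store (a sketch assuming etcdctl is available where you run it; the exact layout of keys under /stolon may vary between stolon versions):

```shell
# List stolon's keys: the stolon cluster name only namespaces keys under a
# prefix in etcd and is unrelated to ETCD_NAME. Endpoint from this thread.
if command -v etcdctl >/dev/null 2>&1; then
  etcdctl --endpoint http://192.168.33.10:2379 ls --recursive /stolon \
    || echo "store not reachable from here"
else
  echo "etcdctl not installed here"
fi
```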

My suggestion is to try using etcdctl inside a pod to see if it can connect to the etcd cluster. You can create a new pod or just copy the etcdctl binary into one of the stolon pods (you can use docker cp after retrieving the related docker container id). Then run something like etcdctl --debug --endpoint http://192.168.33.10:2379 ls /

If all goes ok you'll see something like this:

start to sync cluster using endpoints(http://192.168.33.10:2379)
cURL Command: curl -X GET http://192.168.33.10:2379/v2/members
got endpoints(http://192.168.33.10:2379) after sync
Cluster-Endpoints: http://192.168.33.10:2379
cURL Command: curl -X GET http://192.168.33.10:2379/v2/keys/?quorum=false&recursive=false&sorted=false
[...]

As you can see, there's a first connection to the provided endpoint (the output shows the equivalent curl commands) where the client retrieves the cluster endpoints (the advertised client urls) and then uses them to query one of the cluster nodes.

If it doesn't work, we can see at which point it breaks.

pinootto commented 8 years ago

OK.

I tried to run the etcdctl command inside the sentinel pod. It seems to work:

[root@stolon-sentinel-rc-82xq7 ~]# ./etcdctl --debug --endpoint http://192.168.33.10:2379 ls /
start to sync cluster using endpoints(http://192.168.33.10:2379)
cURL Command: curl -X GET http://192.168.33.10:2379/v2/members
got endpoints(http://192.168.33.10:2379) after sync
Cluster-Endpoints: http://192.168.33.10:2379
cURL Command: curl -X GET http://192.168.33.10:2379/v2/keys/?quorum=false&recursive=false&sorted=false
/atomic.io
/registry
/stolon

pinootto commented 8 years ago

I don't know why, but now the pod logs have changed:


2016-03-29 00:55:39.787916 [sentinel.go:845] I | sentinel: id: 565bf6df
2016-03-29 00:55:39.803030 [sentinel.go:85] I | sentinel: Trying to acquire sentinels leadership
2016-03-29 00:55:39.807937 [sentinel.go:95] I | sentinel: sentinel leadership acquired
2016-03-29 00:55:39.808151 [sentinel.go:742] D | sentinel: keepersState: (cluster.KeepersState)<nil>
2016-03-29 00:55:39.808506 [sentinel.go:743] D | sentinel: clusterView: (*cluster.ClusterView){Version:(int)0 Master:(string) KeepersRole:(cluster.KeepersRole)map[] ProxyConf:(*cluster.ProxyConf)<nil> Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> PGReplUser:(*string)<nil> PGReplPassword:(*string)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)<nil> UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)0001-01-01 00:00:00 +0000 UTC}
2016-03-29 00:55:39.808549 [sentinel.go:194] D | sentinel: sentinelInfo: (*cluster.SentinelInfo){ID:(string)565bf6df ListenAddress:(string)172.17.16.7 Port:(string)6431}
2016-03-29 00:55:39.810092 [sentinel.go:244] D | sentinel: running inside kubernetes
2016-03-29 00:55:39.810130 [sentinel.go:758] E | sentinel: err: failed to get running pods ips: cannot retrieve kube api token: open /run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
2016-03-29 00:55:44.812379 [sentinel.go:742] D | sentinel: keepersState: (cluster.KeepersState)<nil>
2016-03-29 00:55:44.812507 [sentinel.go:743] D | sentinel: clusterView: (*cluster.ClusterView){Version:(int)0 Master:(string) KeepersRole:(cluster.KeepersRole)map[] ProxyConf:(*cluster.ProxyConf)<nil> Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> PGReplUser:(*string)<nil> PGReplPassword:(*string)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)<nil> UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)0001-01-01 00:00:00 +0000 UTC}
2016-03-29 00:55:44.812541 [sentinel.go:194] D | sentinel: sentinelInfo: (*cluster.SentinelInfo){ID:(string)565bf6df ListenAddress:(string)172.17.16.7 Port:(string)6431}
2016-03-29 00:55:44.814202 [sentinel.go:244] D | sentinel: running inside kubernetes
2016-03-29 00:55:44.814259 [sentinel.go:758] E | sentinel: err: failed to get running pods ips: cannot retrieve kube api token: open /run/secrets/kubernetes.io/serviceaccount/token: no such file or directory

Now the error is:

2016-03-29 00:55:39.810130 [sentinel.go:758] E | sentinel: err: failed to get running pods ips: cannot retrieve kube api token: open /run/secrets/kubernetes.io/serviceaccount/token: no such file or directory

pinootto commented 8 years ago

From the minion node I see the following:

[root@localhost bin]# ./stolonctl --cluster-name=kube-stolon --store-backend=etcd --store-endpoints=192.168.33.10:2379 config get
error: no clusterview available
[root@localhost bin]# ./stolonctl --cluster-name=kube-stolon --store-backend=etcd --store-endpoints=192.168.33.10:2379 status
=== Active sentinels ===

ID              LISTENADDRESS           LEADER
77db935e        172.17.16.12:6431       true

=== Active proxies ===

No active proxies
cluster data not available: <nil>

pinootto commented 8 years ago

So from the minion node I tried to insert the configuration with the following command, but I get an error:

[root@localhost bin]# echo '{ "request_timeout": "10s", "sleep_interval": "5s", "keeper_fail_interval": "20s", "pg_repl_user": "username", "pg_repl_password": "password", "max_standbys_per_sender": 3, "synchronous_replication": false, "init_with_multiple_keepers": false, "use_pg_rewind": false, "pg_parameters": null }' | ./stolonctl --cluster-name=kube-stolon --store-backend=etcd --store-endpoints=192.168.33.10:2379 config replace -f -
error: error setting config: Put http://172.17.16.12:6431/config/current: EOF

pinootto commented 8 years ago

I added ServiceAccount to the apiserver configuration and followed these instructions to create a secret:

1. create a CA and a server certificate
openssl genrsa -out ca.key 2048
openssl req -x509 -new -nodes -key ca.key -subj "/CN=kalix.com" -days 5000 -out ca.crt
openssl genrsa -out server.key 2048
openssl req -new -key server.key -subj "/CN=kubernetes-master" -out server.csr
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out server.crt -days 5000

2. cp  ./*  /var/run/kubernetes/

3. ## vim /etc/kubernetes/apiserver 

KUBE_API_ARGS="--client_ca_file=/var/run/kubernetes/ca.crt --tls-private-key-file=/var/run/kubernetes/server.key --tls-cert-file=/var/run/kubernetes/server.crt"

systemctl restart kube-apiserver

4. ## vim /etc/kubernetes/controller-manager 

KUBE_CONTROLLER_MANAGER_ARGS="--service_account_private_key_file=/var/run/kubernetes/apiserver.key --root-ca-file=/var/run/kubernetes/ca.crt"

systemctl restart kube-controller-manager 

5. ## then

 kubectl  get serviceaccounts --all-namespaces

NAMESPACE     NAME      SECRETS   AGE
default       default   1         8d
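For reference, enabling serviceAccounts on this kind of install also means having ServiceAccount in the apiserver admission-control list; a sketch of the relevant line (the exact plugin list below is an assumption, not copied from this cluster):

```shell
# /etc/kubernetes/apiserver
KUBE_ADMISSION_CONTROL="--admission-control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ServiceAccount,ResourceQuota"
```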

So, now the error has changed to:

2016-03-29 07:52:13.376666 [sentinel.go:845] I | sentinel: id: d345d92d
2016-03-29 07:52:13.427320 [sentinel.go:85] I | sentinel: Trying to acquire sentinels leadership
2016-03-29 07:52:13.430309 [sentinel.go:742] D | sentinel: keepersState: (cluster.KeepersState)<nil>
2016-03-29 07:52:13.430399 [sentinel.go:743] D | sentinel: clusterView: (*cluster.ClusterView){Version:(int)0 Master:(string) KeepersRole:(cluster.KeepersRole)map[] ProxyConf:(*cluster.ProxyConf)<nil> Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> PGReplUser:(*string)<nil> PGReplPassword:(*string)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)<nil> UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)0001-01-01 00:00:00 +0000 UTC}
2016-03-29 07:52:13.430424 [sentinel.go:194] D | sentinel: sentinelInfo: (*cluster.SentinelInfo){ID:(string)d345d92d ListenAddress:(string)172.17.16.13 Port:(string)6431}
2016-03-29 07:52:13.431938 [sentinel.go:244] D | sentinel: running inside kubernetes
2016-03-29 07:52:13.433858 [sentinel.go:95] I | sentinel: sentinel leadership acquired
2016-03-29 07:52:13.439512 [sentinel.go:758] E | sentinel: err: failed to get running pods ips: Get https://10.254.0.1:443/api/v1/namespaces/default/pods?labelSelector=stolon-cluster%3Dkube-stolon%2Cstolon-keeper%3Dtrue: read tcp 172.17.16.13:41308->10.254.0.1:443: read: connection reset by peer

pinootto commented 8 years ago

Using nmap from the minion node, I found that port 443 is "filtered":

[root@localhost bin]# nmap 10.254.0.1 -p 443

Starting Nmap 6.40 ( http://nmap.org ) at 2016-03-29 04:24 EDT
Nmap scan report for 10.254.0.1
Host is up (0.00033s latency).
PORT    STATE    SERVICE
443/tcp filtered https

Nmap done: 1 IP address (1 host up) scanned in 0.41 seconds

I don't understand what "filtered" means.

sgotti commented 8 years ago

I don't know why, but now the pod logs have changed:

Now it can connect; you probably changed something in your configuration that made the etcd communication work.

2016-03-29 00:55:39.810130 [sentinel.go:758] E | sentinel: err: failed to get running pods ips: cannot retrieve kube api token: open /run/secrets/kubernetes.io/serviceaccount/token: no such file or directory

I added ServiceAccount in the apiserver configuration and I followed the following instructions to create a secret:

Good. We've always deployed kubernetes clusters with serviceAccounts enabled by default, and from the docs it looks like every kubernetes pod should get this secret volume mounted. But apparently that's not always true. We should probably check for its existence.
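A quick existence check from inside any pod (the path comes from the error message above):

```shell
# Check whether the serviceaccount secret volume is mounted in this pod;
# outside a pod (or without the ServiceAccount admission plugin) it is absent.
SA_DIR=/run/secrets/kubernetes.io/serviceaccount
if [ -f "$SA_DIR/token" ]; then
  echo "serviceaccount token mounted"
else
  echo "serviceaccount token missing"
fi
```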

[root@localhost bin]# nmap 10.254.0.1 -p 443

From the nmap man page: "filtered" means that a firewall, filter, or other network obstacle is blocking the port, so that Nmap cannot tell whether it is open or closed.

But you should probably check from a pod (a node can use a different api address). Can you try a curl from a pod?
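Something like this (a sketch: the service IP, label selector, and token path are the ones from the errors above; it needs curl available inside the pod):

```shell
# Query the apiserver the same way the sentinel does, authenticating with
# the mounted serviceaccount token; prints a hint when run outside a pod.
SA=/run/secrets/kubernetes.io/serviceaccount
if [ -f "$SA/token" ]; then
  curl -sk -H "Authorization: Bearer $(cat "$SA/token")" \
    "https://10.254.0.1:443/api/v1/namespaces/default/pods?labelSelector=stolon-cluster%3Dkube-stolon%2Cstolon-keeper%3Dtrue"
else
  echo "no serviceaccount token here: run this from inside a pod"
fi
```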

All of this is related to the stolon sentinels using the kubernetes api to discover keepers, instead of the default discovery done through etcd (where the keepers publish their discovery info). This is cleaner but can create problems on some kubernetes deployments (like yours). I'm thinking of adding an option to disable kubernetes discovery and fall back to the default store-based one.
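For reference, the request the sentinel makes can be reproduced by hand from inside a pod; a sketch assuming the default serviceaccount mount paths (the URL and label selector are the ones from the sentinel error log above):

```shell
# Reproduce the sentinel's pod-listing request by hand, from inside a pod.
# The URL and label selector are taken from the sentinel error log above;
# the token/ca paths are the default serviceaccount mount points.
SA=/run/secrets/kubernetes.io/serviceaccount
URL='https://10.254.0.1:443/api/v1/namespaces/default/pods?labelSelector=stolon-cluster%3Dkube-stolon%2Cstolon-keeper%3Dtrue'
echo "querying: $URL"
if [ -s "$SA/token" ]; then
    curl --max-time 5 --cacert "$SA/ca.crt" \
         -H "Authorization: Bearer $(cat "$SA/token")" \
         "$URL"
else
    echo "serviceaccount token not mounted; run this from a pod with serviceAccount enabled"
fi
```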

pinootto commented 8 years ago

I tried from another pod (centos), not the sentinel pod, on the same node:

[root@centos /]# nmap 10.254.0.1 -p 443

Starting Nmap 6.40 ( http://nmap.org ) at 2016-03-30 02:16 UTC
Nmap scan report for 10.254.0.1
Host is up (0.000054s latency).
PORT    STATE SERVICE
443/tcp open  https

Nmap done: 1 IP address (1 host up) scanned in 13.04 seconds

But if I try curl from the same pod (centos), I get an error:

[root@centos /]# curl 10.254.0.1:443   
curl: (56) Recv failure: Connection reset by peer

I cannot connect to the sentinel pod from the master node. I get this error:

[root@localhost kubernetes]# k get po -o wide
NAME                       READY     STATUS    RESTARTS   AGE       NODE
centos                     1/1       Running   0          1d        192.168.33.11
postgres-kpaaf             1/1       Running   0          1d        192.168.33.11
stolon-sentinel-rc-d94y4   1/1       Running   0          17h       192.168.33.11
[root@localhost kubernetes]# k exec stolon-sentinel-rc-d94y4 -it /bin/bash
error: error executing remote command: Error executing command in container: Error executing in Docker Container: -1
pinootto commented 8 years ago

Now I can connect to the sentinel pod:

[root@localhost kubernetes]# k get po -o wide
NAME                         READY     STATUS    RESTARTS   AGE       NODE
centos                       1/1       Running   0          2d        192.168.33.11
kubernetes-dashboard-485pv   1/1       Running   0          17h       192.168.33.11
postgres-kpaaf               1/1       Running   0          2d        192.168.33.11
stolon-sentinel-rc-8h6nt     1/1       Running   0          16h       192.168.33.11

[root@localhost kubernetes]# k exec stolon-sentinel-rc-8h6nt -it /bin/bash

[root@stolon-sentinel-rc-8h6nt /]#

I tried curl from inside the sentinel pod and I get the following error:

[root@stolon-sentinel-rc-8h6nt tmp]# curl https://10.254.0.1:443 -vv
* Rebuilt URL to: https://10.254.0.1:443/
*   Trying 10.254.0.1...
* Connected to 10.254.0.1 (10.254.0.1) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* NSS error -5938 (PR_END_OF_FILE_ERROR)
* Encountered end of file
* Closing connection 0
curl: (35) Encountered end of file
pinootto commented 8 years ago

I tried openssl from inside the sentinel pod and I get the following error:

[root@stolon-sentinel-rc-8h6nt /]# openssl s_client -connect 10.254.0.1:443 -msg                           
CONNECTED(00000003)
>>> TLS 1.2 Handshake [length 00ca], ClientHello
    01 00 00 c6 03 03 e7 ca fc d2 1f 77 4f 7b 24 2e
    86 b6 d9 30 8e 2e 1c e4 bd 2f b4 ab 4f 3e f1 a1
    6f 27 e5 81 e5 5f 00 00 5a c0 2f c0 2b c0 27 c0
    23 c0 13 c0 09 00 9c 00 3c 00 2f 00 a2 00 9e 00
    67 00 40 00 33 00 32 00 41 00 45 00 44 c0 30 c0
    2c c0 28 c0 24 c0 14 c0 0a 00 9d 00 3d 00 35 00
    a3 00 9f 00 6b 00 6a 00 39 00 38 00 84 00 88 00
    87 c0 12 c0 08 00 0a 00 16 00 13 c0 11 c0 07 00
    05 00 ff 01 00 00 43 00 0b 00 04 03 00 01 02 00
    0a 00 0a 00 08 00 19 00 18 00 16 00 17 00 23 00
    00 00 0d 00 20 00 1e 06 01 06 02 06 03 05 01 05
    02 05 03 04 01 04 02 04 03 03 01 03 02 03 03 02
    01 02 02 02 03 00 0f 00 01 01
139780265867128:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:s23_lib.c:184:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 207 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
---
pinootto commented 8 years ago
[root@stolon-sentinel-rc-8h6nt /]# openssl s_client -showcerts -debug -connect 10.254.0.1:443 -msg       
CONNECTED(00000003)
write to 0x158adc0 [0x1684200] (207 bytes => 207 (0xCF))
0000 - 16 03 01 00 ca 01 00 00-c6 03 03 d1 ab cb 63 3f   ..............c?
0010 - 31 ee d5 c2 b8 23 8f dd-ae 20 6d df 49 14 1a ba   1....#... m.I...
0020 - ab d5 c3 2c 25 95 eb 6d-16 57 4e 00 00 5a c0 2f   ...,%..m.WN..Z./
0030 - c0 2b c0 27 c0 23 c0 13-c0 09 00 9c 00 3c 00 2f   .+.'.#.......<./
0040 - 00 a2 00 9e 00 67 00 40-00 33 00 32 00 41 00 45   .....g.@.3.2.A.E
0050 - 00 44 c0 30 c0 2c c0 28-c0 24 c0 14 c0 0a 00 9d   .D.0.,.(.$......
0060 - 00 3d 00 35 00 a3 00 9f-00 6b 00 6a 00 39 00 38   .=.5.....k.j.9.8
0070 - 00 84 00 88 00 87 c0 12-c0 08 00 0a 00 16 00 13   ................
0080 - c0 11 c0 07 00 05 00 ff-01 00 00 43 00 0b 00 04   ...........C....
0090 - 03 00 01 02 00 0a 00 0a-00 08 00 19 00 18 00 16   ................
00a0 - 00 17 00 23 00 00 00 0d-00 20 00 1e 06 01 06 02   ...#..... ......
00b0 - 06 03 05 01 05 02 05 03-04 01 04 02 04 03 03 01   ................
00c0 - 03 02 03 03 02 01 02 02-02 03 00 0f 00 01 01      ...............
>>> TLS 1.2 Handshake [length 00ca], ClientHello
    01 00 00 c6 03 03 d1 ab cb 63 3f 31 ee d5 c2 b8
    23 8f dd ae 20 6d df 49 14 1a ba ab d5 c3 2c 25
    95 eb 6d 16 57 4e 00 00 5a c0 2f c0 2b c0 27 c0
    23 c0 13 c0 09 00 9c 00 3c 00 2f 00 a2 00 9e 00
    67 00 40 00 33 00 32 00 41 00 45 00 44 c0 30 c0
    2c c0 28 c0 24 c0 14 c0 0a 00 9d 00 3d 00 35 00
    a3 00 9f 00 6b 00 6a 00 39 00 38 00 84 00 88 00
    87 c0 12 c0 08 00 0a 00 16 00 13 c0 11 c0 07 00
    05 00 ff 01 00 00 43 00 0b 00 04 03 00 01 02 00
    0a 00 0a 00 08 00 19 00 18 00 16 00 17 00 23 00
    00 00 0d 00 20 00 1e 06 01 06 02 06 03 05 01 05
    02 05 03 04 01 04 02 04 03 03 01 03 02 03 03 02
    01 02 02 02 03 00 0f 00 01 01
read from 0x158adc0 [0x1689760] (7 bytes => 0 (0x0))
140527366137720:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:s23_lib.c:184:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 207 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
---
pinootto commented 8 years ago

Maybe it is a problem with the SSL certificates on the apiserver, but I don't know how to check or solve it.
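One way to check would be to look at what certificate (if any) the apiserver presents. A sketch: the in-cluster command targets the 10.254.0.1:443 service IP from this thread, and the demo below runs the same x509 inspection on a throwaway self-signed certificate (the handshake failing after 0 bytes read, as in the openssl output above, suggests nothing is being served as TLS on that port):

```shell
# From inside a pod, dump the cert the apiserver presents:
#   openssl s_client -connect 10.254.0.1:443 </dev/null 2>/dev/null \
#     | openssl x509 -noout -subject -issuer -dates
# Demo of the same x509 inspection on a throwaway self-signed cert,
# so the commands can be exercised anywhere:
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout /tmp/demo-key.pem -out /tmp/demo-cert.pem \
    -subj "/CN=kubernetes" 2>/dev/null
openssl x509 -in /tmp/demo-cert.pem -noout -subject
```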

Any idea?

sgotti commented 8 years ago

For posterity, #129 added a --discovery-type option. If it's not specified and the stolon-sentinel detects that it's running inside a k8s pod, it will use the k8s api for discovering keepers. A user can pass --discovery-type=store to force keeper discovery through the store. This avoids problems like the above when the k8s cluster doesn't have security enabled/working.
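Forcing store-based discovery would look something like this (a sketch: --discovery-type is the flag from #129, and the cluster name and etcd endpoint mirror the values used earlier in this thread):

```shell
# Sketch: force keeper discovery through the store instead of the k8s api.
# kube-stolon and 192.168.33.10:2379 are the values from this thread.
stolon-sentinel --cluster-name=kube-stolon \
    --store-backend=etcd \
    --store-endpoints=192.168.33.10:2379 \
    --discovery-type=store
```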

sgotti commented 8 years ago

A lot of changes have been made to stolon in the current master. The discovery-type option has been removed and everything is now done through the store. I hope this will remove a lot of confusion, so I'm closing this. Please open a new issue against the current master if you're going to try it.

pinootto commented 7 years ago

Thanks, sgotti.