It seems that postgres is not running inside the container.
If I manually run /usr/local/bin/run.sh, I get the following output:
[root@stolon-keeper-rc-hou7z bin]# ./run.sh
start
HOSTNAME=stolon-keeper-rc-hou7z
STKEEPER_CLUSTER_NAME=kube-stolon
KUBERNETES_PORT=tcp://10.254.0.1:443
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_HOST=10.254.0.1
KEEPER=true
STKEEPER_STORE_ENDPOINTS=192.168.33.10:2379
LS_COLORS=
NGINX_SERVICE_PORT_8000_TCP_ADDR=10.254.97.124
NGINX_SERVICE_SERVICE_HOST=10.254.97.124
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/usr/local/bin
NGINX_SERVICE_PORT=tcp://10.254.97.124:8000
NGINX_SERVICE_PORT_8000_TCP=tcp://10.254.97.124:8000
NGINX_SERVICE_PORT_8000_TCP_PORT=8000
HOME=/root
SHLVL=2
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_SERVICE_PORT_HTTPS=443
NGINX_SERVICE_SERVICE_PORT=8000
NGINX_SERVICE_PORT_8000_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_ADDR=10.254.0.1
STKEEPER_STORE_BACKEND=etcd
KUBERNETES_PORT_443_TCP=tcp://10.254.0.1:443
POD_IP=172.17.16.13
STKEEPER_DEBUG=true
_=/usr/bin/env
2016-03-24 06:45:18.368835 [keeper.go:898] W | keeper: both --pg-su-username and --pg-su-password needs to be defined to use pg_rewind
2016-03-24 06:45:18.368954 [keeper.go:904] C | keeper: cannot take exclusive lock on data dir "/stolon-data": file already locked
@pinootto Can you provide the logs for the stolon-keeper pod (kubectl log $podname)?
The lock error means that the stolon keeper is already running inside the pod (otherwise the pod would have exited). What's probably not running is postgres. This can happen for various reasons (error talking to the etcd cluster, failed initdb, etc.).
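One quick way to confirm this (a sketch, not from the original exchange; it assumes ps is available in the image) is to look at the processes running inside the keeper pod:

# list processes in the keeper pod; you'd expect to see stolon-keeper and,
# if initdb and the postgres start succeeded, a postgres process
kubectl exec $podname -- ps aux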
[root@localhost kubernetes]# k log stolon-keeper-rc-5okc8
W0324 05:48:13.308432 7226 cmd.go:200] log is DEPRECATED and will be removed in a future version. Use logs instead.
start
HOSTNAME=stolon-keeper-rc-5okc8
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT=tcp://10.254.0.1:443
STKEEPER_CLUSTER_NAME=kube-stolon
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_HOST=10.254.0.1
STKEEPER_STORE_ENDPOINTS=192.168.33.10:2379
KEEPER=true
NGINX_SERVICE_PORT_8000_TCP_ADDR=10.254.97.124
NGINX_SERVICE_SERVICE_HOST=10.254.97.124
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/
NGINX_SERVICE_PORT=tcp://10.254.97.124:8000
NGINX_SERVICE_PORT_8000_TCP=tcp://10.254.97.124:8000
SHLVL=1
HOME=/root
NGINX_SERVICE_PORT_8000_TCP_PORT=8000
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_SERVICE_PORT_HTTPS=443
NGINX_SERVICE_SERVICE_PORT=8000
NGINX_SERVICE_PORT_8000_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_ADDR=10.254.0.1
KUBERNETES_PORT_443_TCP=tcp://10.254.0.1:443
STKEEPER_STORE_BACKEND=etcd
POD_IP=172.17.16.18
STKEEPER_DEBUG=true
_=/usr/bin/env
2016-03-24 09:47:23.024724 [keeper.go:898] W | keeper: both --pg-su-username and --pg-su-password needs to be defined to use pg_rewind
2016-03-24 09:47:23.025144 [keeper.go:927] I | keeper: generated id: 341474c5
2016-03-24 09:47:23.025300 [keeper.go:934] I | keeper: id: 341474c5
2016-03-24 09:47:23.028038 [keeper.go:361] D | keeper: clusterView: (*cluster.ClusterView){Version:(int)0 Master:(string) KeepersRole:(cluster.KeepersRole)map[] ProxyConf:(*cluster.ProxyConf)
etcd is running on the master, which has the following IP addresses:
[root@localhost kubernetes]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:7c:4f:9a brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic eth0
       valid_lft 53145sec preferred_lft 53145sec
    inet6 fe80::5054:ff:fe7c:4f9a/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:a1:83:f4 brd ff:ff:ff:ff:ff:ff
    inet 192.168.33.10/24 brd 192.168.33.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fea1:83f4/64 scope link
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether 02:42:e7:43:06:96 brd ff:ff:ff:ff:ff:ff
    inet 172.17.42.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:e7ff:fe43:696/64 scope link
       valid_lft forever preferred_lft forever
[root@localhost kubernetes]# cat stolon-keeper.yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: stolon-keeper-rc
spec:
  replicas: 1
  selector:
    name: stolon-keeper
  template:
    metadata:
      labels:
        name: stolon-keeper
        stolon-cluster: "kube-stolon"
        stolon-keeper: "true"
    spec:
      containers:
      - image: 192.168.33.1:5000/sorintlab/stolon:master
        #image: sorintlab/stolon:latest
        #image: 192.168.33.1:5000/sorintlab/stolon:latest
        env:
        - name: KEEPER
          value: "true"
        - name: STKEEPER_CLUSTER_NAME
          # TODO(sgotti) Get cluster name from "stoloncluster" label using a downward volume api instead of duplicating the name here
          value: "kube-stolon"
        - name: STKEEPER_STORE_BACKEND
          value: "etcd" # Or consul
        - name: STKEEPER_STORE_ENDPOINTS
          value: "192.168.33.10:2379"
        # Enable debugging
        - name: STKEEPER_DEBUG
          value: "true"
        ports:
        - containerPort: 5431
        - containerPort: 5432
        volumeMounts:
        - mountPath: /stolon-data
          name: data
      volumes:
from the master:
[root@localhost kubernetes]# k get po -o wide
NAME                       READY     STATUS    RESTARTS   AGE       NODE
stolon-keeper-rc-5okc8     1/1       Running   0          7m        192.168.33.11
stolon-sentinel-rc-92mpw   1/1       Running   0          10m       192.168.33.11
IP addresses of the node:
[root@localhost kubernetes]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:7c:4f:9a brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic eth0
       valid_lft 76972sec preferred_lft 76972sec
    inet6 fe80::5054:ff:fe7c:4f9a/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:49:36:22 brd ff:ff:ff:ff:ff:ff
    inet 192.168.33.11/24 brd 192.168.33.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe49:3622/64 scope link
       valid_lft forever preferred_lft forever
4: flannel0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1472 qdisc pfifo_fast state UNKNOWN qlen 500
    link/none
    inet 172.17.16.0/16 scope global flannel0
       valid_lft forever preferred_lft forever
5: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1472 qdisc noqueue state UP
    link/ether 02:42:9d:61:6a:37 brd ff:ff:ff:ff:ff:ff
    inet 172.17.16.1/24 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:9dff:fe61:6a37/64 scope link
       valid_lft forever preferred_lft forever
37: vethdd2631b@if36: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1472 qdisc noqueue master docker0 state UP
    link/ether f6:94:ec:44:be:2d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::f494:ecff:fe44:be2d/64 scope link
       valid_lft forever preferred_lft forever
39: veth6c41f5d@if38: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1472 qdisc noqueue master docker0 state UP
    link/ether ca:07:87:c4:75:df brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::c807:87ff:fec4:75df/64 scope link
       valid_lft forever preferred_lft forever
[keeper.go:489] E | keeper: error retrieving cluster view: client: etcd cluster is unavailable or misconfigured
Looks like the pod cannot connect to etcd at STKEEPER_STORE_ENDPOINTS=192.168.33.10:2379.
You should check that your pod can reach it and that etcd is advertising the correct --advertise-client-urls (telnet is not enough, since you may be able to connect but etcd could then advertise another URL like localhost:2379).
The fastest way would be to install etcdctl in the pod (or use a dedicated pod) and use it to test etcd connectivity.
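For a quick check without etcdctl, querying the etcd v2 members endpoint from inside the pod should already reveal the advertised client URLs (a rough sketch; the exact JSON layout depends on the etcd version):

# run from inside the pod; the clientURLs field should contain
# http://192.168.33.10:2379 and not something like localhost:2379
curl -s http://192.168.33.10:2379/v2/members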
I think that the pod can connect to etcd.
I get the following logs from the sentinel:
[root@localhost kubernetes]# k logs stolon-sentinel-rc-drwvq
start
STSENTINEL_KEEPER_KUBE_LABEL_SELECTOR=stolon-cluster=kube-stolon,stolon-keeper=true
HOSTNAME=stolon-sentinel-rc-drwvq
KUBERNETES_PORT=tcp://10.254.0.1:443
KUBERNETES_PORT_443_TCP_PORT=443
STSENTINEL_DEBUG=true
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_HOST=10.254.0.1
STSENTINEL_STORE_BACKEND=etcd
STSENTINEL_CLUSTER_NAME=kube-stolon
NGINX_SERVICE_PORT_8000_TCP_ADDR=10.254.97.124
NGINX_SERVICE_SERVICE_HOST=10.254.97.124
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
SENTINEL=true
PWD=/
NGINX_SERVICE_PORT=tcp://10.254.97.124:8000
STSENTINEL_STORE_ENDPOINTS=192.168.33.10:2379
NGINX_SERVICE_PORT_8000_TCP=tcp://10.254.97.124:8000
SHLVL=1
HOME=/root
NGINX_SERVICE_PORT_8000_TCP_PORT=8000
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_SERVICE_PORT_HTTPS=443
NGINX_SERVICE_PORT_8000_TCP_PROTO=tcp
NGINX_SERVICE_SERVICE_PORT=8000
KUBERNETES_PORT_443_TCP_ADDR=10.254.0.1
KUBERNETES_PORT_443_TCP=tcp://10.254.0.1:443
PODIP=172.17.16.8
_=/usr/bin/env
2016-03-28 02:21:11.760155 [sentinel.go:845] I | sentinel: id: 1f823694
2016-03-28 02:21:11.760908 [sentinel.go:85] I | sentinel: Trying to acquire sentinels leadership
2016-03-28 02:21:11.771495 [sentinel.go:742] D | sentinel: keepersState: (cluster.KeepersState)
Maybe the problem is a misconfiguration of etcd.
Here is my etcd config on 192.168.33.10:
[root@localhost kubernetes]# cat /etc/etcd/etcd.conf
ETCD_NAME=default
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
#
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.33.10:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.33.10:2379"
#
#
#
Could you please have a look at it and see whether anything is misconfigured?
Does the cluster name have to be the same in /etc/etcd/etcd.conf and stolon-sentinel.yaml?
If you need other info to investigate the problem, please let me know.
Here is some other info:
from the kubernetes master (192.168.33.10):
[root@localhost kubernetes]# k get po -o wide
NAME                       READY     STATUS    RESTARTS   AGE       NODE
stolon-sentinel-rc-drwvq   1/1       Running   0          36m       192.168.33.11
then on the minion node (192.168.33.11):
[root@localhost bin]# ./stolonctl --cluster-name=kube-stolon --store-backend=etcd --store-endpoints=192.168.33.10:2379 status
=== Active sentinels ===
No active sentinels
cannot get proxies info: client: etcd cluster is unavailable or misconfigured
Also the sentinel log says that it can't connect to the etcd cluster.
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.33.10:2379"
Your --advertise-client-url looks correct.
Does the cluster name have to be the same in /etc/etcd/etcd.conf and stolon-sentinel.yaml?
No, they are completely unrelated. The first is the etcd cluster name; the other is the stolon cluster name, which is used to compute the etcd key paths for the stolon cluster data.
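As an illustration of where the stolon cluster name ends up (a sketch; the exact key layout is an assumption and may vary between stolon versions):

# the stolon cluster name only determines the key prefix inside etcd
# (something like /stolon/cluster/kube-stolon/...), while ETCD_NAME names
# the etcd member itself
etcdctl --endpoint http://192.168.33.10:2379 ls --recursive /stolon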
My suggestion is to try using etcdctl inside a pod to see if it can connect to the etcd cluster. You can create a new pod or just copy the etcdctl binary into one of the stolon pods (you can use docker cp after retrieving the related docker container id). Then run something like: etcdctl --debug --endpoint http://192.168.33.10:2379 ls /
If all goes ok you'll see something like this:
start to sync cluster using endpoints(http://192.168.33.10:2379)
cURL Command: curl -X GET http://192.168.33.10:2379/v2/members
got endpoints(http://192.168.33.10:2379) after sync
Cluster-Endpoints: http://192.168.33.10:2379
cURL Command: curl -X GET http://192.168.33.10:2379/v2/keys/?quorum=false&recursive=false&sorted=false
[...]
As you can see, there's a first connection to the provided endpoint (the debug output shows the equivalent curl commands) where the client retrieves the cluster endpoints (the advertised client URLs) and then uses them to query one of the cluster nodes.
If it doesn't work, we can see at which point it breaks.
OK.
I tried to run the etcdctl command inside the sentinel pod. It seems to work:
[root@stolon-sentinel-rc-82xq7 ~]# ./etcdctl --debug --endpoint http://192.168.33.10:2379 ls /
start to sync cluster using endpoints(http://192.168.33.10:2379)
cURL Command: curl -X GET http://192.168.33.10:2379/v2/members
got endpoints(http://192.168.33.10:2379) after sync
Cluster-Endpoints: http://192.168.33.10:2379
cURL Command: curl -X GET http://192.168.33.10:2379/v2/keys/?quorum=false&recursive=false&sorted=false
/atomic.io
/registry
/stolon
I don't know why, but now the pod logs have changed:
2016-03-29 00:55:39.787916 [sentinel.go:845] I | sentinel: id: 565bf6df
2016-03-29 00:55:39.803030 [sentinel.go:85] I | sentinel: Trying to acquire sentinels leadership
2016-03-29 00:55:39.807937 [sentinel.go:95] I | sentinel: sentinel leadership acquired
2016-03-29 00:55:39.808151 [sentinel.go:742] D | sentinel: keepersState: (cluster.KeepersState)<nil>
2016-03-29 00:55:39.808506 [sentinel.go:743] D | sentinel: clusterView: (*cluster.ClusterView){Version:(int)0 Master:(string) KeepersRole:(cluster.KeepersRole)map[] ProxyConf:(*cluster.ProxyConf)<nil> Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> PGReplUser:(*string)<nil> PGReplPassword:(*string)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)<nil> UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)0001-01-01 00:00:00 +0000 UTC}
2016-03-29 00:55:39.808549 [sentinel.go:194] D | sentinel: sentinelInfo: (*cluster.SentinelInfo){ID:(string)565bf6df ListenAddress:(string)172.17.16.7 Port:(string)6431}
2016-03-29 00:55:39.810092 [sentinel.go:244] D | sentinel: running inside kubernetes
2016-03-29 00:55:39.810130 [sentinel.go:758] E | sentinel: err: failed to get running pods ips: cannot retrieve kube api token: open /run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
2016-03-29 00:55:44.812379 [sentinel.go:742] D | sentinel: keepersState: (cluster.KeepersState)<nil>
2016-03-29 00:55:44.812507 [sentinel.go:743] D | sentinel: clusterView: (*cluster.ClusterView){Version:(int)0 Master:(string) KeepersRole:(cluster.KeepersRole)map[] ProxyConf:(*cluster.ProxyConf)<nil> Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> PGReplUser:(*string)<nil> PGReplPassword:(*string)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)<nil> UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)0001-01-01 00:00:00 +0000 UTC}
2016-03-29 00:55:44.812541 [sentinel.go:194] D | sentinel: sentinelInfo: (*cluster.SentinelInfo){ID:(string)565bf6df ListenAddress:(string)172.17.16.7 Port:(string)6431}
2016-03-29 00:55:44.814202 [sentinel.go:244] D | sentinel: running inside kubernetes
2016-03-29 00:55:44.814259 [sentinel.go:758] E | sentinel: err: failed to get running pods ips: cannot retrieve kube api token: open /run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
Now the error is:
2016-03-29 00:55:39.810130 [sentinel.go:758] E | sentinel: err: failed to get running pods ips: cannot retrieve kube api token: open /run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
From the minion node I see the following:
[root@localhost bin]# ./stolonctl --cluster-name=kube-stolon --store-backend=etcd --store-endpoints=192.168.33.10:2379 config get
error: no clusterview available
[root@localhost bin]# ./stolonctl --cluster-name=kube-stolon --store-backend=etcd --store-endpoints=192.168.33.10:2379 status
=== Active sentinels ===
ID LISTENADDRESS LEADER
77db935e 172.17.16.12:6431 true
=== Active proxies ===
No active proxies
cluster data not available: <nil>
So from the minion node I tried to insert the configuration with the following command, but I get an error:
[root@localhost bin]# echo '{ "request_timeout": "10s", "sleep_interval": "5s", "keeper_fail_interval": "20s", "pg_repl_user": "username", "pg_repl_password": "password", "max_standbys_per_sender": 3, "synchronous_replication": false, "init_with_multiple_keepers": false, "use_pg_rewind": false, "pg_parameters": null }' | ./stolonctl --cluster-name=kube-stolon --store-backend=etcd --store-endpoints=192.168.33.10:2379 config replace -f -
error: error setting config: Put http://172.17.16.12:6431/config/current: EOF
I added ServiceAccount to the apiserver configuration and followed these instructions to create a secret (a verification sketch follows the steps below):
1. Create the CA and server keys/certificates:
openssl genrsa -out ca.key 2048
openssl req -x509 -new -nodes -key ca.key -subj "/CN=kalix.com" -days 5000 -out ca.crt
openssl genrsa -out server.key 2048
openssl req -new -key server.key -subj "/CN=kubernetes-master" -out server.csr
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out server.crt -days 5000
2. cp ./* /var/run/kubernetes/
3. ## vim /etc/kubernetes/apiserver
KUBE_API_ARGS="--client_ca_file=/var/run/kubernetes/ca.crt --tls-private-key-file=/var/run/kubernetes/server.key --tls-cert-file=/var/run/kubernetes/server.crt"
systemctl restart kube-apiserver
4. ## vim /etc/kubernetes/controller-manager
KUBE_CONTROLLER_MANAGER_ARGS="--service_account_private_key_file=/var/run/kubernetes/apiserver.key --root-ca-file=/var/run/kubernetes/ca.crt"
systemctl restart kube-controller-manager
5. ## then
kubectl get serviceaccounts --all-namespaces
NAMESPACE NAME SECRETS AGE
default default 1 8d
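A quick way to verify that the token secret is now actually mounted into the pod (a sketch, not part of the original steps):

# after recreating the pod, the serviceaccount secret should be mounted here;
# expected files: ca.crt  namespace  token
kubectl exec $podname -- ls /run/secrets/kubernetes.io/serviceaccount/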
So, now the error has changed to:
2016-03-29 07:52:13.376666 [sentinel.go:845] I | sentinel: id: d345d92d
2016-03-29 07:52:13.427320 [sentinel.go:85] I | sentinel: Trying to acquire sentinels leadership
2016-03-29 07:52:13.430309 [sentinel.go:742] D | sentinel: keepersState: (cluster.KeepersState)<nil>
2016-03-29 07:52:13.430399 [sentinel.go:743] D | sentinel: clusterView: (*cluster.ClusterView){Version:(int)0 Master:(string) KeepersRole:(cluster.KeepersRole)map[] ProxyConf:(*cluster.ProxyConf)<nil> Config:(*cluster.NilConfig){RequestTimeout:(*cluster.Duration)<nil> SleepInterval:(*cluster.Duration)<nil> KeeperFailInterval:(*cluster.Duration)<nil> PGReplUser:(*string)<nil> PGReplPassword:(*string)<nil> MaxStandbysPerSender:(*uint)<nil> SynchronousReplication:(*bool)<nil> InitWithMultipleKeepers:(*bool)<nil> UsePGRewind:(*bool)<nil> PGParameters:(*map[string]string)<nil>} ChangeTime:(time.Time)0001-01-01 00:00:00 +0000 UTC}
2016-03-29 07:52:13.430424 [sentinel.go:194] D | sentinel: sentinelInfo: (*cluster.SentinelInfo){ID:(string)d345d92d ListenAddress:(string)172.17.16.13 Port:(string)6431}
2016-03-29 07:52:13.431938 [sentinel.go:244] D | sentinel: running inside kubernetes
2016-03-29 07:52:13.433858 [sentinel.go:95] I | sentinel: sentinel leadership acquired
2016-03-29 07:52:13.439512 [sentinel.go:758] E | sentinel: err: failed to get running pods ips: Get https://10.254.0.1:443/api/v1/namespaces/default/pods?labelSelector=stolon-cluster%3Dkube-stolon%2Cstolon-keeper%3Dtrue: read tcp 172.17.16.13:41308->10.254.0.1:443: read: connection reset by peer
Using nmap from the minion node, I found that the 443 port is "filtered":
[root@localhost bin]# nmap 10.254.0.1 -p 443
Starting Nmap 6.40 ( http://nmap.org ) at 2016-03-29 04:24 EDT
Nmap scan report for 10.254.0.1
Host is up (0.00033s latency).
PORT STATE SERVICE
443/tcp filtered https
Nmap done: 1 IP address (1 host up) scanned in 0.41 seconds
I don't understand what "filtered" means.
I don't know why, but now the pod logs have changed:
Now it can connect; you probably changed something in your configuration that made etcd communication work.
2016-03-29 00:55:39.810130 [sentinel.go:758] E | sentinel: err: failed to get running pods ips: cannot retrieve kube api token: open /run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
I added ServiceAccount to the apiserver configuration and followed these instructions to create a secret:
Good. We've always deployed kubernetes clusters that have serviceAccount enabled by default, and from the docs it looks like every kubernetes pod should get this secret volume mounted. But apparently that's not always true. We should probably check for its existence.
[root@localhost bin]# nmap 10.254.0.1 -p 443
From the nmap man page: filtered means that a firewall, filter, or other network obstacle is blocking the port so that Nmap cannot tell whether it is open or closed.
But you should probably do it from a pod (a node could be using another api address). Can you check by doing a curl from a pod?
All of this is related to the stolon sentinels using the kubernetes api for discovery instead of the default discovery done through etcd (where the keepers publish their discovery info). This is cleaner but can create problems on some kubernetes deployments (like yours). I'm thinking of just adding an option to disable kubernetes discovery and use the default store-based one.
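For reference, what the sentinel does for discovery is roughly equivalent to the following API call (a sketch built from the URL in the error above; the token file is only present when serviceAccount support is working):

# query the kube apiserver for the keeper pods matching the stolon labels
TOKEN=$(cat /run/secrets/kubernetes.io/serviceaccount/token)
curl -k -H "Authorization: Bearer $TOKEN" \
  "https://10.254.0.1:443/api/v1/namespaces/default/pods?labelSelector=stolon-cluster%3Dkube-stolon%2Cstolon-keeper%3Dtrue"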
I tried from another pod (centos), not the sentinel pod, on the same node:
[root@centos /]# nmap 10.254.0.1 -p 443
Starting Nmap 6.40 ( http://nmap.org ) at 2016-03-30 02:16 UTC
Nmap scan report for 10.254.0.1
Host is up (0.000054s latency).
PORT STATE SERVICE
443/tcp open https
Nmap done: 1 IP address (1 host up) scanned in 13.04 seconds
But if I try curl from the same pod (centos), I get an error:
[root@centos /]# curl 10.254.0.1:443
curl: (56) Recv failure: Connection reset by peer
I cannot connect to the sentinel pod from the master node. I get this error:
[root@localhost kubernetes]# k get po -o wide
NAME READY STATUS RESTARTS AGE NODE
centos 1/1 Running 0 1d 192.168.33.11
postgres-kpaaf 1/1 Running 0 1d 192.168.33.11
stolon-sentinel-rc-d94y4 1/1 Running 0 17h 192.168.33.11
[root@localhost kubernetes]# k exec stolon-sentinel-rc-d94y4 -it /bin/bash
error: error executing remote command: Error executing command in container: Error executing in Docker Container: -1
Now I can connect to the sentinel pod:
[root@localhost kubernetes]# k get po -o wide
NAME READY STATUS RESTARTS AGE NODE
centos 1/1 Running 0 2d 192.168.33.11
kubernetes-dashboard-485pv 1/1 Running 0 17h 192.168.33.11
postgres-kpaaf 1/1 Running 0 2d 192.168.33.11
stolon-sentinel-rc-8h6nt 1/1 Running 0 16h 192.168.33.11
[root@localhost kubernetes]# k exec stolon-sentinel-rc-8h6nt -it /bin/bash
[root@stolon-sentinel-rc-8h6nt /]#
I tried the curl from inside the sentinel pod and I get the following error:
[root@stolon-sentinel-rc-8h6nt tmp]# curl https://10.254.0.1:443 -vv
* Rebuilt URL to: https://10.254.0.1:443/
* Trying 10.254.0.1...
* Connected to 10.254.0.1 (10.254.0.1) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: none
* NSS error -5938 (PR_END_OF_FILE_ERROR)
* Encountered end of file
* Closing connection 0
curl: (35) Encountered end of file
I tried the openssl from inside the sentinel pod and I get the following error:
[root@stolon-sentinel-rc-8h6nt /]# openssl s_client -connect 10.254.0.1:443 -msg
CONNECTED(00000003)
>>> TLS 1.2 Handshake [length 00ca], ClientHello
01 00 00 c6 03 03 e7 ca fc d2 1f 77 4f 7b 24 2e
86 b6 d9 30 8e 2e 1c e4 bd 2f b4 ab 4f 3e f1 a1
6f 27 e5 81 e5 5f 00 00 5a c0 2f c0 2b c0 27 c0
23 c0 13 c0 09 00 9c 00 3c 00 2f 00 a2 00 9e 00
67 00 40 00 33 00 32 00 41 00 45 00 44 c0 30 c0
2c c0 28 c0 24 c0 14 c0 0a 00 9d 00 3d 00 35 00
a3 00 9f 00 6b 00 6a 00 39 00 38 00 84 00 88 00
87 c0 12 c0 08 00 0a 00 16 00 13 c0 11 c0 07 00
05 00 ff 01 00 00 43 00 0b 00 04 03 00 01 02 00
0a 00 0a 00 08 00 19 00 18 00 16 00 17 00 23 00
00 00 0d 00 20 00 1e 06 01 06 02 06 03 05 01 05
02 05 03 04 01 04 02 04 03 03 01 03 02 03 03 02
01 02 02 02 03 00 0f 00 01 01
139780265867128:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:s23_lib.c:184:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 207 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
---
[root@stolon-sentinel-rc-8h6nt /]# openssl s_client -showcerts -debug -connect 10.254.0.1:443 -msg
CONNECTED(00000003)
write to 0x158adc0 [0x1684200] (207 bytes => 207 (0xCF))
0000 - 16 03 01 00 ca 01 00 00-c6 03 03 d1 ab cb 63 3f ..............c?
0010 - 31 ee d5 c2 b8 23 8f dd-ae 20 6d df 49 14 1a ba 1....#... m.I...
0020 - ab d5 c3 2c 25 95 eb 6d-16 57 4e 00 00 5a c0 2f ...,%..m.WN..Z./
0030 - c0 2b c0 27 c0 23 c0 13-c0 09 00 9c 00 3c 00 2f .+.'.#.......<./
0040 - 00 a2 00 9e 00 67 00 40-00 33 00 32 00 41 00 45 .....g.@.3.2.A.E
0050 - 00 44 c0 30 c0 2c c0 28-c0 24 c0 14 c0 0a 00 9d .D.0.,.(.$......
0060 - 00 3d 00 35 00 a3 00 9f-00 6b 00 6a 00 39 00 38 .=.5.....k.j.9.8
0070 - 00 84 00 88 00 87 c0 12-c0 08 00 0a 00 16 00 13 ................
0080 - c0 11 c0 07 00 05 00 ff-01 00 00 43 00 0b 00 04 ...........C....
0090 - 03 00 01 02 00 0a 00 0a-00 08 00 19 00 18 00 16 ................
00a0 - 00 17 00 23 00 00 00 0d-00 20 00 1e 06 01 06 02 ...#..... ......
00b0 - 06 03 05 01 05 02 05 03-04 01 04 02 04 03 03 01 ................
00c0 - 03 02 03 03 02 01 02 02-02 03 00 0f 00 01 01 ...............
>>> TLS 1.2 Handshake [length 00ca], ClientHello
01 00 00 c6 03 03 d1 ab cb 63 3f 31 ee d5 c2 b8
23 8f dd ae 20 6d df 49 14 1a ba ab d5 c3 2c 25
95 eb 6d 16 57 4e 00 00 5a c0 2f c0 2b c0 27 c0
23 c0 13 c0 09 00 9c 00 3c 00 2f 00 a2 00 9e 00
67 00 40 00 33 00 32 00 41 00 45 00 44 c0 30 c0
2c c0 28 c0 24 c0 14 c0 0a 00 9d 00 3d 00 35 00
a3 00 9f 00 6b 00 6a 00 39 00 38 00 84 00 88 00
87 c0 12 c0 08 00 0a 00 16 00 13 c0 11 c0 07 00
05 00 ff 01 00 00 43 00 0b 00 04 03 00 01 02 00
0a 00 0a 00 08 00 19 00 18 00 16 00 17 00 23 00
00 00 0d 00 20 00 1e 06 01 06 02 06 03 05 01 05
02 05 03 04 01 04 02 04 03 03 01 03 02 03 03 02
01 02 02 02 03 00 0f 00 01 01
read from 0x158adc0 [0x1689760] (7 bytes => 0 (0x0))
140527366137720:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:s23_lib.c:184:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 207 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
---
Maybe it is a problem with the SSL certificates on the apiserver, but I don't know how to check and solve it.
Any idea?
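For reference, one way to at least inspect the certificates created in the steps above and check that they are consistent (a sketch, assuming the /var/run/kubernetes paths used earlier):

# on the master: inspect the serving certificate the apiserver was given
openssl x509 -in /var/run/kubernetes/server.crt -noout -subject -issuer -dates
# verify the server certificate against the CA it was signed with
openssl verify -CAfile /var/run/kubernetes/ca.crt /var/run/kubernetes/server.crt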
For posterity, #129 added a --discovery-type option. If not specified and the stolon-sentinel detects it's being executed inside a k8s pod, it'll use the k8s api for discovering keepers. A user can pass --discovery-type=store to force keeper discovery through the store. This avoids problems like the above when the k8s cluster doesn't have security enabled/working.
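As an illustration (a sketch; the other flag names are taken from the stolonctl invocations above and assumed to match the sentinel's), forcing store-based discovery would look something like:

stolon-sentinel --cluster-name=kube-stolon --store-backend=etcd \
  --store-endpoints=192.168.33.10:2379 --discovery-type=store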
In the current master a lot of changes have been made to stolon. The discovery type has been removed and now everything is done using the store. I hope this will remove a lot of confusion, so I'm closing this. Please open a new issue against the current master if you're going to try it.
Thanks, sgotti.
When I try to connect to the postgres instance to create a password for the stolon superuser:
[stolon@stolon-keeper-rc-hwqxd ~]$ psql -h localhost -p 5432 postgres
I get the following error:
[stolon@stolon-keeper-rc-hou7z ~]$ psql -h localhost -p 5432 postgres
psql: could not connect to server: Connection refused
	Is the server running on host "localhost" (::1) and accepting
	TCP/IP connections on port 5432?
could not connect to server: Connection refused
	Is the server running on host "localhost" (127.0.0.1) and accepting
	TCP/IP connections on port 5432?