redhat-openstack / openshift-on-openstack

A place to write templates, docs etc. for deploying OpenShift on OpenStack.
Apache License 2.0

metric's self-signed certs don't have cluster_network IPs #177

Open juddmaltin-dell opened 8 years ago

juddmaltin-dell commented 8 years ago

I'm seeing two types of errors in my heapster pod logs:

DNS lookup errors:

W0712 11:34:35.168016       1 node_aggregator.go:44] Failed to find node: node:flannel2-openshift-master-0.flannel9example.com
W0712 11:34:45.110316       1 node_aggregator.go:44] Failed to find node: node:flannel2-openshift-node-025hzih1.flannel9example.com

and Cert errors:

E0711 17:03:45.034823       1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "https://10.9.0.4:10250/stats/container/": Post https://10.9.0.4:10250/stats/container/: x509: certificate is valid for 172.30.0.1, 192.168.191.20, 192.168.9.5, not 10.9.0.4
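The SANs a node actually presents on 10250 can be checked live with `openssl s_client -connect 10.9.0.4:10250` piped into `openssl x509 -noout -text`. As a self-contained sketch of the same check, here's a throwaway cert minted with the exact SAN set from the error above, which shows 10.9.0.4 is indeed absent (paths are illustrative; `-addext` needs OpenSSL 1.1.1+):

```shell
# Mint a throwaway cert with the SAN list the x509 error reports
# (IPs copied from the error message; /tmp paths are illustrative).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/node.key -out /tmp/node.crt \
  -subj "/CN=node" \
  -addext "subjectAltName=IP:172.30.0.1,IP:192.168.191.20,IP:192.168.9.5"

# Print the SAN line: 10.9.0.4 does not appear, so TLS verification
# of a connection to that address must fail.
openssl x509 -noout -text -in /tmp/node.crt | grep -A1 'Subject Alternative Name'
```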

10.9.0.4 is my cluster_network address; openshift-on-openstack sets up some extra networks:

[osp_admin@director ~]$ nova list  | grep 10.9.0.4
| d7874418-1b87-404b-bdbb-837eb57fb728 | flannel2-openshift-master-0.flannel9example.com      | ACTIVE  | -          | Running     | flannel2-cluster_network-bwpjwgoahijt=10.9.0.4; flannel2-fixed_network-fj36536xtyns=192.168.9.5, 192.168.191.20 |
[osp_admin@director ~]$

OpenShift is listening on 53/UDP. I also have a DNS server on a separate VM for access from outside; /etc/resolv.conf on my OpenShift VMs points to that external DNS server.

Here's an AXFR from that DNS server:

[root@flannel2-infra ~]# dig @localhost flannel9example.com AXFR
;; Connection to ::1#53(::1) for flannel9example.com failed: connection refused.

; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.3 <<>> @localhost flannel9example.com AXFR
; (2 servers found)
;; global options: +cmd
flannel9example.com.    86400   IN      SOA     flannel2-infra.flannel9example.com. openshift.flannel9example.com. 1467127738 43200 180 2419200 10800
flannel9example.com.    86400   IN      NS      flannel2-infra.flannel9example.com.
*.cloudapps.flannel9example.com. 86400 IN A     192.168.191.20
cloudforms.flannel9example.com. 86400 IN A      192.168.191.22
flannel2-infra.flannel9example.com. 86400 IN A  192.168.9.4
flannel2-lb.flannel9example.com. 86400 IN A     192.168.191.30
flannel2-openshift-master-0.flannel9example.com. 86400 IN A 192.168.9.5
flannel2-openshift-master-1.flannel9example.com. 86400 IN A 192.168.9.6
flannel2-openshift-node-025hzih1.flannel9example.com. 86400 IN A 192.168.9.10
flannel2-openshift-node-6033jc3s.flannel9example.com. 86400 IN A 192.168.9.9
flannel2-openshift-node-pgsis138.flannel9example.com. 86400 IN A 192.168.9.8
hawkular-metrics.flannel9example.com. 86400 IN A 192.168.191.30
overcloud.flannel9example.com. 86400 IN A       192.168.190.125
undercloud.flannel9example.com. 86400 IN A      192.168.120.61
flannel9example.com.    86400   IN      SOA     flannel2-infra.flannel9example.com. openshift.flannel9example.com. 1467127738 43200 180 2419200 10800
;; Query time: 1 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Jul 12 11:42:16 EDT 2016
;; XFR size: 15 records (messages 1, bytes 568)

Any hints? Do I really have to gen my own certs? Frowny. If I change /etc/resolv.conf to point at wherever OpenShift is listening on 53/UDP, will that break the system?

Bonus question: Since the hawkular-metrics.example.com name is in the OpenShift DNS, where do I tell the openshift-infra VM's BIND to look for the address? Should I just put the IP address of the router, and hope that it resolves correctly?
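For what it's worth, the AXFR above already shows the two candidate targets: apps behind the OpenShift router resolve via the *.cloudapps wildcard (192.168.191.20), while hawkular-metrics currently points at the load balancer (192.168.191.30). The usual approach is a plain A record at whichever of those actually fronts the route; e.g. a sketch of the zone entry if it should go through the router like the wildcard does:

```
; sketch: send the metrics hostname through the OpenShift router,
; same target as the *.cloudapps wildcard in this zone
hawkular-metrics.flannel9example.com. 86400 IN A 192.168.191.20
```

Which IP is correct depends on whether the hawkular-metrics route is exposed via the router or via the LB in this deployment.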

Can't this all be done with names?

Many thanks! -judd

juddmaltin-dell commented 8 years ago

Turns out the node certs created by openshift-on-openstack (or openshift-ansible, not sure yet) are invalid.

https://github.com/openshift/origin-metrics/issues/168

Have a look at that thread. It seems this issue only occurs during direct node access.

Let's debug.

jprovaznik commented 8 years ago

Hi, sorry, I overlooked this issue. I guess it's not solved yet, is it? My first guess is that it's caused by the fact that we use a separate network for inter-pod communication, and the node's IP on that network is missing from the node certificate's subject alternative names. I'll see if I can reproduce this locally, but I'm not sure yet what the simplest reproducer is.
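If that's the cause, the fix direction would be regenerating the node serving certs with the cluster_network IP added to the SAN list (with origin's CA tooling, something like `oadm ca create-server-cert --hostnames=...`; the exact invocation is an assumption here). A local sketch of what the corrected SAN set should contain, using plain openssl (all names/IPs taken from this thread; `-addext` needs OpenSSL 1.1.1+):

```shell
# Mint a throwaway cert whose SANs include the cluster_network IP
# (10.9.0.4) alongside the service and fixed_network IPs already
# present in the failing cert. Paths are illustrative.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/fixed-node.key -out /tmp/fixed-node.crt \
  -subj "/CN=flannel2-openshift-master-0.flannel9example.com" \
  -addext "subjectAltName=IP:172.30.0.1,IP:192.168.191.20,IP:192.168.9.5,IP:10.9.0.4"

# The SAN line should now list 10.9.0.4, which is what heapster needs
# when it scrapes the kubelet over the cluster network.
openssl x509 -noout -text -in /tmp/fixed-node.crt | grep -A1 'Subject Alternative Name'
```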