kahkhang / kube-linode

:whale: Provision a Kubernetes/CoreOS cluster on Linode
MIT License

SSL_ERROR_INTERNAL_ERROR_ALERT on Ingress services #72

Closed thefinn93 closed 6 years ago

thefinn93 commented 6 years ago

I just discovered this project. It's very cool and I'm excited to use it. I ran it, and according to the terminal output of the script it worked, but it doesn't seem to be fully functional. I can use kubectl from my local machine to interact with the cluster and I can see the dashboard with kubectl proxy, but https://kube.example.com doesn't load. Firefox says:

An error occurred during a connection to kube.k8s.janky.solutions. Peer reports it experienced an internal error. Error code: SSL_ERROR_INTERNAL_ERROR_ALERT

This seems to be the case for all of the services that are listed in the README.

For the root domain and seemingly any subdomain not listed in the README, I get an invalid, self-signed cert for 8629e935e621618727d9710d574650c6.ec78bb20354254b7355bc5f0e895c417.traefik.default.

Possibly related, possibly unrelated: I noticed when browsing around in the dashboard that the rook-agent pods didn't come up, saying:

Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 173.255.243.5 173.255.244.5 173.230.145.5
Error: failed to start container "rook-agent": Error response from daemon: error while creating mount source path '/usr/libexec/kubernetes/kubelet-plugins/volume/exec': mkdir /usr/libexec/kubernetes: read-only file system
Back-off restarting failed container

And one of the rook-ceph-osd pods failed to come up, saying:

network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]
Back-off restarting failed container

I'm not really sure what is inter-related, or how to proceed. I'm happy to post any logs that are needed.

kahkhang commented 6 years ago

Hmm, interesting. I wonder if this is related to the nested subdomain: if Let's Encrypt doesn't support issuing such a certificate, Traefik would fall back to the default traefik.default one. The logs of the traefik pod should give you more information about what went awry during the ACME process. So far I've only tested it on kube.example.com, not on kube.example.example.com.

The rook pod not coming up is a different error. It looks like a problem that was resolved earlier (https://github.com/rook/rook/issues/1162) has come up again. I'll debug this further.

kahkhang commented 6 years ago

Can I also find out the size of your worker nodes? Rook is pretty resource-intensive (you'll need at least 2 GB nodes for each OSD pod, see http://docs.ceph.com/docs/jewel/start/hardware-recommendations/), so that might be why the rook-ceph-osd pod failed. The message

network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]

is part of the boot up process because of the self-hosted network bootstrap, but it seems that the pod failed to come up in time (might also be related to resource issues).

thefinn93 commented 6 years ago

Oh, wow I didn't even notice the domain was in that error. I've had no problem issuing certs from Let's Encrypt for nested subdomains before. There's a TON of logs from the traefik pod, but they seem to include things like (hashed) passwords so I'm going to try to debug that by myself a bit before posting the full log (I'll post my findings either way). As for node sizes, my master is a 2GB node and my workers are 1GB, but I could bump that up. I didn't realize that would be a problem. I'm gonna debug the traefik thing, then look into the rook thing. Thanks for the feedback so far.

thefinn93 commented 6 years ago

Okay, I did a little digging through the logs and found this repeated over and over (I grep'd out debug lines to make it a little more readable):

time="2018-01-11T00:04:35Z" level=info msg="Server configuration reloaded on :443"
time="2018-01-11T00:04:35Z" level=info msg="Server configuration reloaded on :80"
time="2018-01-11T00:04:35Z" level=error msg="map[kube.k8s.janky.solutions:[kube.k8s.janky.solutions] acme: Could not determine solvers]"
time="2018-01-11T00:04:35Z" level=error msg="Error getting ACME certificates [kube.k8s.janky.solutions] : Cannot obtain certificates map[kube.k8s.janky.solutions:[kube.k8s.janky.solutions] acme: Could not determine solvers]+v"
time="2018-01-11T00:04:35Z" level=error msg="map[prometheus.k8s.janky.solutions:[prometheus.k8s.janky.solutions] acme: Could not determine solvers]"
time="2018-01-11T00:04:35Z" level=error msg="Error getting ACME certificates [prometheus.k8s.janky.solutions] : Cannot obtain certificates map[prometheus.k8s.janky.solutions:[prometheus.k8s.janky.solutions] acme: Could not determine solvers]+v"
time="2018-01-11T00:04:35Z" level=error msg="map[traefik.k8s.janky.solutions:[traefik.k8s.janky.solutions] acme: Could not determine solvers]"
time="2018-01-11T00:04:35Z" level=error msg="Error getting ACME certificates [traefik.k8s.janky.solutions] : Cannot obtain certificates map[traefik.k8s.janky.solutions:[traefik.k8s.janky.solutions] acme: Could not determine solvers]+v"
time="2018-01-11T00:04:36Z" level=error msg="map[alertmanager.k8s.janky.solutions:[alertmanager.k8s.janky.solutions] acme: Could not determine solvers]"
time="2018-01-11T00:04:36Z" level=error msg="Error getting ACME certificates [alertmanager.k8s.janky.solutions] : Cannot obtain certificates map[alertmanager.k8s.janky.solutions:[alertmanager.k8s.janky.solutions] acme: Could not determine solvers]+v"
time="2018-01-11T00:04:36Z" level=error msg="map[grafana.k8s.janky.solutions:[grafana.k8s.janky.solutions] acme: Could not determine solvers]"
time="2018-01-11T00:04:36Z" level=error msg="Error getting ACME certificates [grafana.k8s.janky.solutions] : Cannot obtain certificates map[grafana.k8s.janky.solutions:[grafana.k8s.janky.solutions] acme: Could not determine solvers]+v"

I'm starting to wonder if it's related to the ongoing Let's Encrypt issue where TLS-SNI validation is disabled.
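
For anyone repeating this triage, here is a minimal Python sketch of the same filtering: it drops debug lines and counts ACME failures per domain. The function names (`drop_debug`, `failed_acme_domains`) and the regular expressions are assumptions made for illustration, based only on the log format shown above. The debug line in the sample is written in the style of Traefik's output and is there only so the filter has something to drop.

```python
import re
from collections import Counter

# Matches the level=... field of a Traefik log line (format assumed from the logs above).
LEVEL_RE = re.compile(r'\blevel=(\w+)')
# Captures the domain in "Error getting ACME certificate(s) [domain]" messages.
ACME_ERROR_RE = re.compile(
    r'Error getting ACME certificates?(?: for domain)? \[([^\]]+)\]'
)

# Sample input: lines copied from the log above, plus one illustrative debug line.
SAMPLE_LOG = '''\
time="2018-01-11T00:04:35Z" level=info msg="Server configuration reloaded on :443"
time="2018-01-11T00:04:35Z" level=debug msg="Loading ACME certificates [kube.k8s.janky.solutions]..."
time="2018-01-11T00:04:35Z" level=error msg="map[kube.k8s.janky.solutions:[kube.k8s.janky.solutions] acme: Could not determine solvers]"
time="2018-01-11T00:04:35Z" level=error msg="Error getting ACME certificates [kube.k8s.janky.solutions] : Cannot obtain certificates map[kube.k8s.janky.solutions:[kube.k8s.janky.solutions] acme: Could not determine solvers]+v"
time="2018-01-11T00:04:35Z" level=error msg="Error getting ACME certificates [prometheus.k8s.janky.solutions] : Cannot obtain certificates map[prometheus.k8s.janky.solutions:[prometheus.k8s.janky.solutions] acme: Could not determine solvers]+v"
'''


def drop_debug(lines):
    """Yield every log line whose level is not 'debug'."""
    for line in lines:
        match = LEVEL_RE.search(line)
        if match is None or match.group(1) != "debug":
            yield line


def failed_acme_domains(lines):
    """Return a Counter mapping each domain to its number of ACME failures."""
    counts = Counter()
    for line in drop_debug(lines):
        match = ACME_ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for domain, count in failed_acme_domains(SAMPLE_LOG.splitlines()).items():
        print(f"{domain}: {count} failed attempt(s)")
```

In practice the input would come from the traefik pod's logs saved to a file rather than from the inline sample.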

kahkhang commented 6 years ago

Interesting, I didn't know it was affecting production as well; I thought it was limited to the staging v2 environment. That is most likely the issue then 😞 (it may be worth watching https://github.com/jetstack/kube-lego/issues for updates).

thefinn93 commented 6 years ago

Since it's only the TLS-SNI method that's been disabled, I'm looking into options for DNS or HTTP validation. Traefik doesn't seem to support HTTP validation, but it does support DNS validation and even has a plugin for Linode. Is the Linode API key provided at the beginning stored as a secret in kubernetes? I couldn't find it by skimming the list of secrets, but I may have just missed it.

kahkhang commented 6 years ago

Ah, I didn't know this feature was available. I'll look into it. Unfortunately, no, the API key is not stored as a secret, but doing that would be an interesting idea.

kahkhang commented 6 years ago

Can confirm that I'm getting these errors as well:

time="2018-01-11T16:51:12Z" level=error msg="Error getting ACME certificate for domain [kahkhang.me]: Cannot obtain certificates map[kahkhang.me:[kahkhang.me] acme: Could not determine solvers]+v"
time="2018-01-11T16:51:12Z" level=info msg="Retrieved ACME certificates"
time="2018-01-11T16:51:12Z" level=debug msg="Testing certificate renew..."
time="2018-01-11T16:51:12Z" level=debug msg="LoadCertificateForDomains [kube.kahkhang.me]..."
time="2018-01-11T16:51:12Z" level=debug msg="Look for provided certificate to validate [kube.kahkhang.me]..."
time="2018-01-11T16:51:12Z" level=debug msg="No provided certificate found for domains [kube.kahkhang.me], get ACME certificate."
time="2018-01-11T16:51:12Z" level=debug msg="Loading ACME certificates [kube.kahkhang.me]..."
time="2018-01-11T16:51:12Z" level=error msg="map[kube.kahkhang.me:[kube.kahkhang.me] acme: Could not determine solvers]"
time="2018-01-11T16:51:12Z" level=error msg="Error getting ACME certificates [kube.kahkhang.me] : Cannot obtain certificates map[kube.kahkhang.me:[kube.kahkhang.me] acme: Could not determine solvers]+v"
time="2018-01-11T16:51:12Z" level=debug msg="LoadCertificateForDomains [traefik.kahkhang.me]..."
time="2018-01-11T16:51:12Z" level=debug msg="Look for provided certificate to validate [traefik.kahkhang.me]..."
time="2018-01-11T16:51:12Z" level=debug msg="No provided certificate found for domains [traefik.kahkhang.me], get ACME certificate."
time="2018-01-11T16:51:12Z" level=debug msg="Loading ACME certificates [traefik.kahkhang.me]..."
time="2018-01-11T16:51:13Z" level=error msg="map[traefik.kahkhang.me:[traefik.kahkhang.me] acme: Could not determine solvers]"
time="2018-01-11T16:51:13Z" level=error msg="Error getting ACME certificates [traefik.kahkhang.me] : Cannot obtain certificates map[traefik.kahkhang.me:[traefik.kahkhang.me] acme: Could not determine solvers]+v"

Traefik seems to have some updates regarding this issue (https://twitter.com/traefikproxy), but I'll need to sift through them. Hopefully this gets resolved soon; it's an upstream issue.

kahkhang commented 6 years ago

Rook is also crashing because of https://github.com/rook/rook/issues/1330; a fix is on the way.

kahkhang commented 6 years ago

As a workaround for now, to access the Kubernetes dashboard, run kubectl proxy and then open http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/
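
For reference, here is a small Python sketch of how that proxy URL is put together, assuming kubectl proxy is listening on its default localhost:8001. The helper name `service_proxy_url` and its parameters are hypothetical; the path layout is simply read off the URL above.

```python
def service_proxy_url(namespace, service, scheme="https", port_name="",
                      host="localhost:8001"):
    """Build the API-server proxy URL for a service exposed via `kubectl proxy`.

    The service segment has the form scheme:service:port_name; the port name
    is left empty here, as in the dashboard URL from this thread.
    """
    target = f"{scheme}:{service}:{port_name}"
    return f"http://{host}/api/v1/namespaces/{namespace}/services/{target}/proxy/"


if __name__ == "__main__":
    # Prints the dashboard URL used in the workaround.
    print(service_proxy_url("kube-system", "kubernetes-dashboard"))
```

With kubectl proxy running in another terminal, the printed URL can be opened in a browser.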

kahkhang commented 6 years ago

Once https://github.com/containous/traefik/pull/2701 is merged, I'll be able to fix this :)

nbiles commented 6 years ago

When do you think this can be fixed? I can't deploy anything now because I can't get traefik to work with any of my domains. I'm wondering if there is something I can do.

kahkhang commented 6 years ago

You can disable https for now by changing the traefik manifest. The fix will be available once a newer version of Traefik is released (probably by Tuesday).