Closed thefinn93 closed 6 years ago
Hmm, interesting. I wonder if this could be related to the nested subdomain, with Let's Encrypt not supporting the issuing of such a certificate, so that it defaults to traefik.default. If you look at the logs of the traefik pod, they will give you more information about what went wrong during the ACME process. So far I've only tested it on kube.example.com, not on kube.example.example.com.
The rook pod not coming up is a different error. It seems the error has come up again (it was resolved earlier in https://github.com/rook/rook/issues/1162). I'll debug this further.
Can I also find out the size of your worker nodes? Rook is pretty resource intensive (you'll need nodes with at least 2 GB for each OSD pod, see http://docs.ceph.com/docs/jewel/start/hardware-recommendations/), so that might have caused the rook-ceph-osd pod to fail. The message
network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]
is part of the boot up process because of the self-hosted network bootstrap, but it seems that the pod failed to come up in time (might also be related to resource issues).
Oh, wow I didn't even notice the domain was in that error. I've had no problem issuing certs from Let's Encrypt for nested subdomains before. There's a TON of logs from the traefik pod, but they seem to include things like (hashed) passwords so I'm going to try to debug that by myself a bit before posting the full log (I'll post my findings either way). As for node sizes, my master is a 2GB node and my workers are 1GB, but I could bump that up. I didn't realize that would be a problem. I'm gonna debug the traefik thing, then look into the rook thing. Thanks for the feedback so far.
Okay, I did a little bit of digging through the logs and got this over and over and over (I grep'd out the debug lines to make it a little more readable):
time="2018-01-11T00:04:35Z" level=info msg="Server configuration reloaded on :443"
time="2018-01-11T00:04:35Z" level=info msg="Server configuration reloaded on :80"
time="2018-01-11T00:04:35Z" level=error msg="map[kube.k8s.janky.solutions:[kube.k8s.janky.solutions] acme: Could not determine solvers]"
time="2018-01-11T00:04:35Z" level=error msg="Error getting ACME certificates [kube.k8s.janky.solutions] : Cannot obtain certificates map[kube.k8s.janky.solutions:[kube.k8s.janky.solutions] acme: Could not determine solvers]+v"
time="2018-01-11T00:04:35Z" level=error msg="map[prometheus.k8s.janky.solutions:[prometheus.k8s.janky.solutions] acme: Could not determine solvers]"
time="2018-01-11T00:04:35Z" level=error msg="Error getting ACME certificates [prometheus.k8s.janky.solutions] : Cannot obtain certificates map[prometheus.k8s.janky.solutions:[prometheus.k8s.janky.solutions] acme: Could not determine solvers]+v"
time="2018-01-11T00:04:35Z" level=error msg="map[traefik.k8s.janky.solutions:[traefik.k8s.janky.solutions] acme: Could not determine solvers]"
time="2018-01-11T00:04:35Z" level=error msg="Error getting ACME certificates [traefik.k8s.janky.solutions] : Cannot obtain certificates map[traefik.k8s.janky.solutions:[traefik.k8s.janky.solutions] acme: Could not determine solvers]+v"
time="2018-01-11T00:04:36Z" level=error msg="map[alertmanager.k8s.janky.solutions:[alertmanager.k8s.janky.solutions] acme: Could not determine solvers]"
time="2018-01-11T00:04:36Z" level=error msg="Error getting ACME certificates [alertmanager.k8s.janky.solutions] : Cannot obtain certificates map[alertmanager.k8s.janky.solutions:[alertmanager.k8s.janky.solutions] acme: Could not determine solvers]+v"
time="2018-01-11T00:04:36Z" level=error msg="map[grafana.k8s.janky.solutions:[grafana.k8s.janky.solutions] acme: Could not determine solvers]"
time="2018-01-11T00:04:36Z" level=error msg="Error getting ACME certificates [grafana.k8s.janky.solutions] : Cannot obtain certificates map[grafana.k8s.janky.solutions:[grafana.k8s.janky.solutions] acme: Could not determine solvers]+v"
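For anyone wanting to repeat this kind of log triage, here is a minimal Python sketch that drops debug lines and counts ACME failures per domain. The function name `failing_domains` and the regular expression are my own assumptions, written against the log lines quoted in this thread rather than any documented Traefik log format.

```python
import re
from collections import Counter

# Assumption: the pattern is derived only from the log lines quoted in this
# thread, e.g. msg="Error getting ACME certificates [kube.example.com] : ..."
ACME_ERROR = re.compile(
    r"Error getting ACME certificates? (?:for domain )?\[([^\]]+)\]"
)


def failing_domains(lines):
    """Count ACME failures per domain, skipping debug-level lines."""
    counts = Counter()
    for line in lines:
        if "level=debug" in line:
            continue  # same effect as grepping out the debug lines
        match = ACME_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    'time="2018-01-11T00:04:35Z" level=info msg="Server configuration reloaded on :443"',
    'time="2018-01-11T00:04:35Z" level=error msg="Error getting ACME certificates '
    "[kube.k8s.janky.solutions] : Cannot obtain certificates "
    "map[kube.k8s.janky.solutions:[kube.k8s.janky.solutions] "
    'acme: Could not determine solvers]+v"',
    'time="2018-01-11T16:51:12Z" level=debug msg="Loading ACME certificates [kube.kahkhang.me]..."',
    'time="2018-01-11T16:51:12Z" level=error msg="Error getting ACME certificate for domain '
    "[kahkhang.me]: Cannot obtain certificates map[kahkhang.me:[kahkhang.me] "
    'acme: Could not determine solvers]+v"',
]

for domain, count in sorted(failing_domains(sample).items()):
    print(domain, count)
```

To run it against a real pod, the output of kubectl logs for the traefik pod could be saved to a file and its lines passed to `failing_domains`.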
I'm starting to wonder if it's related to the ongoing Let's Encrypt issue where TLS-SNI validation is disabled.
Interesting, I didn't know it was affecting production as well; I thought it was more of a staging v2 environment issue. That is most likely the cause then 😞 (possibly watch https://github.com/jetstack/kube-lego/issues for more updates).
Since it's only the TLS-SNI method that's been disabled, I'm looking into options for DNS or HTTP validation. Traefik doesn't seem to support HTTP validation, but it does support DNS validation and even has a plugin for Linode. Is the Linode API key provided at the beginning stored as a secret in kubernetes? I couldn't find it by skimming the list of secrets, but I may have just missed it.
Ah, I didn't know this feature was available; I'll look into it more. Unfortunately, no, it is not stored as a secret, but that would be an interesting idea.
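As a rough illustration of the idea, the sketch below builds a Kubernetes Secret manifest in Python. The secret name `linode-api`, the `default` namespace, and the key `LINODE_API_KEY` are all hypothetical placeholders; nothing in this project currently creates such a secret. The only Kubernetes fact relied on is that values under `data` in a Secret must be base64-encoded.

```python
import base64
import json


def secret_manifest(name, namespace, values):
    """Build an Opaque Secret manifest; values under `data` are base64-encoded."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in values.items()
        },
    }


# Hypothetical names and a placeholder value, for illustration only.
manifest = secret_manifest("linode-api", "default", {"LINODE_API_KEY": "replace-me"})
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with kubectl apply -f, since kubectl accepts JSON manifests as well as YAML.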
Can confirm that I'm getting these errors as well:
time="2018-01-11T16:51:12Z" level=error msg="Error getting ACME certificate for domain [kahkhang.me]: Cannot obtain certificates map[kahkhang.me:[kahkhang.me] acme: Could not determine solvers]+v"
time="2018-01-11T16:51:12Z" level=info msg="Retrieved ACME certificates"
time="2018-01-11T16:51:12Z" level=debug msg="Testing certificate renew..."
time="2018-01-11T16:51:12Z" level=debug msg="LoadCertificateForDomains [kube.kahkhang.me]..."
time="2018-01-11T16:51:12Z" level=debug msg="Look for provided certificate to validate [kube.kahkhang.me]..."
time="2018-01-11T16:51:12Z" level=debug msg="No provided certificate found for domains [kube.kahkhang.me], get ACME certificate."
time="2018-01-11T16:51:12Z" level=debug msg="Loading ACME certificates [kube.kahkhang.me]..."
time="2018-01-11T16:51:12Z" level=error msg="map[kube.kahkhang.me:[kube.kahkhang.me] acme: Could not determine solvers]"
time="2018-01-11T16:51:12Z" level=error msg="Error getting ACME certificates [kube.kahkhang.me] : Cannot obtain certificates map[kube.kahkhang.me:[kube.kahkhang.me] acme: Could not determine solvers]+v"
time="2018-01-11T16:51:12Z" level=debug msg="LoadCertificateForDomains [traefik.kahkhang.me]..."
time="2018-01-11T16:51:12Z" level=debug msg="Look for provided certificate to validate [traefik.kahkhang.me]..."
time="2018-01-11T16:51:12Z" level=debug msg="No provided certificate found for domains [traefik.kahkhang.me], get ACME certificate."
time="2018-01-11T16:51:12Z" level=debug msg="Loading ACME certificates [traefik.kahkhang.me]..."
time="2018-01-11T16:51:13Z" level=error msg="map[traefik.kahkhang.me:[traefik.kahkhang.me] acme: Could not determine solvers]"
time="2018-01-11T16:51:13Z" level=error msg="Error getting ACME certificates [traefik.kahkhang.me] : Cannot obtain certificates map[traefik.kahkhang.me:[traefik.kahkhang.me] acme: Could not determine solvers]+v"
Traefik seems to have some updates regarding this issue (https://twitter.com/traefikproxy), but I'll need to sift through them. Hopefully this gets resolved soon, since it's an upstream issue.
Rook is also crashing because of https://github.com/rook/rook/issues/1330, a fix is on the way.
As a workaround for now, to access the kubernetes dashboard, execute kubectl proxy
then access http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/
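The URL above follows the API server's service proxy path layout, so the same pattern should work for other services. Here is a small Python sketch that assembles such a URL; the helper name `service_proxy_url` is my own, and it assumes kubectl proxy is listening on its default address, localhost:8001.

```python
def service_proxy_url(namespace, service, scheme="", port="", host="localhost:8001"):
    """Build the URL for reaching a service through a local `kubectl proxy`."""
    if scheme:
        # With a scheme the path segment is "scheme:service:port";
        # the port may be left empty, as in the dashboard URL above.
        target = f"{scheme}:{service}:{port}"
    elif port:
        target = f"{service}:{port}"
    else:
        target = service
    return f"http://{host}/api/v1/namespaces/{namespace}/services/{target}/proxy/"


dashboard_url = service_proxy_url("kube-system", "kubernetes-dashboard", scheme="https")
print(dashboard_url)
```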
Once this is merged https://github.com/containous/traefik/pull/2701 I'll be able to fix this :)
When do you think this can be fixed? I can't deploy anything now because I can't get traefik to work with any of my domains. I'm wondering if there is something I can do.
You can disable https for now by changing the traefik manifest. The fix will be available once a newer version of Traefik is released (probably by Tuesday).
I just discovered this project; it's very cool and I'm excited to use it. I ran it, and according to the terminal output of the script it worked, but it doesn't seem to be fully functional. I can use kubectl from my local machine to interact with the cluster and I can see the dashboard with kubectl proxy, but https://kube.example.com doesn't load. Firefox says:
This seems to be the case for all of the services that are listed in the README.
For the root domain, and seemingly any subdomain not listed in the README, I get an invalid, self-signed cert for 8629e935e621618727d9710d574650c6.ec78bb20354254b7355bc5f0e895c417.traefik.default.
Possibly related, possibly unrelated: I noticed when browsing around in the dashboard that the rook-agent pods didn't come up, saying:
And one of the rook-ceph-osd pods failed to come up, saying:
I'm not really sure what is inter-related, or how to proceed. I'm happy to post any logs that are needed.