kontena / pharos-cluster

Pharos - The Kubernetes Distribution
https://k8spharos.dev/
Apache License 2.0

apis/metrics.k8s.io/v1beta1 - HTTP 503 #618

Closed wolfedale closed 5 years ago

wolfedale commented 5 years ago

After upgrading to the latest version of pharos-cluster I'm getting this error from my pipeline:

==> Configuring addons ...
==> Enabling addon my-ingress-nginx
K8s::Error::ServiceUnavailable : GET /apis/metrics.k8s.io/v1beta1 => HTTP 503 Service Unavailable: service unavailable
ERROR: Job failed: exit code 1

I'm not sure where this metrics request comes from; I have nothing special in my ingress.

jakolehm commented 5 years ago

Could you check if metrics-server is actually running in kube-system namespace?
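
Something like this should show both the deployment and whether the aggregated API it serves actually answers (a quick sketch, assuming the default deployment and APIService names):

$ kubectl -n kube-system get deployment metrics-server
$ kubectl get apiservice v1beta1.metrics.k8s.io
$ kubectl get --raw /apis/metrics.k8s.io/v1beta1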

wolfedale commented 5 years ago

Yes, looks like it's running

metrics-server-57b998f5fc-mpg9w 1/1 Running 0 13m

wolfedale commented 5 years ago

OK, this is really strange. I ran my pipeline a second time and now it's working, but I ran it 3-4 times before and it failed every time. Maybe the metrics-server was indeed still starting, or for some reason wasn't ready yet.

wolfedale commented 5 years ago

It happened again; in fact I can reproduce it every time I create a new cluster. I also tracked down the issue, which is, as you suggested, metrics-server not running yet. It looks like my metrics-server takes a while to start, and once it's ready and I re-run pharos, everything works. So I guess the solution might be to check whether metrics-server is running and wait around a minute for it.
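
Something along these lines would probably be enough as the wait (a rough sketch; the deployment becoming available may not mean the aggregated API answers yet, hence also waiting on the APIService):

$ kubectl -n kube-system rollout status deployment/metrics-server --timeout=60s
$ kubectl wait --for=condition=Available apiservice/v1beta1.metrics.k8s.io --timeout=60s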

jakolehm commented 5 years ago

I have now seen this too; created #628 to brute-force it through. I think it might be related to the speed of the network deployment, meaning: if the weave/calico deployment comes up slowly, then metrics-server will most likely throw errors for a while. @jnummelin WDYT?

jnummelin commented 5 years ago

Not sure if it's network related or just a bug/feature in the kube API, as we've seen with CRDs too. But no matter the cause, I don't currently see any other way than retrying it.
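
For anyone working around this from their own pipeline in the meantime, the retry is essentially just polling the aggregated API until it answers (a rough sketch, timings are arbitrary):

until kubectl get --raw /apis/metrics.k8s.io/v1beta1 >/dev/null 2>&1; do
  echo "metrics API not ready yet, retrying..."
  sleep 5
done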

jakolehm commented 5 years ago

1.3.3 includes a couple of fixes for this; @wolfedale could you test it & report the results?

kdomanski commented 5 years ago

I'm running into the same issue during initial cluster creation. I can reproduce with v1.3.2 and v1.3.3 but not with earlier versions. Now bisecting between v1.3.1 and v1.3.2.

jakolehm commented 5 years ago

@kdomanski v1.3.3 should retry (with backoff) ... it didn't help here?

kdomanski commented 5 years ago

Right, bisect points to 93b32a63b5352662992c3c22643c416bfa521f38

kdomanski commented 5 years ago

v1.3.3 retries, but each retry yields the same result. Probably the instance I'm running on is so slow that the retry is not enough. Also, the logs of the apiserver mention API rate limiting.
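
The APIService status should show why the apiserver keeps answering 503 for it (typically a FailedDiscoveryCheck while the metrics-server endpoints are unreachable):

$ kubectl describe apiservice v1beta1.metrics.k8s.io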

voneiden commented 5 years ago

Any news on this? Like kdomanski, I'm unable to create a cluster with >1.3.1; it hangs on the configuring metrics server step. Running CentOS-7-x86_64-Minimal-1804.

jakolehm commented 5 years ago

@voneiden @kdomanski do you have any worker nodes in cluster.yml?

voneiden commented 5 years ago

Yes, one worker. With 1.3.1 it goes up without issues.

jakolehm commented 5 years ago

@voneiden could you check if metrics-server deployment is actually running in the cluster?

voneiden commented 5 years ago

Everything seems to start

$ kubectl -n kube-system get pods
NAME                                  READY     STATUS    RESTARTS   AGE
coredns-69c75956f7-qwn5n              1/1       Running   0          2m
coredns-69c75956f7-zv5r8              1/1       Running   0          2m
etcd-kube-master                      1/1       Running   8          1m
kube-apiserver-kube-master            1/1       Running   0          1m
kube-controller-manager-kube-master   1/1       Running   0          1m
kube-proxy-bhl86                      1/1       Running   0          2m
kube-proxy-g547j                      1/1       Running   0          2m
kube-scheduler-kube-master            1/1       Running   0          1m
metrics-server-57b998f5fc-94wvk       1/1       Running   0          2m
pharos-proxy-kube-worker              1/1       Running   0          1m
weave-net-58d2s                       2/2       Running   0          2m
weave-net-c22lz                       2/2       Running   1          2m
$ kubectl -n kube-system get svc
NAME             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
kube-dns         ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP   6m
metrics-server   ClusterIP   10.104.224.225   <none>        443/TCP         5m

The metrics-server service ClusterIP responds to curl from the worker node; from the master node I get 'no route to host'.

For comparison, on 1.3.1 curl gets 'connection refused' on the master node and 'no route to host' from the worker node. Not sure if that's relevant.
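
For reference, the test was just hitting the service ClusterIP from each node directly (certificate checks skipped, only interested in reachability):

$ curl -k https://10.104.224.225/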

jakolehm commented 5 years ago

@voneiden check the logs from the weave-net pods; this starts to sound like a network issue.
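
e.g. (pod names from your listing above; the weave-net pods have separate weave and weave-npc containers, and the weave script can also report connection status):

$ kubectl -n kube-system logs weave-net-58d2s -c weave
$ kubectl -n kube-system exec weave-net-58d2s -c weave -- /home/weave/weave --local status connections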

voneiden commented 5 years ago

Definitely relevant. At least in my case it seems to boil down to not having set trusted_subnets; weave couldn't operate properly in my cluster without it. With that set, 1.3.3 starts fine (the metrics step takes a bit longer than in 1.3.1, as was pointed out earlier).

jnummelin commented 5 years ago

If enabling trusted_subnets fixes it, it really sounds like IPsec ESP traffic is blocked between the nodes.

jakolehm commented 5 years ago

@kdomanski could you check if trusted_subnets helps?

kdomanski commented 5 years ago

@jakolehm I can no longer reproduce the issue. The metrics server is now coming up before the retry times out.