Traefik Mesh - Simpler Service Mesh
https://traefik.io/traefik-mesh
Apache License 2.0

Whoami example not working after cluster restart #384

Closed: riker09 closed this issue 4 years ago

riker09 commented 4 years ago

Yesterday I did a full wipe of my local K3s installation and successfully tested the whoami example afterwards. I could deploy my own Helm chart and use the mesh; however, I ran into an issue (traffic was routed to the wrong service) and had to call it a day.

After starting my cluster today I verified that all pods had started successfully. However, the whoami example is not working anymore.

$ kubectl -n whoami exec whoami-client -- curl -Lv whoami.whoami.maesh
* Expire in 0 ms for 6 (transfer 0x55de5b1c8620)
* Expire in 1 ms for 1 (transfer 0x55de5b1c8620)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Expire in 0 ms for 1 (transfer 0x55de5b1c8620)
* Expire in 1 ms for 1 (transfer 0x55de5b1c8620)
* Expire in 0 ms for 1 (transfer 0x55de5b1c8620)
[...]
* Expire in 0 ms for 1 (transfer 0x55de5b1c8620)
* Could not resolve host: whoami.whoami.maesh
* Expire in 0 ms for 1 (transfer 0x55de5b1c8620)
* Closing connection 0
curl: (6) Could not resolve host: whoami.whoami.maesh
command terminated with exit code 6

I'm not sure whether this is an issue with Maesh or with K3s, but I expected a Kubernetes cluster to survive a restart of the underlying OS.

dtomcej commented 4 years ago

I'm not sure whether this is an issue with Maesh or with K3s, but I expected a Kubernetes cluster to survive a restart of the underlying OS.

Kubernetes is not designed to handle that sort of situation; in production, losing quorum would mean a complete rebuild.

In this case, is DNS resolution broken after the restart, while a clean install works fine?
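
One thing worth checking (a guess on my part, based on Maesh patching the cluster's CoreDNS config so it can serve the .maesh domain) is whether that patch is still in place after the restart, and whether CoreDNS has reloaded it:

$ kubectl -n kube-system get configmap coredns -o yaml | grep -i maesh
$ kubectl -n kube-system rollout restart deployment coredns   # needs kubectl >= 1.15

If the grep comes back empty, the DNS patch was lost during the restart, which would point at Maesh rather than K3s.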

riker09 commented 4 years ago

In this case, is DNS resolution broken after the restart, while a clean install works fine?

Yes, a clean install works fine.

Kubernetes is not designed to handle that sort of situation

How are people developing on Kubernetes clusters if I cannot restart the cluster node? I was under the impression that losing a node should be handled gracefully by Kubernetes. Can I work around this by having at least two nodes, where one is always reachable (e.g. at a cloud provider) and the other is my local workstation (since you mentioned quorum loss)? I apologize for asking all these newbie questions, but that's what I currently am: a complete Kubernetes noob. I'm learning as I go, and I really appreciate all the answers I'm getting here. Thank you!

dtomcej commented 4 years ago

Hello @riker09

How are people developing on Kubernetes clusters if I cannot restart the cluster node?

Most of us use development cluster tools such as k3s for exactly this reason: k3s allows new clusters to be scaffolded quickly. There is another tool, k3d, that exists purely to build k3s clusters quickly, and we use it in our own integration tests!

https://github.com/containous/maesh/blob/master/Makefile#L43
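
As a rough sketch (the cluster name here is arbitrary, and note that the k3d CLI has changed over time: v1 used k3d create / k3d delete, while current releases use the cluster subcommand), a full rebuild is just:

$ k3d cluster delete dev
$ k3d cluster create dev
$ kubectl get nodes

After that, you reinstall Maesh and your charts into the fresh cluster.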

I was under the impression that losing a node should be handled gracefully by Kubernetes.

Correct. Kubernetes can handle (n-1)/2 failures, where n is the number of nodes (counting control-plane and worker nodes separately). If your control plane loses more than half of its nodes, the cluster is unrecoverable. Worker nodes don't have the same quorum requirement, but if you lose more worker nodes than your workload can spare, pods will be unschedulable.
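
To put numbers on it: a 3-node control plane tolerates (3-1)/2 = 1 failure and a 5-node control plane tolerates 2, while a single-node cluster (the default for a local k3s install) tolerates (1-1)/2 = 0, so an unclean restart of that one node can cost you the cluster state.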

Can I work around this by having at least two nodes, where one is always reachable (e.g. at a cloud provider) and the other is my local workstation (since you mentioned quorum loss)?

It's not really feasible for development. I would look at using another scaffolding tool like k3d to rebuild your cluster quickly whenever you break it. I personally use the Kubernetes installation in Docker for Mac for development, but that too has to be rebuilt when I break things.

riker09 commented 4 years ago

Okay, so the solution is to recreate everything from scratch each time. I noticed that Persistent Volumes (and Claims) survive a restart (I guess that's the persistence everybody is talking about :slightly_smiling_face:), so I can store any data I want to keep across reboots in a PV.
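
In case it helps anyone else, this is roughly what I ended up with (the claim name and size are placeholders; local-path is the StorageClass that ships as the default with K3s, if I'm reading the docs right):

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: keep-me                   # placeholder name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi                # placeholder size
  storageClassName: local-path    # K3s's bundled default StorageClass
EOF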

I have already looked at another scaffolding tool, Skaffold. I will spend some time investigating k3d and see how the two compare. In the meantime I guess I will use a combination of helm and kubectl commands and do everything by hand until I'm more familiar with the whole Kubernetes universe.

Thanks again for your kind answers and for taking the time to explain everything.

SantoDE commented 4 years ago

No problem! I'll close this, then. :)