gluster / gluster-kubernetes

GlusterFS Native Storage Service for Kubernetes
Apache License 2.0

gk-deploy script generates issues at network layer whilst running K8's and Tungsten Fabric #552

Closed haji-haji-haji closed 5 years ago

haji-haji-haji commented 5 years ago

Hey guys,

Whilst running gk-deploy we notice that it causes some weirdness at the network layer that only an --abort and a reboot of the whole K8s cluster can resolve.

A few other things to point out:

1) We have defined 3 servers in the template, but only 2 are used when gk-deploy builds glusterfs (is this by design or another bug?).
2) gk-deploy requires the template to be in the same directory as itself, rather than in the documented child path.
3) The cubes do deploy the services, but they are inaccessible via the overlay; only by connecting via bash and curling localhost on 8080/hello can we see that it has actually started.

Output from K8's:

Output from K8's.txt

Hello from Heketi inside cube:

[root@lab-sdn-tungsten01 gkdeploy]# kubectl exec -n glusterfs -it deploy-heketi-7676968dcb-mkts8 -- /bin/bash
[root@deploy-heketi-7676968dcb-mkts8 /]# curl localhost:8080/hello

No response out of cube:

[root@lab-sdn-tungsten01 ~]# curl 10.47.255.252:8080/hello
curl: (7) Failed connect to 10.47.255.252:8080; Connection refused
[root@lab-sdn-tungsten01 ~]# curl 10.47.255.252:8080/hello
curl: (7) Failed connect to 10.47.255.252:8080; Connection refused

Output from gk-deploy:

Output from gk-deploy.txt

All help is appreciated!

~Haji

phlogistonjohn commented 5 years ago

1. we have defined 3 servers in the template but only 2 are used whilst building glusterfs in gk-deploy (is this by design or another bug).

You may want to elaborate on how you determined that only 2 are used, but if so, that does not sound like expected behavior. How many unique nodes do you have? I assume by template you mean the topology file?
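
One way to check how many nodes actually ended up in use is sketched below. It is only an illustration: the namespace and pod name are copied from the transcript in this issue, the function names are made up for the example, and it assumes heketi answers on localhost:8080 inside its pod, as the curl above suggests. If an admin key was set during deployment, heketi-cli will also need --user and --secret.

```shell
# Namespace and pod name as they appear in this issue; adjust for your cluster.
NAMESPACE=glusterfs
HEKETI_POD=deploy-heketi-7676968dcb-mkts8

# List the pods in the namespace together with the node each one runs on.
gluster_pods() {
    kubectl get pods -n "$NAMESPACE" -o wide
}

# Ask heketi which nodes and devices it currently knows about.
heketi_topology() {
    kubectl exec -n "$NAMESPACE" "$HEKETI_POD" -- \
        heketi-cli -s http://localhost:8080 topology info
}
```

Running gluster_pods and then heketi_topology shows whether the third server is missing at the pod level or only in what heketi has loaded.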

2. the gk-deploy requires the template in the same directory as itself versus the documented child path

If this is related to the other issue you recently filed and you are referring to the topology file, that is the normal behavior IMO. The topology file is distinct from the k8s templates and exists to drive functionality in the heketi component directly. k8s is not aware of this file and does not need to be.

3. the cubes do deploy the services but are inaccessible via the overlay and only connecting via bash and curl self on 8080/hello we can see its actually started.

I am not entirely sure what you mean by this. Based on my experience I would think you would get 4 pods, 1 heketi pod and 3 gluster pods. Only the heketi pod provides the /hello URL so I assume this point is heketi specific. IIRC the getting started guides either have you using an IP or a kubectl proxy command. I don't think the system sets up anything additional on the networking for heketi but I am not 100% certain of this.
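
As a rough sketch of a check that does not depend on the pod IP being routable from the machine running kubectl, a port-forward can be used instead of curling the pod address directly. The deployment name deploy-heketi is an assumption inferred from the pod name in this issue, and heketi_hello is a hypothetical helper name.

```shell
# Names inferred from the pod shown in this issue (assumption); adjust as needed.
NAMESPACE=glusterfs
DEPLOYMENT=deploy-heketi
LOCAL_PORT=8080

# Forward a local port to port 8080 of the heketi deployment, request /hello,
# then stop the forward again.
heketi_hello() {
    kubectl port-forward -n "$NAMESPACE" "deployment/$DEPLOYMENT" \
        "$LOCAL_PORT:8080" >/dev/null 2>&1 &
    pf_pid=$!
    sleep 2   # crude wait for the tunnel to come up
    curl --silent --show-error --max-time 5 "http://localhost:$LOCAL_PORT/hello"
    rc=$?
    kill "$pf_pid" 2>/dev/null
    return "$rc"
}
```

If heketi_hello gets a response while a direct curl to the pod IP is refused, the service itself is up and the problem is more likely in the overlay network than in heketi.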

From the outputs provided it certainly looks like there's an issue with bringing up the pod as the liveness and readiness probes fail.

I'm not very knowledgeable about k8s networking, but I would start by looking through the logs on the node where the pod was started to see whether anything points to failures setting up networking for the pod, or to other CNI-related problems.
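
A minimal sketch of where to start looking is given below. The pod name is the one from this issue, the function names are hypothetical, and the journalctl call assumes the kubelet runs as a systemd unit named kubelet on the node, which may not match every setup.

```shell
# Pod and namespace as shown in this issue; adjust for your cluster.
NAMESPACE=glusterfs
POD=deploy-heketi-7676968dcb-mkts8

# Show probe failures and other events recorded for the pod and its namespace.
pod_events() {
    kubectl describe pod -n "$NAMESPACE" "$POD"
    kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp
}

# Run on the node hosting the pod: filter recent kubelet logs for
# CNI or network related messages (assumes a systemd-managed kubelet).
node_network_logs() {
    journalctl -u kubelet --since "1 hour ago" --no-pager | grep -iE 'cni|network'
}
```

pod_events can be run from anywhere kubectl works, while node_network_logs has to run on the node itself; the NODE column of kubectl get pods -o wide shows which node that is.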

haji-haji-haji commented 5 years ago

The cause was a separate, unrelated networking issue that I have since resolved, and the script operates as it should.