Pod cannot call its own service

runesl commented 6 years ago

I installed flink 1.5 on kubernetes on DC/OS, but it seems the jobmanager pod cannot call itself through the jobmanager service, which means that flink cannot deploy jars and the cluster is useless.

I wonder if there is a bug in the DC/OS overlay network integration with kubernetes? I'm pretty sure this should work on a normally working kubernetes cluster.

$ kubectl get pods
NAME                                 READY   STATUS    RESTARTS   AGE
flink-jobmanager-75bbb96f4d-p5fh8    1/1     Running   0          1h
flink-taskmanager-7679c9d55d-dzcrk   1/1     Running   0          1h
flink-taskmanager-7679c9d55d-z9lsp   1/1     Running   0          1h

$ kubectl get svc flink-jobmanager
NAME               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                               AGE
flink-jobmanager   ClusterIP   10.100.242.186   <none>        6123/TCP,6124/TCP,6125/TCP,8081/TCP   1h

Calling the jobmanager from a taskmanager pod works fine:

$ kubectl exec -it flink-taskmanager-7679c9d55d-z9lsp /bin/bash
root@flink-taskmanager-7679c9d55d-z9lsp:/opt/flink# curl flink-jobmanager:8081 -v
* Rebuilt URL to: flink-jobmanager:8081/
*   Trying 10.100.242.186...
* TCP_NODELAY set
* Connected to flink-jobmanager (10.100.242.186) port 8081 (#0)
> GET / HTTP/1.1
> Host: flink-jobmanager:8081
> User-Agent: curl/7.52.1
> Accept: */*
>
< HTTP/1.1 200 OK

Calling the jobmanager service URL from the jobmanager pod itself fails with a timeout. Notice that the name resolves to the same service IP, so DNS seems to be working fine:

$ kubectl exec -it flink-jobmanager-75bbb96f4d-p5fh8 /bin/bash
root@flink-jobmanager-75bbb96f4d-p5fh8:/opt/flink# curl flink-jobmanager:8081 -v
* Rebuilt URL to: flink-jobmanager:8081/
*   Trying 10.100.242.186...
* TCP_NODELAY set

Calling jobmanager on localhost from jobmanager pod works:

root@flink-jobmanager-75bbb96f4d-p5fh8:/opt/flink# curl localhost:8081 -v
* Rebuilt URL to: localhost:8081/
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8081 (#0)
> GET / HTTP/1.1
> Host: localhost:8081
> User-Agent: curl/7.52.1
> Accept: */*
>
< HTTP/1.1 200 OK
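
For anyone repeating this comparison, a bounded probe may be easier to work with than a plain `curl -v`, which hangs until it times out on the failing path. The following is only a minimal sketch: the function name `probe` and the 5-second default are assumptions made for illustration, not something from this setup.

```shell
# Probe a host:port with a bounded wait, so an unreachable service reports
# FAIL quickly instead of hanging. `probe` is a hypothetical helper name.
probe() {
    target="$1"
    timeout_s="${2:-5}"   # seconds; an arbitrary default, adjust as needed
    # Any completed HTTP exchange counts as reachable, whatever the status code.
    if curl --silent --output /dev/null --max-time "$timeout_s" "http://$target/"; then
        echo "OK $target"
    else
        echo "FAIL $target"
    fi
}

# Run inside the jobmanager pod, for example:
#   probe flink-jobmanager:8081   # through the service IP
#   probe localhost:8081          # bypassing the service
```

If the service path reports FAIL while localhost reports OK, and the same service name works from another pod, the pattern matches the one described in this report.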

pires commented 6 years ago

Thanks for reporting. We will be looking into it soon.

sreis commented 6 years ago

The good news is that I was able to replicate it. The bad news is that a few minutes later, as soon as I started doing a network capture, the issue vanished and I was able to call the service from the pod it's pointing to.

So... it seems to be an issue with the overlay. I'll follow up with the networking team and provide more details as soon as possible.

sreis commented 6 years ago

So it's not an overlay issue; it's caused by us not setting hairpin mode correctly.

This ticket in kubernetes provided the clue: https://github.com/kubernetes/kubernetes/issues/13375

As a workaround until we release a proper fix, I suggest setting it manually, for example like this:

$ dcos task exec -ti kube-node-N-kubelet bash
...
$ for intf in /sys/devices/virtual/net/m-dcos/brif/*; do echo 1 > ${intf}/hairpin_mode; done
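
Hairpin mode lets a bridge port send a frame back out of the port it arrived on. A pod reaching itself through its own service IP needs this, because the translated traffic has to return through the pod's own interface. To check which ports currently have the flag set, before or after applying the loop above, something like the following could be used. It is a sketch only: `show_hairpin` is a hypothetical helper name, and the default path simply reuses the `m-dcos` bridge from the workaround.

```shell
# Print "<port> <hairpin_mode>" for every port attached to a Linux bridge.
# The default directory is the m-dcos bridge used in the workaround; pass a
# different brif directory as the first argument to inspect another bridge.
show_hairpin() {
    brif_dir="${1:-/sys/devices/virtual/net/m-dcos/brif}"
    for intf in "$brif_dir"/*; do
        # Skip the unexpanded glob when the bridge has no ports.
        [ -e "$intf/hairpin_mode" ] || continue
        printf '%s %s\n' "$(basename "$intf")" "$(cat "$intf/hairpin_mode")"
    done
}
```

A value of 1 means hairpin mode is enabled on that port, and 0 means it is disabled.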

runesl commented 6 years ago

Thank you. The workaround works, and with it I'm able to deploy and successfully run Flink 1.5 on kubernetes on DCOS.

sreis commented 6 years ago

@rsltrifork We enabled hairpin mode by default in the new release. Can you test it and then close the ticket?

runesl commented 6 years ago

fixed

mesosphere / dcos-kubernetes-quickstart #95