runesl closed 6 years ago
Thanks for reporting. We will be looking into it soon.
The good news is that I was able to replicate it. The bad news is that a few minutes later, as soon as I started a network capture, the issue vanished and I was able to call the service from the pod it points to.
So... it seems to be an issue with the overlay. I'll follow up with the networking team and provide more details as soon as possible.
So it's not an overlay issue after all. It's caused by us not setting hairpin mode correctly on the bridge ports, which is why a pod can't reach itself through its own service IP.
This Kubernetes ticket provided the clue: https://github.com/kubernetes/kubernetes/issues/13375
As a workaround until we release a proper fix, I suggest setting it manually on each kubelet node. For example:
$ dcos task exec -ti kube-node-N-kubelet bash
...
$ for intf in /sys/devices/virtual/net/m-dcos/brif/*; do echo 1 > ${intf}/hairpin_mode; done
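To confirm the workaround took effect, something like the following could be run in the same kubelet shell. This is only a minimal sketch: `check_hairpin` is a hypothetical helper name, and it assumes the bridge is `m-dcos` as in the commands above. It prints the current `hairpin_mode` of every bridge port and returns non-zero if any port is not set to 1.

```shell
# Hypothetical helper (not part of DC/OS or Kubernetes): report hairpin_mode
# for every port on a bridge. The default path assumes the m-dcos bridge
# used in the workaround; pass another brif directory as the first argument.
check_hairpin() {
  brif_dir="${1:-/sys/devices/virtual/net/m-dcos/brif}"
  status=0
  for intf in "$brif_dir"/*; do
    # Skip entries without a hairpin_mode file (also covers an unmatched glob).
    [ -e "$intf/hairpin_mode" ] || continue
    mode=$(cat "$intf/hairpin_mode")
    echo "$(basename "$intf"): hairpin_mode=$mode"
    # Remember that at least one port does not have hairpin mode enabled.
    [ "$mode" = "1" ] || status=1
  done
  return "$status"
}
```

Calling `check_hairpin` with no arguments inspects the `m-dcos` bridge. Note that ports created after the workaround was applied (for example, for newly scheduled pods) may need the setting applied again.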
Thank you. The workaround works, and with it I'm able to deploy and successfully run Flink 1.5 on kubernetes on DCOS.
@rsltrifork We enabled hairpin mode by default in the new release. Can you test it and then close the ticket?
fixed
I installed Flink 1.5 on kubernetes on DC/OS, but it seems the jobmanager pod cannot call itself through the jobmanager service, which means that Flink cannot deploy jars and the cluster is useless.
I wonder if there is a bug in the DC/OS overlay network integration with kubernetes? I'm pretty sure this should work on a normally working kubernetes cluster.
Calling the jobmanager from a taskmanager pod works fine:
Calling the jobmanager service URL from the jobmanager pod fails with a timeout. Notice the same service IP is used, so DNS seems to be working fine:
Calling jobmanager on localhost from jobmanager pod works: