The cluster can sometimes get into a weird state where pods are no longer discoverable via the k8s service hostnames...
My guess is there's a DNS cache somewhere that doesn't get expired when pods are scheduled onto new hosts. It presumably lives on the k8s side, since the problem persists even when we restart the pods (sometimes the pods can't even talk to each other!). This affects ZK too, unfortunately...
Some ideas for how to fix this:
- Create a headless service per pod (we don't have many pods anyway) and advertise that hostname instead; see the sketch after this list.
- Expose each broker via a NodePort service, which would also make it reachable from outside k8s (!). I've read that NodePort traffic all goes through the master, but I'm not sure that's true. If it is, we're probably OK until someone starts forwarding syslog to Kafka and then 30 days later some machine horribly breaks and spits out something like 10 MB/s to syslog. A sketch of this variant follows as well.
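A minimal sketch of the per-pod headless service idea. All names here (kafka-0, the app/pod labels, port 9092) are assumptions for illustration, not our actual manifests:

```yaml
# Headless service targeting a single broker pod.
# Names and labels (kafka-0, app: kafka) are made up for this sketch.
apiVersion: v1
kind: Service
metadata:
  name: kafka-0
spec:
  clusterIP: None        # headless: DNS resolves straight to the pod IP, no virtual IP to go stale
  selector:
    app: kafka
    pod: kafka-0         # per-pod label so this service maps to exactly one broker
  ports:
    - name: broker
      port: 9092
```

The broker would then advertise something like kafka-0.<namespace>.svc.cluster.local; since there's no cluster IP in the middle, a rescheduled pod just shows up under the same name with its new IP.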
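And a sketch of the NodePort variant, again with made-up names; the 30092 value is an arbitrary pick from the default 30000-32767 NodePort range:

```yaml
# NodePort service exposing one broker outside the cluster.
# kafka-0-external and the selector labels are assumptions for this sketch.
apiVersion: v1
kind: Service
metadata:
  name: kafka-0-external
spec:
  type: NodePort
  selector:
    app: kafka
    pod: kafka-0
  ports:
    - name: broker
      port: 9092
      nodePort: 30092    # reachable as <any node IP>:30092 from outside k8s
```

For what it's worth, my understanding is that NodePort traffic is handled by kube-proxy on whichever node receives it (possibly with an extra hop to the node actually running the pod), not routed through the master, but worth verifying before betting on it.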