kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

Spark executors error out because driver pod is not ready #1200

Open shamukh82 opened 3 years ago

shamukh82 commented 3 years ago

We are facing an issue where the executor pods go into an error state right after coming up. This is the error we see in the pods: `java.net.UnknownHostException: XXXX-XXXXX-driver-svc.help-XXXX-XXXX-usw2-e2e.svc`. Only a couple of executors launched at the very end are able to talk to the driver and end up in the running state.

We checked the timeline of the various stages the driver and executor went through:

Pod creation timeline for testapp (executors and driver):

    testapp-6631c8781e4df39c-exec-1    2021-03-10T22:42:09Z
    testapp-6631c8781e4df39c-exec-2    2021-03-10T22:42:09Z
    testapp-6631c8781e4df39c-exec-3    2021-03-10T22:42:09Z
    testapp-6631c8781e4df39c-exec-4    2021-03-10T22:42:09Z
    testapp-6631c8781e4df39c-exec-5    2021-03-10T22:42:09Z
    testapp-6631c8781e4df39c-exec-6    2021-03-10T22:42:18Z
    testapp-6631c8781e4df39c-exec-7    2021-03-10T22:42:18Z
    testapp-6631c8781e4df39c-exec-8    2021-03-10T22:42:18Z
    testapp-6631c8781e4df39c-exec-9    2021-03-10T22:42:18Z
    testapp-6631c8781e4df39c-exec-10   2021-03-10T22:42:18Z
    testapp-6631c8781e4df39c-exec-11   2021-03-10T22:46:42Z
    testapp-6631c8781e4df39c-exec-12   2021-03-10T22:46:42Z
    testapp-6631c8781e4df39c-exec-13   2021-03-10T22:46:42Z
    testapp-6631c8781e4df39c-exec-14   2021-03-10T22:46:42Z
    testapp-6631c8781e4df39c-exec-15   2021-03-10T22:46:42Z

Based on the pod status of XXXX-pipeline-driver, the driver pod was only ready at 22:43:10:

    status:
      conditions:

For pods behind a headless service, the pod IP comes back in the DNS response, and DNS is only updated when the Endpoints object is updated. The endpoint is marked ready only when the pod is ready.
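
For context, the driver service that Spark creates is headless, so its DNS record points straight at the driver pod. A rough sketch of what it looks like (the name, selector, and ports here are illustrative, not taken from this cluster):

    apiVersion: v1
    kind: Service
    metadata:
      name: XXXX-XXXXX-driver-svc     # Spark names it <driver-pod-name>-svc
      namespace: help-XXXX-XXXX-usw2-e2e
    spec:
      clusterIP: None                 # headless: DNS resolves to the pod IP directly
      selector:
        spark-role: driver            # illustrative selector for the driver pod
      ports:
      - name: driver-rpc-port
        port: 7078
      - name: blockmanager
        port: 7079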

I believe the executor pods for testapp would need to wait for XXXX-pipeline-driver to be ready before they are deployed. Is there a way we can handle this?

shamukh82 commented 3 years ago

Our workaround for now is to use an init container to introduce a delay (a sketch follows), but we would like to know if there's a better way (Spark configs, etc.) to handle this.
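
A minimal sketch of that init-container approach, polling DNS rather than sleeping for a fixed amount (the `DRIVER_SVC` value and the busybox image are illustrative; adjust to wherever your executor pod spec lets you inject init containers):

    initContainers:
    - name: wait-for-driver-dns
      image: busybox:1.36        # any small image with nslookup works
      command:
      - sh
      - -c
      # Poll until the driver's headless service name resolves, i.e. until
      # the driver pod is ready, instead of sleeping for a fixed delay.
      - |
        for i in $(seq 1 30); do
          nslookup "$DRIVER_SVC" >/dev/null 2>&1 && exit 0
          sleep 2
        done
        echo "driver service $DRIVER_SVC never resolved" >&2
        exit 1
      env:
      - name: DRIVER_SVC         # set to your driver service hostname
        value: XXXX-XXXXX-driver-svc.help-XXXX-XXXX-usw2-e2e.svc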

jdonnelly-apixio commented 3 years ago

I'm not sure I hit this same issue, but if you see an unknown-host error caused by a DNS request timeout in the Spark executor logs, I think it would be the same issue I saw.

I forget the exact cause, but I believe there is a race condition where a new executor sends multiple DNS requests in parallel, only one of them gets a response, and the executor's DNS lookup for the driver pod times out. A workaround is to add this to the executor pod spec:

    dnsConfig:
      options:
      # Lower ndots so names with two or more dots are tried as absolute
      # names first, cutting down on search-domain lookups.
      - name: ndots
        value: "2"
      # Enable the EDNS0 DNS extensions in resolver queries.
      - name: edns0
      # Open a fresh socket for the second (AAAA) lookup instead of reusing
      # the first one, sidestepping the conntrack race described below.
      - name: single-request-reopen
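
For a SparkApplication manifest, this would go under the executor spec, assuming your operator version exposes `dnsConfig` on the driver/executor pod spec (worth checking against your installed CRD). A hedged sketch; the image, version, and resource sizes are placeholders:

    apiVersion: sparkoperator.k8s.io/v1beta2
    kind: SparkApplication
    metadata:
      name: testapp
    spec:
      type: Scala
      mode: cluster
      image: my-registry/spark:3.1.1   # placeholder image
      mainClass: org.apache.spark.examples.SparkPi
      mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
      sparkVersion: "3.1.1"
      driver:
        cores: 1
        memory: 2g
      executor:
        instances: 15
        cores: 1
        memory: 4g
        dnsConfig:                     # same resolver options as above,
          options:                     # applied to every executor pod
          - name: ndots
            value: "2"
          - name: edns0
          - name: single-request-reopen
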
tjulinfan commented 2 years ago

Hi there, we encountered the same issue. Does anyone know the root cause?

yangl900 commented 1 year ago

> Hi there, we encountered the same issue. Does anyone know the root cause?

If @jdonnelly-apixio's workaround works for you, this is the root cause: https://www.weave.works/blog/racy-conntrack-and-dns-lookup-timeouts. In short: the A and AAAA lookups go out in parallel over the same UDP socket, conntrack races on the two packets and drops one of them, so that lookup never gets an answer and times out; `single-request-reopen` avoids the race by sending the second request on a new socket.