radanalyticsio / spark-operator

Operator for managing the Spark clusters on Kubernetes and OpenShift.

issue with Spark 2.3 when master can't contact the driver, because of wrong/missing fqdn #96

Open · jkremser opened this issue 6 years ago

jkremser commented 6 years ago

if the master and workers have DRIVER_HOST set to the service that points to the driver, it should fix the issue...

resources.yaml should provide some tips

workaround: in a notebook you can reconfigure the spark conf so that the workers can see the driver
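
a minimal sketch of that notebook workaround (the cluster service name `my-spark-cluster` is just a placeholder, and grabbing the notebook pod's own IP is only one way to get an address the workers can resolve):

```python
# hypothetical notebook workaround: point spark.driver.host at an address
# the workers can resolve; here we fall back to the notebook pod's own IP.
import socket

from pyspark import SparkConf
from pyspark.sql import SparkSession

driver_host = socket.gethostbyname(socket.gethostname())  # the notebook pod's IP

conf = (SparkConf()
        .setMaster("spark://my-spark-cluster:7077")   # placeholder cluster service name
        .set("spark.driver.host", driver_host)        # address advertised to the workers
        .set("spark.driver.bindAddress", "0.0.0.0"))  # listen on all interfaces inside the pod

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```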

elmiko commented 5 years ago

i'm wondering if we should bring this issue through to the radanalytics.io repo so that we can change the upstream behavior?

i guess we could also ensure that DRIVER_HOST is set for any cluster where we know it (a SparkApplication for example).

jeynesrya commented 5 years ago

Is there a solution to this? I don't quite understand the DRIVER_HOST setting you mentioned, @jkremser, because each driver will have a different, unique fqdn.

elmiko commented 5 years ago

@jeynesrya DRIVER_HOST is an environment variable that we can set on the spark cluster nodes to inform them of the fully qualified domain name (fqdn) for the driver.

in spark 2.3 there was a change that required the workers in a cluster to know the fqdn for the driver, as this is the new style for network addressing (as opposed to the older ip-address method). in some of the radanalytics tooling there is a check to determine the kubernetes service name for the driver, and this then gets used in the spark cluster.

you could manually set the DRIVER_HOST value when spawning a cluster if you are having issues. a better solution, though, would be to set spark.driver.host in the configuration for your driver application when connecting to the spark cluster. you can read a little more about it in the spark docs.
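
a rough sketch of the spark.driver.host approach, assuming you have created a kubernetes service (called `my-driver-svc` below, which is made up) that selects the driver pod and exposes the two ports shown:

```python
# sketch only: the service name, namespace, and port numbers are assumptions,
# not something the operator sets up for you.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://my-spark-cluster:7077")                      # placeholder cluster service
         .config("spark.driver.host", "my-driver-svc.my-project.svc")  # fqdn the workers can resolve
         .config("spark.driver.port", "20020")                         # fixed port exposed by the service
         .config("spark.blockManager.port", "20021")                   # fixed port exposed by the service
         .getOrCreate())
```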

jeynesrya commented 4 years ago

@elmiko thanks for the info. I am running the spark operator in OpenShift Origin (OKD) and seem to be hitting an issue: I submit a job from within the master, the driver is created, and then the executors are created, but when the executors try to communicate I get an UnknownHostException on cluster-w-h6dfb (as an example). I have added both DRIVER_HOST and spark.driver.host and neither seems to help. It's almost as though the workers can't resolve other workers' DNS names. I have exec'd into the containers and they're able to ping each other's IP addresses but not DNS names. If this is out of scope for this issue, I'll raise another one with more details, but is this something that can be resolved using the above steps? If so, how? Are there any examples?

elmiko commented 4 years ago

@jeynesrya i'm not sure if it's the same issue as this one. i would need to know a little more; it would probably be best to open a new issue and describe how you encountered the problem. for example: step 1, i installed spark-operator; step 2, i launched a cluster; step 3... you get the idea.

to me, it sounds like you are just having some issues getting the workflow correct with the spark cluster and your driver application. i don't think it's a problem with the spark-operator, but i'm happy to help you find the issue =)

jeynesrya commented 4 years ago

@elmiko I've opened an issue: #252