jkremser opened 6 years ago
i'm wondering if we should bring this issue through to the radanalytics.io repo so that we can change the upstream behavior?
i guess we could also ensure that DRIVER_HOST is set for any cluster where we know it (a SparkApplication for example).
Is there a solution to this? I don't quite understand the DRIVER_HOST setting you mentioned @jkremser because each driver will have a different, unique fqdn.
@jeynesrya DRIVER_HOST is an environment variable that we can set in the spark cluster nodes to inform them of the fully qualified domain name (FQDN) for the driver.
in spark 2.3 there was a change that required the workers in a cluster to know the fqdn for the driver, as this is the new style for network addressing (as opposed to the older ip-address method). in some of the radanalytics tooling, there is a check to determine the kubernetes service name for the driver, and this then gets used in the spark cluster.
you could manually set the DRIVER_HOST value when spawning a cluster if you are having issues. a better solution though would be to set spark.driver.host in your configuration for the driver application when connecting to the spark cluster. you can read a little more about it here in the spark docs
@elmiko thanks for the info. I am running the spark operator in OpenShift Origin (OKD) and seem to be hitting an issue: I submit a job from within the master, the driver is created, then the executors are created, but when the executors try to communicate I get an UnknownHostException on cluster-w-h6dfb (as an example). I have added both DRIVER_HOST and spark.driver.host and neither seems to help. It's almost as though the workers can't resolve other workers' DNS names. I have exec'd into the containers and they're able to ping each other's IP addresses but not DNS names. If this is out of scope for this issue, I'll raise another one with more details, but is this something that can be resolved using the above steps? If so, how? Are there any examples?
@jeynesrya i'm not sure if it's the same issue as this. i would need to know a little more, it would probably be best to open a new issue and describe how you encountered the problem. for example, step 1. i installed spark-operator, step 2. i launched a cluster, step 3... you get the idea.
to me, it sounds like you are just having some issues getting the workflow correct with the spark cluster and your driver application. i don't think it's an issue with the spark-operator, but i'm happy to help you find the issue =)
@elmiko I've opened an issue: #252
if the master and workers have the DRIVER_HOST set to the service that points to the driver, it should fix the issue...
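as a sketch, that could be an env entry on the master and worker containers (the service and namespace names below are hypothetical):

```yaml
# fragment of a master/worker container spec; names are hypothetical
env:
  - name: DRIVER_HOST
    value: my-driver-svc.my-project.svc.cluster.local
```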
resources.yaml should provide some tips.

workaround: in a notebook you can reconfigure the spark conf to see the driver