Closed: ljakob closed this issue 6 years ago.
hi there,
this is highlighting a gap in our tooling currently. i imagine that you are trying to connect a jupyter notebook to a spark cluster?
assuming this is the case, there are difficulties in getting spark 2.3 connected between the notebook and a spark cluster, so it's probably easier to use our spark-2.2 tooling. to do this you will need an archived copy of the resources.yaml file. this template https://github.com/radanalyticsio/radanalyticsio.github.io/blob/master/openshift/resources-v0.4.0.yaml is a previous release that still uses spark-2.2 for the oshinko tooling. the only downside here is that the manifest doesn't contain the jupyter template, so you will need to install that manually.
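for example, loading the archived manifest with the oc client could look something like this (a rough sketch; the raw.githubusercontent.com form of the URL is my guess at the direct file link, and you would run this against whatever project you are using):

# load the archived spark-2.2 oshinko resources into the current openshift project
oc create -f https://raw.githubusercontent.com/radanalyticsio/radanalyticsio.github.io/master/openshift/resources-v0.4.0.yaml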
under the assumption that you are looking to connect the notebook to spark, you might also want to check out this blog post that i wrote: https://elmiko.github.io/2018/08/05/attaching-notebooks-with-radanalytics.html it will walk you through the process of connecting these pieces using the current release.
we haven't updated the jupyter image to spark 2.3 yet, but if you wanted to experiment with changing the spark version i would recommend looking at this repo https://github.com/radanalyticsio/base-notebook . you will need to change the Dockerfile FROM line to use radanalyticsio/openshift-spark:2.3-latest as the image.
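for reference, the edit to the base-notebook Dockerfile would look roughly like this (just a sketch, assuming only the FROM line needs to change and the rest of the repo's Dockerfile stays as-is):

# build the notebook image on top of the spark 2.3 base image
FROM radanalyticsio/openshift-spark:2.3-latest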
hope this helps, good luck!
Hi,
thanks for the quick response. I've already patched a local version of https://github.com/radanalyticsio/base-notebook and it works (I just changed the base image to Spark 2.3 as you suggested). Now I've run into another problem with the Spark cluster, but that needs further investigation.
Thanks
Leif
Hi,
it's working. The remaining problem was that the cluster workers couldn't connect back to Jupyter, caused by the DNS logic of OpenShift. The following code runs fine on the cluster:
print('hello world 1\n')

import pyspark
import socket
import random

# connect to the spark master service inside the cluster
conf = pyspark.SparkConf().setMaster('spark://demo123:7077')
# advertise the pod's IP address instead of its hostname so the
# workers can reach the driver running in the Jupyter pod
conf = conf.set("spark.driver.host", socket.gethostbyname(socket.gethostname()))
conf = conf.setAppName('demoApp')
sc = pyspark.SparkContext(conf=conf)

# estimate pi by sampling random points in the unit square
num_samples = 1000000

def inside(p):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = float(4 * count) / num_samples
print(pi)

sc.stop()
print('hello world 2\n')
The trick was to set spark.driver.host to the IP address of the pod running Jupyter rather than its local hostname. It's not elegant, but it works.
Thanks for the nice OpenShift templates!
Leif
no problem, glad you figured out the host ip trick. we plan on smoothing that out but are working through a few issues. i talk a little about the notebook issue at the bottom of this blog https://elmiko.github.io/2018/05/18/python3-coming-to-radanalytics.html
i'm going to close out this issue, thanks again =)
Hi, I'm quite new to the Spark world, but I ran into the following problem:
I tried to patch https://radanalytics.io/resources.yaml to fall back to Spark 2.2, but that didn't work.
Can you help me, or update the Jupyter image to Spark 2.3?
Thanks
Leif