radanalyticsio / openshift-spark


different versions of Openshift + Jupyter #70

Closed · ljakob closed this issue 6 years ago

ljakob commented 6 years ago

Hi, I'm quite new to the Spark world, but I ran into the following problem:

I tried to patch https://radanalytics.io/resources.yaml to fall back to Spark 2.2 but that didn't work.

Can you help me or update the jupyter image to spark 2.3?

Thanks

Leif

elmiko commented 6 years ago

hi there,

this is highlighting a gap in our tooling at the moment. i imagine you are trying to connect a jupyter notebook to a spark cluster?

assuming this is the case, there are difficulties in getting spark 2.3 connected between the notebook and a spark cluster, so it's probably easier to use our spark-2.2 tooling. to do this you will need an archived copy of the resources.yaml file. this template https://github.com/radanalyticsio/radanalyticsio.github.io/blob/master/openshift/resources-v0.4.0.yaml is a previous release that still uses spark-2.2 for the oshinko tooling. the only downside is that the manifest doesn't contain the jupyter template, so you will need to install that manually (see the sketch below).
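as a rough sketch, assuming you are logged in with the oc cli and pointed at the project you want to use (the url below is just the raw form of the template linked above), loading the archived manifest would look something like:

# load the archived spark-2.2 oshinko templates into the current project
oc create -f https://raw.githubusercontent.com/radanalyticsio/radanalyticsio.github.io/master/openshift/resources-v0.4.0.yaml

# note: the jupyter template is not in this manifest, it has to be created separately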

under the assumption that you are looking to connect the notebook to spark, you might also want to check out this blog post i wrote, which walks you through the process of connecting these bits using the current release: https://elmiko.github.io/2018/08/05/attaching-notebooks-with-radanalytics.html

we haven't updated the jupyter image to spark 2.3 yet, but if you want to experiment with changing the spark version i would recommend looking at https://github.com/radanalyticsio/base-notebook. you will need to change the Dockerfile FROM line to use radanalyticsio/openshift-spark:2.3-latest as the base image (see the sketch below).
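for reference, the change would just be the first line of that repo's Dockerfile, something like:

# build the notebook image on top of the spark 2.3 base
FROM radanalyticsio/openshift-spark:2.3-latest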

hope this helps, good luck!

ljakob commented 6 years ago

Hi,

thanks for the quick response. I've already patched a local version of https://github.com/radanalyticsio/base-notebook and it works (I just changed the base image to Spark 2.3 as you suggested). Now I've run into another problem with the spark cluster, but that needs further investigation.

Thanks

Leif

ljakob commented 6 years ago

Hi,

it's working. The remaining problem was that the cluster workers couldn't connect back to Jupyter because of the DNS logic of OpenShift. The following code runs fine on the cluster:

import pyspark
import random
import socket

print('hello world 1\n')

# advertise the driver by pod IP instead of pod hostname so the
# spark workers can connect back to jupyter (the workers cannot
# resolve the pod's local hostname through openshift DNS)
conf = pyspark.SparkConf().setMaster('spark://demo123:7077')
conf = conf.set('spark.driver.host', socket.gethostbyname(socket.gethostname()))
conf = conf.setAppName('demoApp')
sc = pyspark.SparkContext(conf=conf)

# monte carlo estimate of pi: the fraction of random points in the
# unit square that land inside the quarter circle approaches pi/4
num_samples = 1000000

def inside(p):
    # the element p is ignored; each call just draws a fresh random point
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()

pi = float(4 * count) / num_samples
print(pi)

sc.stop()

print('hello world 2\n')

The trick was to set spark.driver.host to the IP address rather than the local hostname of the pod running Jupyter, since the workers cannot resolve that hostname. It's not elegant, but it works.

Thanks for the nice Openshift templates!

Leif

elmiko commented 6 years ago

no problem, glad you figured out the host ip trick. we plan on smoothing that out but are working through a few issues. i talk a little about the notebook issue at the bottom of this blog post: https://elmiko.github.io/2018/05/18/python3-coming-to-radanalytics.html

elmiko commented 6 years ago

i'm going to close out this issue, thanks again =)