radanalyticsio / openshift-spark


Allow spark to be installed as an s2i build #66

Closed · tmckayus closed this 6 years ago

tmckayus commented 6 years ago

This is an initial change to allow spark to be installed during an s2i build if it is left out of the creation of the openshift-spark image.

This is a step toward a mechanism which will allow users to build openshift-spark images with custom spark installs in OpenShift Origin rather than modifying github repos and running local builds.
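
In rough terms, the workflow this enables might look like the following sketch (image and build names are illustrative, and the URLs are placeholders; the concrete commands are worked out in the comments below):

$ make -f Makefile.inc build    # build an "incomplete" image with no spark distribution baked in
$ oc new-build --name=openshift-spark --binary --docker-image=openshift-spark-inc:latest -e SPARK_URL=<spark-tgz-url> -e SPARK_MD5_URL=<md5-url>
$ oc start-build openshift-spark    # the build fetches spark, verifies it, and pushes a completed image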

tmckayus commented 6 years ago

@crobby @elmiko ptal

elmiko commented 6 years ago

i am trying to test out the functionality of this pr, so far this is what i've gotten

  1. i ran make in the root of the project and all images built
  2. i did a docker run --rm -it --entrypoint=/bin/sh openshift-spark:latest and then checked to see what was installed; i see the latest spark is installed, so this looks like a standard community image at this point.
  3. i pushed the image from step 1 to my project-related registry 172.30.1.1:5000/foo
  4. i took the openshift-spark image and ran the following command inspired by the build_env_var test from install_spark.sh: oc new-build --name=spark --docker-image=172.30.1.1:5000/foo/openshift-spark:latest -e SPARK_URL=https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz -e SPARK_MD5_URL=https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-without-hadoop.tgz.md5 --binary
  5. i tried to run a build as per the install_spark.sh test, but got an error:
    [mike@shift openshift-spark]$ oc start-build spark
    error: Build configuration foo/spark has no valid source inputs, if this is a binary build you must specify one of '--from-dir', '--from-repo', or '--from-file'

not sure what to do here; this appears to be exactly what the test is doing after the new-build command, but for some reason i am getting this error.

is there a better way to test this?

also, i think we need some instructions on how to do this; i am having a really difficult time figuring out what i'm supposed to be doing.
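
(as an aside, since new-build was run with --binary, it looks like start-build needs a binary payload; the paths below are just placeholders, but something along these lines would presumably get past that error)

$ oc start-build spark --from-dir=<dir-containing-the-spark-tgz-and-md5>
$ oc start-build spark --from-file=<single-input-file>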

elmiko commented 6 years ago

just for completeness, here is my entire output

[mike@shift openshift-spark]$ oc new-build --name=spark --docker-image=172.30.1.1:5000/foo/openshift-spark:latest -e SPARK_URL=https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz -e SPARK_MD5_URL=https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-without-hadoop.tgz.md5 --binary
W0914 15:06:22.175344   17306 dockerimagelookup.go:233] Docker registry lookup failed: Get https://172.30.1.1:5000/v2/: http: server gave HTTP response to HTTPS client
W0914 15:06:22.204925   17306 newapp.go:464] Could not find an image stream match for "172.30.1.1:5000/foo/openshift-spark:latest". Make sure that a Docker image with that tag is available on the node for the build to succeed.
--> Found Docker image 2da8c62 (40 minutes old) from 172.30.1.1:5000 for "172.30.1.1:5000/foo/openshift-spark:latest"

    * A Docker build using binary input will be created
      * The resulting image will be pushed to image stream "spark:latest"
      * A binary build was created, use 'start-build --from-dir' to trigger a new build

--> Creating resources with label build=spark ...
    imagestream "spark" created
    buildconfig "spark" created
--> Success
[mike@shift openshift-spark]$ oc start-build spark
error: Build configuration foo/spark has no valid source inputs, if this is a binary build you must specify one of '--from-dir', '--from-repo', or '--from-file'

tmckayus commented 6 years ago

Ah, there is a second makefile and a second set of images.

"make -f Makefile.inc" will build openshift-spark-inc and openshift-spark-inc-py36

These are the images that can be completed.

Ultimately we'll have a script for this; we're not there yet.
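
After running that build, a quick sanity check (image names taken from the targets above):

$ docker images | grep openshift-spark-inc    # should list openshift-spark-inc and openshift-spark-inc-py36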

tmckayus commented 6 years ago

Here is a test that can be run to show that an incomplete image used to deploy a cluster fails with a usage script (the incomplete image has to be tagged so oshinko can use it):

$ make -f Makefile.inc build
$ docker tag openshift-spark-inc:latest 172.30.1.1:5000/myproject/openshift-spark-inc:latest
$ docker login -u developer -p $(oc whoami -t) 172.30.1.1:5000
$ docker push 172.30.1.1:5000/myproject/openshift-spark-inc:latest
$ oshinko create mary --image=172.30.1.1:5000/myproject/openshift-spark-inc:latest

Check the logs for the master and worker (not sure how to force openshift to preserve blank lines in the output).
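
For example, something like this (pod names will vary; this assumes the oshinko-created master and worker pods carry the cluster name):

$ oc get pods | grep mary
$ oc logs <mary-master-pod>    # should show the usage output from the incomplete image
$ oc logs <mary-worker-pod>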

Here's how to complete an image using env vars (with oc cluster up, the image can be used straight from the local docker daemon in this case; no need to tag):

$ oc new-build --name=openshift-spark --binary --docker-image=openshift-spark-inc:latest -e SPARK_URL=https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz -e SPARK_MD5_URL=https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz.md5
$ oc start-build openshift-spark
$ oc logs -f buildconfig/openshift-spark
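
To double-check that the completed image landed in the project (names follow the new-build above):

$ oc get imagestream openshift-spark
$ oc get istag openshift-spark:latest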

To complete an image using files from a local directory:

$ mkdir buildfiles
$ wget https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz -O buildfiles/spark-2.2.1-bin-hadoop2.7.tgz
$ wget https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz.md5 -O buildfiles/spark-2.2.1-bin-hadoop2.7.tgz.md5
$ oc new-build --name=openshift-spark --binary --docker-image=openshift-spark-inc:latest
$ oc start-build openshift-spark --from-dir=buildfiles
$ oc logs -f buildconfig/openshift-spark

A successful build will push a completed image to your project. To run the completed image with oshinko:

$ oshinko_linux_amd64/oshinko create molly --image=172.30.1.1:5000/myproject/openshift-spark:latest

From here, using buildfiles, you can do things like change the md5 file to make the build fail, leave out the spark tgz altogether, replace the spark tgz with a file that's not actually a tar, etc., and check that the build fails.
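
For example, one quick negative test along those lines (file name matches the wget commands above; the garbage md5 should cause the verification step to fail the build):

$ echo "not-a-real-md5" > buildfiles/spark-2.2.1-bin-hadoop2.7.tgz.md5
$ oc start-build openshift-spark --from-dir=buildfiles
$ oc logs -f buildconfig/openshift-spark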