Closed: jkremser closed this 4 years ago
thanks Jirka!
this looks pretty straightforward, i will give it a test locally.
i tested this with the following
$ oc version
oc v3.11.0+bcca01e-65
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://shift.opb.studios:8443
openshift v3.11.0+bcca01e-65
kubernetes v1.11.0+d4cacc0
and used the oshinko-cli tool to deploy
$ oshinko version
oshinko 0.5.4-7acbd8382
Default spark image: radanalyticsio/openshift-spark:2.3-latest
$ oshinko create --image=quay.io/elmiko/openshift-spark:check_master foo
shared cluster "foo" created
$ oshinko create --image=quay.io/elmiko/openshift-spark-py36:check_master bar
shared cluster "bar" created
both clusters deployed and ran as expected.
as an experiment, i tried to deploy these images with the spark-operator but they failed. i'm not sure why though, any thoughts?
..any thoughts?
I've just tried them with the operator and it worked. Are you deploying the correct images? I ran make build on this project and it created
REPOSITORY             TAG      IMAGE ID       CREATED         SIZE
openshift-spark-py36   latest   715659442a4d   8 minutes ago   1.14 GB
openshift-spark        latest   18665b389f99   9 minutes ago   906 MB
It looks like you are deploying quay.io/elmiko/openshift-spark:check_master and quay.io/elmiko/openshift-spark-py36:check_master. Can you run docker images (or the podman equivalent) and check whether their creation time corresponds with the build time?
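Something along these lines should show it (illustrative commands, adjust to the tags you actually deployed; podman images / podman inspect behave the same way):
$ docker images quay.io/elmiko/openshift-spark
$ docker images quay.io/elmiko/openshift-spark-py36
$ docker inspect --format '{{.Created}}' quay.io/elmiko/openshift-spark:check_master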
I tried it with (Prometheus) metrics enabled and disabled, and with the web UI enabled and disabled (with my branch from the PR), using a ConfigMap like this:
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-spark-cluster
  labels:
    radanalytics.io/kind: SparkCluster
data:
  config: |-
    metrics: "false|true"
    sparkWebUI: "false|true"
    customImage: openshift-spark:latest
    worker:
      instances: "2"
    master:
      instances: "1"
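Creating it is just the usual oc create/apply (the file name below is only illustrative):
$ oc apply -f my-spark-cluster-cm.yaml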
oc get pods                                      localhost.localdomain: Tue Feb 26 13:08:55 2019
NAME                             READY     STATUS    RESTARTS   AGE
my-spark-cluster-m-2wcgv         1/1       Running   0          5m
my-spark-cluster-w-6cpfk         1/1       Running   0          5m
my-spark-cluster-w-bdn9r         1/1       Running   0          5m
spark-operator-385794169-rmxlc   1/1       Running   0          11m
λ oc describe po my-spark-cluster-w-6cpfk | grep Image
Image: openshift-spark:latest
Image ID: docker://sha256:18665b389f99cc0b2674e13e536fe0b83962a7cc8e0fb3ec0eb26f5b494e5d9f
it's 100% the new image I am running:
λ docker exec 08e406a6a920 cat /launch.sh | grep MASTER_HOST_AND_PORT
_MASTER_HOST_AND_PORT=$(echo $SPARK_MASTER_ADDRESS | sed -r 's;.*//(.*):(.*);\1/\2;g')
timeout 1 sh -c "(</dev/tcp/$_MASTER_HOST_AND_PORT) &>/dev/null"
It's weird; can you please also post the logs from the workers?
I've just tried them with the operator and it worked. Are you deploying the correct images?
i will set it up and test again today, hopefully i just made a simple mistake.
i tried this again with both CRD and ConfigMap deployments from the spark-operator. i am also using the master branch of the spark-operator that i built locally.
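for reference, the CRD side of it is just a minimal SparkCluster along these lines (a sketch based on the operator's examples, not my exact manifest):
apiVersion: radanalytics.io/v1
kind: SparkCluster
metadata:
  name: my-spark-cluster
spec:
  customImage: quay.io/elmiko/openshift-spark:check_master
  worker:
    instances: "2"
  master:
    instances: "1"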
i inspected the images and confirmed that the expected changes were in launch.sh
logs from the worker
Starting worker, will connect to: spark://spark-fbd1:7077
Waiting for spark master to be available ...
19/02/26 14:47:27 INFO Worker: Started daemon with process name: 11@spark-fbd1-w-4m2xl
19/02/26 14:47:27 INFO SignalUtils: Registered signal handler for TERM
19/02/26 14:47:27 INFO SignalUtils: Registered signal handler for HUP
19/02/26 14:47:27 INFO SignalUtils: Registered signal handler for INT
19/02/26 14:47:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/02/26 14:47:28 INFO SecurityManager: Changing view acls to: 1000540000
19/02/26 14:47:28 INFO SecurityManager: Changing modify acls to: 1000540000
19/02/26 14:47:28 INFO SecurityManager: Changing view acls groups to:
19/02/26 14:47:28 INFO SecurityManager: Changing modify acls groups to:
19/02/26 14:47:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(1000540000); groups with view permissions: Set(); users with modify permissions: Set(1000540000); groups with modify permissions: Set()
19/02/26 14:47:28 INFO Utils: Successfully started service 'sparkWorker' on port 36923.
19/02/26 14:47:28 INFO Worker: Starting Spark worker 10.128.1.237:36923 with 1 cores, 14.1 GB RAM
19/02/26 14:47:28 INFO Worker: Running Spark version 2.3.0
19/02/26 14:47:28 INFO Worker: Spark home: /opt/spark
19/02/26 14:47:28 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
19/02/26 14:47:28 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://spark-fbd1-w-4m2xl:8081
19/02/26 14:47:28 INFO Worker: Connecting to master spark-fbd1:7077...
19/02/26 14:47:28 INFO TransportClientFactory: Successfully created connection to spark-fbd1/172.30.217.118:7077 after 43 ms (0 ms spent in bootstraps)
19/02/26 14:47:28 INFO Worker: Successfully registered with master spark://10.128.1.236:7077
19/02/26 14:47:28 INFO Worker: WorkerWebUI is available at //proxy/worker-20190226144728-10.128.1.237-36923
Received a termination signal
Stopping subprocess 11
19/02/26 14:47:55 ERROR Worker: RECEIVED SIGNAL TERM
Subprocess stopped
@tmckayus @crobby ptal
Jenkins is failing on:
oc login https://et35.et.eng.bos.redhat.com:8443 -u *** -p *** --insecure-skip-tls-verify=true
error: dial tcp 10.19.47.76:8443: getsockopt: no route to host - verify you have provided the correct host and port and that the server is currently running.
I have no idea how to make it pass, the et35.et.eng.bos.redhat.com host is just not there.
I have no idea how to make it pass, the et35.et.eng.bos.redhat.com host is just not there.
i'm not sure it's possible to make that test pass. iirc, the jenkins was meant for use on a different cluster, but i think @tmckayus could provide more details.
Resubmitted this as https://github.com/radanalyticsio/openshift-spark/pull/115 to remove conflicts
Checking if Spark master is up and running by trying port 7077 instead of depending on master's web ui service
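The wait in the worker's launch script boils down to a plain TCP probe; roughly like this (a sketch of the idea rather than the exact launch.sh code, assuming bash, since the /dev/tcp redirection is a bash feature, and a SPARK_MASTER_ADDRESS of the form spark://host:7077):
# derive "host/port" from e.g. spark://my-cluster-m:7077
_MASTER_HOST_AND_PORT=$(echo "$SPARK_MASTER_ADDRESS" | sed -r 's;.*//(.*):(.*);\1/\2;g')

# block until the master accepts TCP connections on its RPC port (7077)
while ! timeout 1 bash -c "(</dev/tcp/$_MASTER_HOST_AND_PORT) &>/dev/null"; do
    echo "Waiting for spark master to be available ..."
    sleep 1
done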