radanalyticsio / openshift-spark

Using 7077 port to check if master is listening #78

Closed · jkremser closed this 4 years ago

jkremser commented 5 years ago

Check whether the Spark master is up and running by probing port 7077, instead of depending on the master's web UI service.
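
Roughly, the idea is something like this (just a sketch; the host and variable names here are illustrative, and the actual change lives in launch.sh, quoted further down the thread):

# probe the master's port 7077 directly instead of curling its web UI
# (/dev/tcp needs bash; use bash -c instead if /bin/sh is not bash in the image)
SPARK_MASTER_ADDRESS="spark://spark-master:7077"
_HOST_AND_PORT=$(echo "$SPARK_MASTER_ADDRESS" | sed -r 's;.*//(.*):(.*);\1/\2;g')
until timeout 1 sh -c "(</dev/tcp/$_HOST_AND_PORT) &>/dev/null"; do
    echo "Waiting for spark master to be available ..."
    sleep 1
done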

elmiko commented 5 years ago

thanks Jirka!

this looks pretty straightforward, i will give it a test locally.

elmiko commented 5 years ago

i tested this with the following

$ oc version
oc v3.11.0+bcca01e-65
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://shift.opb.studios:8443
openshift v3.11.0+bcca01e-65
kubernetes v1.11.0+d4cacc0

and used the oshinko-cli tool to deploy

$ oshinko version
oshinko 0.5.4-7acbd8382
Default spark image: radanalyticsio/openshift-spark:2.3-latest
$ oshinko create --image=quay.io/elmiko/openshift-spark:check_master foo                                                            
shared cluster "foo" created
$ oshinko create --image=quay.io/elmiko/openshift-spark-py36:check_master bar
shared cluster "bar" created 

both clusters deployed and ran as expected.

as an experiment, i tried to deploy these images with the spark-operator, but they failed. i'm not sure why, though. any thoughts?

jkremser commented 5 years ago

..any thoughts?

I've just tried them with the operator and it worked. Are you deploying the correct images? I ran make build on this project and it created these images:

openshift-spark-py36                         latest              715659442a4d        8 minutes ago       1.14 GB
openshift-spark                              latest              18665b389f99        9 minutes ago       906 MB

It looks like you are deploying quay.io/elmiko/openshift-spark:check_master and quay.io/elmiko/openshift-spark-py36:check_master. Can you run docker images (or the podman equivalent) and check whether their creation time corresponds with the build time?
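
For example, something like this (repository names taken from the images you mentioned; podman users can swap podman in for docker):

docker images quay.io/elmiko/openshift-spark
docker images quay.io/elmiko/openshift-spark-py36

The CREATED column should roughly match the time you ran make build.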

I tried it with (Prometheus) metrics enabled and disabled, and with the web UI enabled and disabled (with my branch from the PR).

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-spark-cluster
  labels:
    radanalytics.io/kind: SparkCluster
data:
  config: |-
    metrics: "false|true"
    sparkWebUI: "false|true"
    customImage: openshift-spark:latest
    worker:
      instances: "2"
    master:
      instances: "1"
oc get pods                                                                          localhost.localdomain: Tue Feb 26 13:08:55 2019

NAME                             READY     STATUS    RESTARTS   AGE
my-spark-cluster-m-2wcgv         1/1       Running   0          5m
my-spark-cluster-w-6cpfk         1/1       Running   0          5m
my-spark-cluster-w-bdn9r         1/1       Running   0          5m
spark-operator-385794169-rmxlc   1/1       Running   0          11m
λ oc describe po my-spark-cluster-w-6cpfk | grep Image
    Image:      openshift-spark:latest
    Image ID:       docker://sha256:18665b389f99cc0b2674e13e536fe0b83962a7cc8e0fb3ec0eb26f5b494e5d9f

it's 100% the new image I am running:

λ docker exec 08e406a6a920 cat /launch.sh | grep MASTER_HOST_AND_PORT
    _MASTER_HOST_AND_PORT=$(echo $SPARK_MASTER_ADDRESS | sed -r 's;.*//(.*):(.*);\1/\2;g')
        timeout 1 sh -c "(</dev/tcp/$_MASTER_HOST_AND_PORT) &>/dev/null"
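
(The sed expression just rewrites the spark://host:port address into the host/port form that bash's /dev/tcp redirection expects, e.g.:)

λ echo "spark://spark-fbd1:7077" | sed -r 's;.*//(.*):(.*);\1/\2;g'
spark-fbd1/7077

so the check opens a TCP connection to /dev/tcp/spark-fbd1/7077 and succeeds as soon as something is listening on port 7077.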

It's weird. Can you please also post the logs from the workers?

elmiko commented 5 years ago

I've just tried them with the operator and it worked. Are you deploying the correct images?

i will set it up and test again today, hopefully i just made a simple mistake.

elmiko commented 5 years ago

i tried this again with both CRD and ConfigMap deployments from the spark-operator. i am also using the master branch of the spark-operator, which i built locally.

i inspected the images and confirmed that the expected changes were in launch.sh
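
for example, something along these lines against the quay.io images (mirroring the docker exec check above; this assumes the entrypoint can be overridden):

λ docker run --rm --entrypoint cat quay.io/elmiko/openshift-spark:check_master /launch.sh | grep MASTER_HOST_AND_PORT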

logs from the worker

Starting worker, will connect to: spark://spark-fbd1:7077
Waiting for spark master to be available ...
19/02/26 14:47:27 INFO Worker: Started daemon with process name: 11@spark-fbd1-w-4m2xl
19/02/26 14:47:27 INFO SignalUtils: Registered signal handler for TERM
19/02/26 14:47:27 INFO SignalUtils: Registered signal handler for HUP
19/02/26 14:47:27 INFO SignalUtils: Registered signal handler for INT
19/02/26 14:47:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/02/26 14:47:28 INFO SecurityManager: Changing view acls to: 1000540000
19/02/26 14:47:28 INFO SecurityManager: Changing modify acls to: 1000540000
19/02/26 14:47:28 INFO SecurityManager: Changing view acls groups to: 
19/02/26 14:47:28 INFO SecurityManager: Changing modify acls groups to: 
19/02/26 14:47:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(1000540000); groups with view permissions: Set(); users  with modify permissions: Set(1000540000); groups with modify permissions: Set()
19/02/26 14:47:28 INFO Utils: Successfully started service 'sparkWorker' on port 36923.
19/02/26 14:47:28 INFO Worker: Starting Spark worker 10.128.1.237:36923 with 1 cores, 14.1 GB RAM
19/02/26 14:47:28 INFO Worker: Running Spark version 2.3.0
19/02/26 14:47:28 INFO Worker: Spark home: /opt/spark
19/02/26 14:47:28 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
19/02/26 14:47:28 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://spark-fbd1-w-4m2xl:8081
19/02/26 14:47:28 INFO Worker: Connecting to master spark-fbd1:7077...
19/02/26 14:47:28 INFO TransportClientFactory: Successfully created connection to spark-fbd1/172.30.217.118:7077 after 43 ms (0 ms spent in bootstraps)
19/02/26 14:47:28 INFO Worker: Successfully registered with master spark://10.128.1.236:7077
19/02/26 14:47:28 INFO Worker: WorkerWebUI is available at //proxy/worker-20190226144728-10.128.1.237-36923
Received a termination signal
Stopping subprocess 11
19/02/26 14:47:55 ERROR Worker: RECEIVED SIGNAL TERM
Subprocess stopped

elmiko commented 5 years ago

@tmckayus @crobby ptal

jkremser commented 5 years ago

Jenkins is failing on:

oc login https://et35.et.eng.bos.redhat.com:8443 -u *** -p *** --insecure-skip-tls-verify=true
error: dial tcp 10.19.47.76:8443: getsockopt: no route to host - verify you have provided the correct host and port and that the server is currently running.

I have no idea how to make it pass; the et35.et.eng.bos.redhat.com host is just not there.

elmiko commented 5 years ago

I have no idea how to make it pass; the et35.et.eng.bos.redhat.com host is just not there.

i'm not sure it's possible to make that test pass. iirc, the jenkins was meant for use on a different cluster, but i think @tmckayus could provide more details.

tmckayus commented 4 years ago

Resubmitted this as https://github.com/radanalyticsio/openshift-spark/pull/115 to remove conflicts.