radanalyticsio / spark-operator

Operator for managing Spark clusters on Kubernetes and OpenShift.
Apache License 2.0

Create example with persistent volumes #154

Open jkremser opened 6 years ago

jkremser commented 6 years ago

ssia

reynoldsm88 commented 5 years ago

Is this a persistent volume for the history server using the sharedVolume tag? Or a shared file volume for data between master/worker?

reynoldsm88 commented 5 years ago

@elmiko @Jiri-Kremser Am I on the right track here or wildly off target?

jkremser commented 5 years ago

@reynoldsm88 this looks good

It's fine from the Kubernetes/OpenShift perspective. It would create a durable data directory under /etc/spark/data, and the data in it should survive the pod's restarts. Could you please also come up with a useful example from the Spark point of view, where such a persistent volume would make sense?

some ideas:

The ideal form would probably be a markdown file describing the use case.
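
As one possible Spark-side use case (illustrative only, not something settled in this thread): the durable directory could hold Spark event logs or streaming checkpoints, so they survive pod restarts and can be read by a history server. A minimal sketch of such a claim, with the hypothetical Spark settings shown as comments:

# Hypothetical claim backing a durable Spark data directory such as /etc/spark/data.
# A job could then be configured with, for example:
#   spark.eventLog.enabled=true
#   spark.eventLog.dir=file:///etc/spark/data/event-logs
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi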

reynoldsm88 commented 5 years ago

@Jiri-Kremser something is wrong with that config. When I deploy it there are no errors, the PVC looks like it is created, and the master/worker containers come up fine. However, there is no /opt/spark/mount directory (I changed the path because /etc has restricted permissions). Also, looking at the status of the mount after the deployment, it seems it was never bound. Was that the right syntax for mounting that volume into the master/worker?

Michaels-MacBook-Pro-2:spark-operator michael$ oc get pvc
NAME                   STATUS    VOLUME                 CAPACITY   ACCESS MODES   STORAGECLASS   AGE
my-spark-cluster-pvc   Pending   my-spark-cluster-pvc   0                                        12m
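
To see why the claim is stuck in Pending, the events from a describe call are usually the quickest pointer (generic kubectl/oc usage, not output captured in this thread):

oc describe pvc my-spark-cluster-pvc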

reynoldsm88 commented 5 years ago

I'll just post my current config here so no one has to hunt around for it:

--- 
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-spark-cluster-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  volumeName: my-spark-cluster-pvc
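  # volumeName binds this claim only to a PV named exactly "my-spark-cluster-pvc"; if no such PV exists, the claim stays Pending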

---
apiVersion: radanalytics.io/v1
kind: SparkCluster
metadata:
  name: cluster-with-shared-volume-2
spec:
  master:
    instances: "1"
    volumeMounts:
      mountPath: /opt/spark/mount
      name: spark-data
  worker:
    instances: "2"
    volumeMounts:
      mountPath: /opt/spark/mount
      name: spark-data
  volumes:
    name: spark-data
    persistentVolumeClaim:
        claimName: my-spark-cluster-pvc

jkremser commented 5 years ago

If it's in the Pending state, it can't find a matching PV. In oc cluster up mode a couple of persistent volumes are pre-created; I'm not sure if that is the case for minishift. If not, just create a simple PV with a hostPath.
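
For example, a minimal hostPath PV sketch that the claim above could bind to (the path is illustrative; the name has to match the volumeName the PVC asks for):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-spark-cluster-pvc
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /tmp/spark-data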

reynoldsm88 commented 5 years ago

Hmm, not sure if my config is correct... I updated it to use one of the pre-created PVs that minishift creates. However, when I deploy that YAML it's almost as if the PVC elements on the master/worker are not processed at all. Here is the YAML output of the Spark master RC that's created with that config; I don't see anything about the volumes there:

apiVersion: v1
kind: ReplicationController
metadata:
  creationTimestamp: 2019-03-27T17:31:24Z
  generation: 3
  labels:
    radanalytics.io/SparkCluster: cluster-with-shared-volume
    radanalytics.io/kind: SparkCluster
    radanalytics.io/rcType: master
  name: cluster-with-shared-volume-m
  namespace: myproject
  resourceVersion: "25102"
  selfLink: /api/v1/namespaces/myproject/replicationcontrollers/cluster-with-shared-volume-m
  uid: 24fa0cbb-50b6-11e9-88fa-86e8f67b01ef
spec:
  replicas: 1
  selector:
    radanalytics.io/SparkCluster: cluster-with-shared-volume
    radanalytics.io/deployment: cluster-with-shared-volume-m
    radanalytics.io/kind: SparkCluster
  template:
    metadata:
      creationTimestamp: null
      labels:
        radanalytics.io/SparkCluster: cluster-with-shared-volume
        radanalytics.io/deployment: cluster-with-shared-volume-m
        radanalytics.io/kind: SparkCluster
        radanalytics.io/podType: master
      namespace: myproject
    spec:
      containers:
      - env:
        - name: OSHINKO_SPARK_CLUSTER
          value: cluster-with-shared-volume
        image: quay.io/jkremser/openshift-spark:2.4.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 6
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: cluster-with-shared-volume-m
        ports:
        - containerPort: 7077
          name: spark-master
          protocol: TCP
        - containerPort: 8080
          name: spark-webui
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - curl -s localhost:8080 | grep -e Status.*ALIVE
          failureThreshold: 3
          initialDelaySeconds: 2
          periodSeconds: 7
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  observedGeneration: 3
  readyReplicas: 1
  replicas: 1

reynoldsm88 commented 5 years ago

@Jiri-Kremser I spoke with @elmiko earlier today (last day on Slack :,(...) and it seems like we're going to have to alter the schema and do something similar to this.

I'll get to work on this over the weekend/next week. Sound good?

jkremser commented 5 years ago

:+1: sounds good

I don't see anything about the volumes there

yes, because currently there are none (in the default case)

have to alter the schema and do something similar to this.

Yes, you are right: it's probably not doable without changing the RCs and also having this in the SparkCluster schema. There is already something similar for the Spark history server here. We can have something similar under the master and worker sections; right now it's called RCSpec, and if it's present, we can also create the PVC based on that description.

Note: a volumeMount for the main container and a volume for the pod also need to be added; this is done here in the case of the Spark history server and the master's RC definition.
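
As a rough sketch (plain Kubernetes pod-template wiring, not what the operator renders today), the generated RC would need something along these lines:

spec:
  template:
    spec:
      containers:
      - name: cluster-with-shared-volume-m
        # mount the claim into the Spark container
        volumeMounts:
        - name: spark-data
          mountPath: /opt/spark/mount
      # pod-level volume backed by the user's claim
      volumes:
      - name: spark-data
        persistentVolumeClaim:
          claimName: my-spark-cluster-pvc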

reynoldsm88 commented 5 years ago

@Jiri-Kremser It seems like there aren't tests for the CRD-based approach... the YAML processing tests only cover the ConfigMap-based one. Is CM the way you want to go moving forward? Otherwise, is there a particular framework I need to use to parse the YAML the same way k8s does for my CRD example tests?

jkremser commented 5 years ago

@reynoldsm88 there are end-to-end tests for both the CRD approach and the CM approach, check:

here is an example of the output for the CRD-based tests on minikube: https://travis-ci.org/radanalyticsio/spark-operator/jobs/515928988

reynoldsm88 commented 5 years ago

Thanks @Jiri-Kremser, that should work too. I was wondering if there were YAML processing tests to validate the CRD schema as well, but those tests should work for my purposes.

jeynesrya commented 5 years ago

Hi @reynoldsm88 & @jkremser - has any progress been made on this ticket? I want to use this operator for Spark clusters, but I seem to be unable to add a persistent volume claim even when it has bound to a PV and is available to use. Are there any changes I need to make in OpenShift to allow this to work? For example, with the spark-on-k8s-operator SparkApplication CRD, I needed to enable the MutatingAdmissionWebhook in OpenShift before I could mount volumes. Note: I am only trying this with the SparkCluster CRD, not SparkApplication.

jkremser commented 5 years ago

I am looking into this issue. @reynoldsm88 I don't want to step on your toes, though. Do you have some unmerged work related to this issue?