kubeflow / spark-operator

Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Apache License 2.0

Spark Thrift Server (STS) CRD #1116

Open nicholas-fwang opened 3 years ago

nicholas-fwang commented 3 years ago

Spark Thrift Server is a daemon that executes Spark SQL queries submitted through the JDBC/ODBC connector. It is useful as a Hive-compatible execution engine and with BI tools that support JDBC/ODBC.

I have deployed the Thrift Server on Kubernetes as below.

apiVersion: v1
kind: Pod
metadata:
  name: spark-thrift-server
  labels:
    foo: bar
spec:
  containers:
  - name: spark-thrift-server
    image: gcr.io/spark-operator/spark:v3.0.0
    args:
      - /opt/spark/bin/spark-submit
      - --master
      - k8s://https://xx.xx.x.x:443
      - --class
      - org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
      - --deploy-mode
      - client
      - --name
      - spark-sql
      - --hiveconf
      - hive.server2.thrift.port=10000
      - --conf
      - spark.executor.instances=1
      - --conf
      - spark.executor.memory=1G
      - --conf
      - spark.driver.memory=1G
      - --conf
      - spark.executor.cores=1
      - --conf
      - spark.kubernetes.namespace=spark-operator
      - --conf
      - spark.kubernetes.container.image=gcr.io/spark-operator/spark:v3.0.0
      - --conf
      - spark.kubernetes.authenticate.driver.serviceAccountName=spark-operator
      - --conf
      - spark.kubernetes.driver.pod.name=spark-thrift-server
    ports:
    - containerPort: 4040
      name: spark-ui
      protocol: TCP
    - containerPort: 10000
      name: spark-thrift
      protocol: TCP
  serviceAccount: spark-operator
  serviceAccountName: spark-operator
---
apiVersion: v1
kind: Service
metadata:
  name: spark-thrift-server
spec:
  clusterIP: None
  ports:
  - name: spark-ui
    port: 4040
    protocol: TCP
    targetPort: 4040
  - name: spark-thrift
    port: 10000
    protocol: TCP
    targetPort: 10000
  selector:
    foo: bar
  sessionAffinity: None
  type: ClusterIP
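
For anyone trying this out, one quick way to verify the JDBC endpoint is to run beeline (bundled with the Spark distribution under /opt/spark/bin) from a throwaway pod. A minimal sketch, assuming the spark-operator namespace and the Service above; the pod name and the query are illustrative only:

apiVersion: v1
kind: Pod
metadata:
  name: beeline-smoke-test   # hypothetical one-off test pod
  namespace: spark-operator
spec:
  restartPolicy: Never
  containers:
  - name: beeline
    image: gcr.io/spark-operator/spark:v3.0.0
    # beeline ships with the Spark distribution
    command:
    - /opt/spark/bin/beeline
    - -u
    - jdbc:hive2://spark-thrift-server.spark-operator.svc.cluster.local:10000
    - -e
    - SELECT 1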

Since STS must be launched in client mode, the Spark Operator cannot be used for it directly. I therefore propose a new CRD so that STS can be deployed and managed through the Spark Operator, which would have several advantages.
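
For illustration, a resource of the proposed kind might look like the sketch below. The SparkThriftServer kind and all of its fields are hypothetical, invented for this sketch; nothing like this exists in the operator today:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkThriftServer        # hypothetical CRD, not implemented
metadata:
  name: spark-thrift-server
  namespace: spark-operator
spec:
  image: gcr.io/spark-operator/spark:v3.0.0
  thriftPort: 10000            # the operator would expose this via a Service
  driver:
    memory: 1G
    serviceAccount: spark-operator
  executor:
    instances: 1
    cores: 1
    memory: 1G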

manuelneubergerwn commented 3 years ago

We had some problems rewriting this Pod into a Deployment, but solved them by adding "spark.kubernetes.driver.pod.name" and "spark.driver.host" to the conf through environment variables. The problem seems to be that otherwise spark.kubernetes.driver.pod.name needs to match the name of both the created Pod AND the Service (which is a problem for Deployments, since Pod names are generated).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-thrift-server
  labels:
    foo: bar
spec:
  replicas: 1
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      containers:
      - name: spark-thrift-server
        image: gcr.io/spark-operator/spark:v3.0.0
        args:
          - /opt/spark/bin/spark-submit
          - --master
          - k8s://https://xx.xx.x.x:443
          - --class
          - org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
          - --deploy-mode
          - client
          - --name
          - spark-sql
          - --hiveconf
          - hive.server2.thrift.port=10000
          - --conf
          - spark.executor.instances=1
          - --conf
          - spark.executor.memory=1G
          - --conf
          - spark.driver.memory=1G
          - --conf
          - spark.executor.cores=1
          - --conf
          - spark.kubernetes.namespace=spark-operator
          - --conf
          - spark.kubernetes.container.image=gcr.io/spark-operator/spark:v3.0.0
          - --conf
          - spark.kubernetes.authenticate.driver.serviceAccountName=spark-operator
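          # The three --conf pairs below are the fix described above: the generated
          # pod name and pod IP are injected through the Downward API (see the env
          # section), so every replica's driver config matches its own pod.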
          - --conf
          - spark.kubernetes.driver.pod.name=$(THRIFT_POD_NAME)
          - --conf
          - spark.driver.bindAddress=$(THRIFT_POD_IP)
          - --conf
          - spark.driver.host=spark-thrift-server
        env:
        - name: THRIFT_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: THRIFT_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        ports:
        - containerPort: 4040
          name: spark-ui
          protocol: TCP    
        - containerPort: 10000
          name: spark-thrift
          protocol: TCP
      serviceAccount: spark-operator
      serviceAccountName: spark-operator
---
apiVersion: v1
kind: Service
metadata:
  name: spark-thrift-server
spec:
  clusterIP: None
  ports:
  - name: spark-ui
    port: 4040
    protocol: TCP
    targetPort: 4040
  - name: spark-thrift
    port: 10000
    protocol: TCP
    targetPort: 10000
  selector:
    foo: bar
  sessionAffinity: None
  type: ClusterIP

dnskr commented 3 years ago

Hi @fisache and @ManuelNeubergerWN!

I use start-thriftserver.sh to run STS and cannot find the right way to stop it. This is the part of my config that starts and stops STS:

command:
- /bin/sh
- -c
- >
  /opt/spark/sbin/start-thriftserver.sh \
    --name sts \
    --hiveconf hive.server2.thrift.port=10000 \
    --conf spark.driver.host=$(hostname -I) \
    --conf spark.kubernetes.driver.pod.name=$(hostname);
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "/opt/spark/sbin/stop-thriftserver.sh > /proc/1/fd/1"]

On termination, the preStop hook writes the following to the log:

Reading package lists...
Building dependency tree...
Reading state information...
curl is already the newest version (7.64.0-4+deb10u2).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
no org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 to stop

Do you have any suggestions on how this might be solved?

manuelneubergerwn commented 3 years ago

We are only using this on Kubernetes, where no .sh script is required. (Or am I missing your context here?) In our setup we always have the server running, combined with some spot instances which it can scale up and down as required.

dnskr commented 3 years ago

Sorry, I think my previous message was not clear enough :) I use Kubernetes too. I have implemented a Deployment similar to yours, but instead of spark-submit and a long args list I use start-thriftserver.sh and a ConfigMap. Here is my deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sts
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sts
  template:
    metadata:
      labels:
        app: sts
    spec:
      containers:
        - name: spark-thrift-server
          image: gcr.io/spark-operator/spark:v3.1.1
          command:
          - /bin/sh
          - -c
          - >
            /opt/spark/sbin/start-thriftserver.sh \
              --name {{ .Release.Name }} \
              --conf spark.driver.host=$(hostname -I) \
              --conf spark.kubernetes.driver.pod.name=$(hostname);
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "/opt/spark/sbin/stop-thriftserver.sh > /proc/1/fd/1"]
          env:
            - name: SPARK_CONF_DIR
              value: "/opt/spark/conf"
          ports:
            - name: ui
              containerPort: 4040
            - name: thrift
              containerPort: 10000
          volumeMounts:
            - name: config-volume
              mountPath: /opt/spark/conf
      terminationGracePeriodSeconds: 60
      volumes:
        - name: config-volume
          configMap:
            name: sts
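
For reference, the sts ConfigMap supplies spark-defaults.conf to the SPARK_CONF_DIR mount. A sketch with placeholder values (the real settings vary per cluster):

apiVersion: v1
kind: ConfigMap
metadata:
  name: sts
data:
  spark-defaults.conf: |
    # Placeholder settings; adjust for your cluster.
    spark.master                                              k8s://https://kubernetes.default.svc
    spark.kubernetes.container.image                          gcr.io/spark-operator/spark:v3.1.1
    spark.kubernetes.namespace                                spark-operator
    spark.kubernetes.authenticate.driver.serviceAccountName   spark-operator
    spark.executor.instances                                  1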

So the question is how to stop it the right way? The stop-thriftserver.sh script in the preStop hook writes "no org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 to stop".
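
One assumption worth checking: start-thriftserver.sh daemonizes through spark-daemon.sh, which records a PID file under SPARK_PID_DIR (/tmp by default), and stop-thriftserver.sh prints exactly this "no ... to stop" message when it cannot find that PID file. spark-daemon.sh also honors SPARK_NO_DAEMONIZE: if it is set, the server runs in the foreground and no PID file is written, so the container's main process is the Thrift Server itself and the SIGTERM Kubernetes sends on pod deletion stops it without any preStop hook. A minimal container sketch under that assumption:

containers:
  - name: spark-thrift-server
    image: gcr.io/spark-operator/spark:v3.1.1
    env:
      # If set (to any value), spark-daemon.sh runs the command in the
      # foreground instead of daemonizing.
      - name: SPARK_NO_DAEMONIZE
        value: "true"
    command:
      - /bin/sh
      - -c
      - >
        /opt/spark/sbin/start-thriftserver.sh
        --conf spark.driver.host=$(hostname -I)
        --conf spark.kubernetes.driver.pod.name=$(hostname)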

manuelneubergerwn commented 3 years ago

Sorry, I am afraid I do not know the answer :/

dnskr commented 3 years ago

What confuses me is the fact that I cannot find any open source Helm chart or another good way to run STS on Kubernetes. Spark Operator is promising, but it no longer looks like an actively developed project, so client mode (which is needed to run STS) will not be implemented in the near future.

mbode-asys commented 2 years ago

Any news on this? It would be interesting to have native support in the Spark Kubernetes operator...

sangeethsasidharan commented 2 years ago

@dnskr, but the above implementation creates the driver and executors inside a single pod, right? That is not a scalable approach, right?

ruslanguns commented 8 months ago

Wow, this issue has been open since 2020. Is there any existing approach to expose a service like this? Thanks in advance!

dnskr commented 8 months ago

Hi @ruslanguns! I would recommend taking a look at the Apache Kyuubi project. It allows running Spark SQL engines on demand with different share levels (connection, user, group, and server), among other features.
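
As a hypothetical fragment showing where the share level is configured (the kyuubi.engine.share.level key is from the Kyuubi docs; the ConfigMap name and everything else are placeholders):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kyuubi-conf              # hypothetical name
data:
  kyuubi-defaults.conf: |
    # Share level can be CONNECTION, USER, GROUP, or SERVER
    kyuubi.engine.share.level=USER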

ruslanguns commented 8 months ago

It looks great. Thanks for sharing it! I'll take a look.

github-actions[bot] commented 3 days ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.