intel-analytics / BigDL-2.x

BigDL: Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray
https://bigdl.readthedocs.io
Apache License 2.0

Allocation Error in PPML Image in Kubernetes: Initial Job Has Not Accepted Any Resource #5178

Closed: vi0eros closed this issue 2 months ago

vi0eros commented 2 months ago

Description:

When executing the script ppml/trusted-bigdata/scripts/start-pyspark-pi-on-k8s-client-sgx.sh using the $RUNTIME_K8S_SPARK_IMAGE, I encountered a resource allocation error. The error log shows the following message repeatedly:

Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

However, when using the apache/spark:v3.1.3 image, the job runs successfully and completes as expected.
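
For reference, one way to see whether the executor pods ever come up is the following rough sketch (the pod name is a placeholder; run it in the namespace the job is submitted to):

# list the executor pods created by spark-submit (Spark labels them with spark-role=executor)
kubectl get pods -l spark-role=executor
# inspect scheduling events for a pending or failed executor (replace <executor-pod> with the real name)
kubectl describe pod <executor-pod>
# check the executor log once the container has started
kubectl logs <executor-pod>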

Script Details:

The script ppml/trusted-bigdata/scripts/start-pyspark-pi-on-k8s-client-sgx.sh is as follows:

#!/bin/bash

export mode=client && \
secure_password=`openssl rsautl -inkey /ppml/password/key.txt -decrypt </ppml/password/output.bin` && \
TF_MKL_ALLOC_MAX_BYTES=10737418240 && \
SPARK_LOCAL_IP=$LOCAL_IP && \
export sgx_command="/opt/jdk8/bin/java \
  -cp /ppml/spark-${SPARK_VERSION}/conf/:/ppml/spark-${SPARK_VERSION}/jars/*:ppml/trusted-big-data-ml/work/spark-${SPARK_VERSION}/examples/jars/*:/ppml/spark-${SPARK_VERSION}/examples/jars/* \
    -Xmx1g \
    org.apache.spark.deploy.SparkSubmit \
    --master k8s://https://10.170.58.170:6443 \
    --deploy-mode $mode \
    --name spark-pi-sgx \
    --conf spark.driver.host=$LOCAL_IP \
    --conf spark.driver.port=54321 \
    --conf spark.executor.instances=2 \
    --conf spark.executor.memory=1g \
    --conf spark.executor.cores=1 \
    --conf spark.kubernetes.driver.connectionTimeout=60000 \
    --conf spark.kubernetes.driver.requestTimeout=60000 \
    --conf spark.kubernetes.submission.connectionTimeout=60000 \
    --conf spark.kubernetes.submission.requestTimeout=60000 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=$RUNTIME_K8S_SPARK_IMAGE \
    --conf spark.kubernetes.driver.podTemplateFile=/ppml/spark-driver-template.yaml \
    --conf spark.kubernetes.executor.podTemplateFile=/ppml/spark-executor-template.yaml \
    --conf spark.kubernetes.executor.deleteOnTermination=false \
    --conf spark.network.timeout=10000000 \
    --conf spark.executor.heartbeatInterval=10000000 \
    --conf spark.python.use.daemon=false \
    --conf spark.python.worker.reuse=false \
    --conf spark.kubernetes.sgx.enabled=false \
    --conf spark.kubernetes.sgx.driver.jvm.mem=1g \
    --conf spark.kubernetes.sgx.executor.jvm.mem=3g \
    --conf spark.kubernetes.sgx.log.level=error \
    --conf spark.authenticate=true \
    --conf spark.authenticate.secret=$secure_password \
    --conf spark.kubernetes.executor.secretKeyRef.SPARK_AUTHENTICATE_SECRET="spark-secret:secret" \
    --conf spark.kubernetes.driver.secretKeyRef.SPARK_AUTHENTICATE_SECRET="spark-secret:secret" \
    --conf spark.authenticate.enableSaslEncryption=true \
    --conf spark.network.crypto.enabled=true \
    --conf spark.network.crypto.keyLength=128 \
    --conf spark.network.crypto.keyFactoryAlgorithm=PBKDF2WithHmacSHA1 \
    --conf spark.io.encryption.enabled=false \
    --conf spark.io.encryption.keySizeBits=128 \
    --conf spark.io.encryption.keygen.algorithm=HmacSHA1 \
    --conf spark.ssl.enabled=true \
    --conf spark.ssl.port=8043 \
    --conf spark.ssl.keyPassword=$secure_password \
    --conf spark.ssl.keyStore=/ppml/keys/keystore.jks \
    --conf spark.ssl.keyStorePassword=$secure_password \
    --conf spark.ssl.keyStoreType=JKS \
    --conf spark.ssl.trustStore=/ppml/keys/keystore.jks \
    --conf spark.ssl.trustStorePassword=$secure_password \
    --conf spark.ssl.trustStoreType=JKS \
    --class org.apache.spark.examples.SparkPi \
    --verbose \
    --jars local:///ppml/spark-${SPARK_VERSION}/examples/jars/spark-examples_2.12-${SPARK_VERSION}.jar \
    local:///ppml/spark-${SPARK_VERSION}/examples/jars/spark-examples_2.12-${SPARK_VERSION}.jar 3000"
gramine-sgx bash 2>&1 | tee spark-pi-client-sgx.log

Logs with PPML Image:

24-07-07 07:34:14 [Timer-0] WARN  TaskSchedulerImpl:69 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24-07-07 07:34:29 [Timer-0] WARN  TaskSchedulerImpl:69 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24-07-07 07:34:30 [kubernetes-executor-snapshots-subscribers-0] INFO  ExecutorPodsAllocator:57 - Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 2 running: 1.
24-07-07 07:34:30 [kubernetes-executor-snapshots-subscribers-0] INFO  BasicExecutorFeatureStep:57 - Decommissioning not enabled, skipping shutdown script
...

Script with Apache Spark Image:

    --conf spark.kubernetes.container.image=apache/spark:v3.1.3 \
    ...
    --jars local:///opt/spark/examples/jars/spark-examples_2.12-${SPARK_VERSION}.jar \
    local:///opt/spark/examples/jars/spark-examples_2.12-${SPARK_VERSION}.jar 3000"

Logs with Apache Spark Image:

24-07-07 08:55:06 [dag-scheduler-event-loop] INFO  TaskSchedulerImpl:57 - Killing all running tasks in stage 0: Stage finished
24-07-07 08:55:06 [main] INFO  DAGScheduler:57 - Job 0 finished: reduce at SparkPi.scala:38, took 82.806856 s
Pi is roughly 3.1414868371382894
24-07-07 08:55:06 [main] INFO  AbstractConnector:381 - Stopped Spark@4dd77110{SSL, (ssl, http/1.1)}{127.0.0.1:4440}
24-07-07 08:55:06 [main] INFO  AbstractConnector:381 - Stopped HttpsRedirect@224120e7{HTTP/1.1, (http/1.1)}{127.0.0.1:4040}
24-07-07 08:55:06 [main] INFO  SparkUI:57 - Stopped Spark web UI at https://10.170.58.170:4440
...

Additional Information:

Building the PPML Image:

The PPML image is built using the following script ppml/trusted-bigdata/custom-image/build-custom-image.sh:

#!/bin/bash

export CUSTOM_IMAGE_NAME=registry.bigdata.xdu.com/bigdl-ppml-trusted-bigdata-gramine-custom
export CUSTOM_IMAGE_TAG=2.5.0-SNAPSHOT
export BASE_IMAGE_NAME=bigdl-ppml-trusted-bigdata-gramine-base
export BASE_IMAGE_TAG=2.5.0-SNAPSHOT
export SGX_MEM_SIZE=8G
export SGX_LOG_LEVEL=error
export ENABLE_DCAP_ATTESTATION=false

if [[ "$SGX_MEM_SIZE" == "memory_size_of_sgx_in_custom_image" ]] || [[ "$SGX_LOG_LEVEL" == "log_level_of_sgx_in_custom_image" ]]
then
    echo "Please specific SGX_MEM_SIZE and SGX_LOG_LEVEL."
else
    sudo docker build \
        --network host \
        --build-arg BASE_IMAGE_NAME=${BASE_IMAGE_NAME} \
        --build-arg BASE_IMAGE_TAG=${BASE_IMAGE_TAG} \
        --build-arg SGX_MEM_SIZE=${SGX_MEM_SIZE} \
        --build-arg SGX_LOG_LEVEL=${SGX_LOG_LEVEL} \
        --build-arg ENABLE_DCAP_ATTESTATION=${ENABLE_DCAP_ATTESTATION} \
        -t ${CUSTOM_IMAGE_NAME}:${CUSTOM_IMAGE_TAG} \
        -f ./Dockerfile .
fi
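
For completeness, the executors can only start if every Kubernetes node is able to pull this custom image, so it also has to be pushed to the private registry (assuming registry.bigdata.xdu.com is reachable from all nodes):

sudo docker push registry.bigdata.xdu.com/bigdl-ppml-trusted-bigdata-gramine-custom:2.5.0-SNAPSHOT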

Starting the PPML Image:

The PPML image is started using the following commands:

#!/bin/bash

export K8S_MASTER=k8s://$(sudo kubectl cluster-info | grep 'https.*6443' -o -m 1)
export NFS_INPUT_PATH=/opt/BigDL/data
export KEYS_PATH=/opt/BigDL/keys
export SECURE_PASSWORD_PATH=/opt/BigDL/password
export KUBECONFIG_PATH=/opt/BigDL/k8s/config
export LOCAL_IP=10.170.58.170
export DOCKER_IMAGE=registry.bigdata.xdu.com/bigdl-ppml-trusted-bigdata-gramine-custom:2.5.0-SNAPSHOT

sudo docker run -itd \
    --net=host \
    --name=gramine-bigdata-8g \
    --cpuset-cpus=10 \
    --oom-kill-disable \
    --device=/dev/sgx/enclave \
    --device=/dev/sgx/provision \
    -v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
    -v $KEYS_PATH:/ppml/keys \
    -v $SECURE_PASSWORD_PATH:/ppml/password \
    -v $KUBECONFIG_PATH:/root/.kube/config \
    -v $NFS_INPUT_PATH:/ppml/data \
    -e RUNTIME_SPARK_MASTER=$K8S_MASTER \
    -e RUNTIME_K8S_SPARK_IMAGE=$DOCKER_IMAGE \
    -e RUNTIME_DRIVER_PORT=54321 \
    -e RUNTIME_DRIVER_MEMORY=1g \
    -e LOCAL_IP=$LOCAL_IP \
    $DOCKER_IMAGE bash
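
The submit script is then launched from inside this container, roughly as follows (the script path inside the image is my assumption; adjust it if the script lives elsewhere):

sudo docker exec -it gramine-bigdata-8g bash
# inside the container
bash /ppml/scripts/start-pyspark-pi-on-k8s-client-sgx.sh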

I am a beginner, please help me.

qiyuangong commented 2 months ago

Hi @vi0eros, the PPML image is designed for SGX-enabled platforms (the recommended platform is a 3rd/4th Gen Xeon with SGX enabled in the BIOS). If the example finishes successfully with the apache/spark:v3.1.3 image but fails with the PPML image, please check whether SGX is enabled on your platform and whether it is set up with enough EPC (SGX reserved memory).

BTW, "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources" is just a warning. Spark will keep printing this message until the job gets enough resources.
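
A rough way to verify both (not an official procedure, just a quick sanity check on the host):

# the sgx CPU flag should be present on an SGX-capable CPU with a recent kernel
grep -m1 -o sgx /proc/cpuinfo
# the SGX device nodes should exist when the driver is loaded
ls /dev/sgx* 2>/dev/null
# EPC sections (SGX reserved memory) are usually reported in the kernel log at boot
sudo dmesg | grep -i sgx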

vi0eros commented 2 months ago

@qiyuangong

Thank you for your response. I have verified that I am indeed running on an SGX-enabled platform. However, my EPC memory is somewhat limited. To address this, I have set the following parameter in start-pyspark-pi-on-k8s-client-sgx.sh:

--conf spark.kubernetes.sgx.enabled=false

Specifically, I am wondering if the PPML image can run without SGX enabled, given that my EPC memory is quite limited.
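
As a quick sanity check I assume the exported command could also be run outside the enclave, although I am not sure this is a supported flow:

# run the same spark-submit command without Gramine/SGX (debugging only; my assumption, not the documented path)
bash -c "$sgx_command" 2>&1 | tee spark-pi-client-no-sgx.log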

I have also removed the resources section from spark-executor-template.yaml. The updated content of spark-executor-template.yaml is as follows:

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: spark-executor
    env:
      - name: ATTESTATION
        value: false
      - name: ATTESTATION_URL
        value: your_attestation_url
      - name: MALLOC_ARENA_MAX
        value: 4
    volumeMounts:
      - name: device-plugin
        mountPath: /var/lib/kubelet/device-plugins
      - name: aesm-socket
        mountPath: /var/run/aesmd/aesm.socket
      - name: nfs-storage
        mountPath: /ppml/data
    # Removed resources section
  volumes:
    - name: device-plugin
      hostPath:
        path: /var/lib/kubelet/device-plugins
    - name: aesm-socket
      hostPath:
        path: /var/run/aesmd/aesm.socket
    - name: nfs-storage
      persistentVolumeClaim:
        claimName: nfsvolumeclaim
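
For reference, instead of removing it entirely, the container section could carry a plain CPU/memory request; the values below are placeholders I picked, not the stock template:

    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        cpu: "1"
        memory: 2Gi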

Is it possible to run the PPML image in a non-SGX mode, and if so, are there any additional configurations or adjustments needed to ensure that it functions correctly under these conditions?

Thank you for your assistance.

qiyuangong commented 2 months ago

Hi @vi0eros, BigDL PPML is mainly designed for TEEs (SGX is one of them). Without the protection of a TEE, we cannot ensure the confidentiality and integrity of the main components.

The example and image you chose are designed to run Apache Spark inside Intel SGX. spark.kubernetes.sgx.enabled=false is only for debugging, not for production. If this example is not running in SGX, we cannot ensure the Spark components are fully secured. In that case, switching to an Apache Spark image may be a better choice.

vi0eros commented 2 months ago

Given this information, I have decided to switch to using an Apache Spark image for my needs. This resolves my issue. Thank you for your assistance.

qiyuangong commented 2 months ago

You are welcome. :)