intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray

https://analytics-zoo.readthedocs.io/

Apache License 2.0

[BigDL 2.0] examples on k8s integration tests client mode on new image #23

Closed. piaolaidelangman closed this issue 2 years ago

piaolaidelangman commented 2 years ago

Module	Example	Client Mode
nnframes	ImageInferenceExample.py	Succeed
nnframes	ImageTransferLearningExample.py	Succeed
pytorch	learn/pytorch/cifar10/cifar10.py	Succeed
pytorch	learn/pytorch/fashion_mnist/fashion_mnist.py	Succeed
pytorch	learn/pytorch/super_resolution/super_resolution.py	Succeed
tf	learn/tf/basic_text_classification/basic_text_classification.py	Succeed
tf	learn/tf/transfer_learning/transfer_learning.py	Succeed
tf	learn/tf/inception/inception.py	Succeed
tf	learn/tf/image_segmentation/image_segmentation.py	Succeed
tf2	learn/tf2/yolov3/yoloV3.py	Succeed
torchmodel	torchmodel/train/imagenet/main.py	Succeed
torchmodel	torchmodel/train/mnist/main.py	Succeed
torchmodel	torchmodel/train/resnet_finetune/resnet_finetune.py	Succeed

piaolaidelangman commented 2 years ago

Module automl

autoestimator_pytorch.py

Client Command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client--autoestimator_pytorch \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/automl/autoestimator/autoestimator_pytorch.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/automl/autoestimator/autoestimator_pytorch.py \
  --cluster_mode "spark-submit"

Client Exception

(raylet, ip=172.30.39.4) Traceback (most recent call last):
(raylet, ip=172.30.39.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 22, in <module>
(raylet, ip=172.30.39.4)     import ray.new_dashboard.utils as dashboard_utils
(raylet, ip=172.30.39.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/new_dashboard/utils.py", line 20, in <module>
(raylet, ip=172.30.39.4)     import aiohttp.signals
(raylet, ip=172.30.39.4) ModuleNotFoundError: No module named 'aiohttp.signals'

In virtual env pytf1, pip list:

aiohttp                  3.7.0
aiohttp-cors             0.7.0
aioredis                 1.1.0
aiosignal                1.2.0
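
The traceback above fails on `import aiohttp.signals`, so a quick way to confirm what the executor's interpreter can actually see is to probe for the module directly. The following is a minimal sketch, not part of the original report: `module_available` is a hypothetical helper name, and the script is assumed to be run with the same interpreter the job uses (/usr/local/envs/pytf1/bin/python).

```python
import importlib.util


def module_available(dotted_name):
    """Return True if the current interpreter can locate `dotted_name`."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # find_spec imports the parent package first; this branch is hit
        # when the parent (for example aiohttp itself) cannot be imported.
        return False


if __name__ == "__main__":
    # Module names taken from the traceback and the pip list above.
    for name in ("aiohttp", "aiohttp.signals", "aiosignal"):
        state = "found" if module_available(name) else "missing"
        print(name, "->", state)
```

If `aiohttp` is reported as found but `aiohttp.signals` as missing, the installed aiohttp release does not ship that submodule, which would match the ModuleNotFoundError raised by the Ray dashboard agent.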

piaolaidelangman commented 2 years ago

Module automl

AutoXGBoostClassifier.py

Client Command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client--autoxgboost-classifier \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/automl/autoxgboost/AutoXGBoostClassifier.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/automl/autoxgboost/AutoXGBoostClassifier.py \
  --path /bigdl2.0/data/airline_14col.data \
  --cluster_mode "spark-submit"

Client Exception

Number of trials: 1/4 (1 RUNNING)

(raylet, ip=172.30.27.4) Traceback (most recent call last):
(raylet, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 22, in <module>
(raylet, ip=172.30.27.4)     import ray.new_dashboard.utils as dashboard_utils
(raylet, ip=172.30.27.4)   File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/new_dashboard/utils.py", line 20, in <module>
(raylet, ip=172.30.27.4)     import aiohttp.signals
(raylet, ip=172.30.27.4) ModuleNotFoundError: No module named 'aiohttp.signals'
(pid=235, ip=172.30.27.4) [0]   validation_0-error:0.15600
(pid=235, ip=172.30.27.4) [1]   validation_0-error:0.15600
(pid=235, ip=172.30.27.4) /usr/local/envs/pytf1/lib/python3.7/site-packages/xgboost/sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
(pid=235, ip=172.30.27.4)   warnings.warn(label_encoder_deprecation_msg, UserWarning)
(pid=235, ip=172.30.27.4) /usr/local/envs/pytf1/lib/python3.7/site-packages/sklearn/preprocessing/_label.py:98: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
(pid=235, ip=172.30.27.4)   y = column_or_1d(y, warn=True)
(pid=235, ip=172.30.27.4) /usr/local/envs/pytf1/lib/python3.7/site-packages/sklearn/preprocessing/_label.py:133: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
(pid=235, ip=172.30.27.4)   y = column_or_1d(y, warn=True)
(pid=235, ip=172.30.27.4) [2]   validation_0-error:0.15600
(pid=235, ip=172.30.27.4) [3]   validation_0-error:0.15600

AutoXGBoostRegressor.py fails with the same error.

piaolaidelangman commented 2 years ago

super_resolution.py Module: pytorch

Client Command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client--super_resolution \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/pytorch/super_resolution/super_resolution.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/pytorch/super_resolution/super_resolution.py \
  --cluster_mode "spark-submit"

Client Exception

creating: createMaxEpoch
2021-11-03 08:30:34 ERROR TaskSetManager:73 - Task 1 in stage 1.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/pytorch/cifar10/cifar10.py", line 151, in <module>
    checkpoint_trigger=EveryEpoch())
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/pytorch/estimator.py", line 398, in fit
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/pytorch/estimator.py", line 324, in _handle_data_loader
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/feature/common.py", line 389, in pytorch_dataloader
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o61.createFeatureSetFromPyTorch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 16) (172.30.39.4 executor 1):
 java.lang.RuntimeException: PYTHONHOME is unset, please set PYTHONHOME first.

Run:

echo $PYTHONHOME

Output:

/usr/local/envs/pytf1

The cifar10.py and fashion_mnist.py examples fail with the same exception (the traceback above was captured from cifar10.py).

Solution

Add this line to the spark-submit command so that PYTHONHOME is set on the executors:

  --conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
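
Spark passes any `spark.executorEnv.<NAME>` setting to the executor processes as an environment variable, which is what the fix above relies on. When several test commands need the same variables, the options can be generated instead of pasted by hand. This is a small sketch under that assumption; `executor_env_conf` is a hypothetical helper, not something the test scripts in this issue are known to use.

```python
def executor_env_conf(env):
    """Turn {"NAME": "value"} into spark-submit arguments.

    Each entry becomes the pair: --conf spark.executorEnv.NAME=value
    """
    args = []
    for name, value in sorted(env.items()):
        args += ["--conf", "spark.executorEnv.{}={}".format(name, value)]
    return args


# Path taken from the `echo $PYTHONHOME` output reported above.
extra = executor_env_conf({"PYTHONHOME": "/usr/local/envs/pytf1"})
print(" ".join(extra))
# prints: --conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1
```

The resulting list can be appended to the argument list of a `subprocess` call that launches spark-submit, or joined into a shell command as shown.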

piaolaidelangman commented 2 years ago

transfer_learning.py Module: tf

Client Command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client--transfer_learning \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/transfer_learning/transfer_learning.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/transfer_learning/transfer_learning.py \
  --cluster_mode "spark-submit"

Client Exception

BigDLBasePickler registering: bigdl.dllib.utils.common  JActivity
Total training cat images: 1000
Total training dog images: 1000
Total validation cat images: 500
Total validation dog images: 500
Traceback (most recent call last):
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/transfer_learning/transfer_learning.py", line 99, in <module>
    builder = tfds.ImageFolder(base_dir)
AttributeError: module 'tensorflow_datasets' has no attribute 'ImageFolder'
Stopping orca context

In virtual env pytf1, pip list:

tensorflow-datasets      2.0.0

Solution

Requires tensorflow-datasets==3.2.0 and h5py<3.0.0; both have been updated in the Dockerfile.
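
To catch this kind of mismatch before a job is submitted, the two constraints named above can be checked against the output of pip list. The sketch below is illustrative only: `unmet` and `parse_version` are hypothetical helpers, only plain dotted numeric versions are handled, and the h5py version in the example is made up for demonstration.

```python
# Constraints as stated in the solution above.
REQUIREMENTS = {
    "tensorflow-datasets": ("==", "3.2.0"),
    "h5py": ("<", "3.0.0"),
}


def parse_version(text):
    """Parse a plain dotted version such as "3.2.0" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfied(installed, op, required):
    have, want = parse_version(installed), parse_version(required)
    return have == want if op == "==" else have < want


def unmet(installed_versions):
    """Return a description of every requirement that is not satisfied."""
    problems = []
    for name, (op, required) in REQUIREMENTS.items():
        have = installed_versions.get(name)
        if have is None or not satisfied(have, op, required):
            problems.append("{}{}{} (found {})".format(name, op, required, have))
    return problems


# tensorflow-datasets 2.0.0 is the version reported above;
# the h5py version here is a hypothetical value.
print(unmet({"tensorflow-datasets": "2.0.0", "h5py": "2.10.0"}))
# prints: ['tensorflow-datasets==3.2.0 (found 2.0.0)']
```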

ManfeiBai commented 2 years ago

inception.py Module: tf

Client Command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client--basic_text_classification \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/inception/inception.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/inception/inception.py \
  --folder /tmp/imagenet_to_tfrecord \
  --imagenet /tmp/imagenettfrecord/tfrecord \
  --cluster_mode yarn --worker_num 4 \
  --cores 54 --memory 175G --batchSize 1792 \
  --maxIteration 62000 --maxEpoch 100 --learningRate 0.0896 \
  --checkpoint /tmp/models/inception \
  --cluster_mode "spark-submit"

Client Exception

2021-11-04 01:22:09 INFO  DistriOptimizer$:162 - Count dataset
2021-11-04 01:22:10 ERROR TaskSetManager:73 - Task 0 in stage 7.0 failed 4 times; aborting job
2021-11-04 01:22:10 ERROR DistriOptimizer$:1293 - Error: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethod(KerasUtils.scala:302)
    at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethodWithEv(KerasUtils.scala:329)
    at com.intel.analytics.bigdl.dllib.keras.models.InternalOptimizerUtil$.optimizeModels(Topology.scala:1068)
    at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1268)
    at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1481)
    at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1151)
    at com.intel.analytics.bigdl.dllib.estimator.Estimator.train(Estimator.scala:191)
    at com.intel.analytics.bigdl.dllib.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:119)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 41) (172.30.27.4 executor 1): org.tensorflow.TensorFlowException: /tmp/imagenettfrecord/tfrecord/train/train-00000-of-01024; No such file or directory
     [[{{node IteratorGetNext}}]]
    at org.tensorflow.Session.run(Native Method)
    at org.tensorflow.Session.access$100(Session.java:48)
    at org.tensorflow.Session$Runner.runHelper(Session.java:326)
    at org.tensorflow.Session$Runner.run(Session.java:276)
    at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.$anonfun$run$5(GraphRunner.scala:133)
    at com.intel.analytics.bigdl.dllib.common.zooUtils$.timeIt(zooUtils.scala:42)
    at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.$anonfun$run$1(GraphRunner.scala:133)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.intel.analytics.bigdl.dllib.common.zooUtils$.timeIt(zooUtils.scala:42)
    at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.run(GraphRunner.scala:113)
    at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.runOutputs(GraphRunner.scala:102)
    at com.intel.analytics.bigdl.orca.tfpark.TFDataFeatureSet$$anon$2.getNext(TFDataFeatureSet.scala:233)
    at com.intel.analytics.bigdl.orca.tfpark.TFDataFeatureSet$$anon$2.hasNext(TFDataFeatureSet.scala:221)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.rdd.RDD.$anonfun$reduce$2(RDD.scala:1105)
    at org.apache.spark.SparkContext.$anonfun$runJob$6(SparkContext.scala:2290)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2291)
    at org.apache.spark.rdd.RDD.$anonfun$reduce$1(RDD.scala:1120)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1102)
    at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:164)
    ... 23 more
Caused by: org.tensorflow.TensorFlowException: /tmp/imagenettfrecord/tfrecord/train/train-00000-of-01024; No such file or directory
     [[{{node IteratorGetNext}}]]
    at org.tensorflow.Session.run(Native Method)
    at org.tensorflow.Session.access$100(Session.java:48)
    at org.tensorflow.Session$Runner.runHelper(Session.java:326)
    at org.tensorflow.Session$Runner.run(Session.java:276)
    at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.$anonfun$run$5(GraphRunner.scala:133)
    at com.intel.analytics.bigdl.dllib.common.zooUtils$.timeIt(zooUtils.scala:42)
    at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.$anonfun$run$1(GraphRunner.scala:133)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.intel.analytics.bigdl.dllib.common.zooUtils$.timeIt(zooUtils.scala:42)
    at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.run(GraphRunner.scala:113)
    at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.runOutputs(GraphRunner.scala:102)
    at com.intel.analytics.bigdl.orca.tfpark.TFDataFeatureSet$$anon$2.getNext(TFDataFeatureSet.scala:233)
    at com.intel.analytics.bigdl.orca.tfpark.TFDataFeatureSet$$anon$2.hasNext(TFDataFeatureSet.scala:221)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.rdd.RDD.$anonfun$reduce$2(RDD.scala:1105)
    at org.apache.spark.SparkContext.$anonfun$runJob$6(SparkContext.scala:2290)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

2021-11-04 01:22:10 INFO  DistriOptimizer$:1307 - Retrying 1 times
Traceback (most recent call last):
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/inception/inception.py", line 282, in <module>
    checkpoint_trigger=checkpoint_trigger)
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/estimator.py", line 593, in fit
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_optimizer.py", line 776, in optimize
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/estimator/estimator.py", line 167, in train_minibatch
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o75.estimatorTrainMiniBatch.
: java.lang.NullPointerException
    at com.intel.analytics.bigdl.dllib.optim.AbstractOptimizer.clearState(AbstractOptimizer.scala:241)
    at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer.clearState(DistriOptimizer.scala:757)
    at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1311)
    at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1481)
    at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1151)
    at com.intel.analytics.bigdl.dllib.estimator.Estimator.train(Estimator.scala:191)
    at com.intel.analytics.bigdl.dllib.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:119)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Stopping orca context
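
The root cause in the trace above is that the executor could not open /tmp/imagenettfrecord/tfrecord/train/train-00000-of-01024. A pre-flight check that lists missing shard files can surface this before training starts. The sketch below assumes the shard naming pattern `<prefix>-NNNNN-of-NNNNN`, inferred from that single file name, and it only inspects the filesystem of the machine or pod it runs on; `missing_shards` is a hypothetical helper.

```python
import os


def expected_shards(prefix, total):
    """Build names like train-00000-of-01024 (pattern inferred from the trace)."""
    return ["{}-{:05d}-of-{:05d}".format(prefix, i, total) for i in range(total)]


def missing_shards(directory, prefix, total):
    """Return the expected shard names that are not present in `directory`."""
    return [
        name
        for name in expected_shards(prefix, total)
        if not os.path.isfile(os.path.join(directory, name))
    ]


if __name__ == "__main__":
    # Directory and shard count taken from the exception message above.
    missing = missing_shards("/tmp/imagenettfrecord/tfrecord/train", "train", 1024)
    print("missing shards:", len(missing))
```

Since the check is local, it is only meaningful when run where the data is actually read, for example inside an executor pod.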

piaolaidelangman commented 2 years ago

torchmodel/imagenet

Client command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=172.16.0.200 \
  --conf spark.driver.port=54321 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name analytics-zoo-autoestimator \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py \
  /bigdl2.0/data/imagenet

Client exception

Traceback (most recent call last):
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py", line 153, in <module>
    main()
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py", line 149, in main
    validation_method=[Accuracy(), Top5Accuracy()])
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/estimator/estimator.py", line 167, in train_minibatch
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o76.estimatorTrainMiniBatch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 8) (Almaren-Node-200 executor driver): 
jep.JepException: jep.JepException: <class 'AttributeError'>: module 'types' has no attribute 'ClassType'
        at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)

The examples torchmodel/resnet_finetune and torchmodel/mnist fail with the same exception.
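types.ClassType existed only in Python 2, so the AttributeError above suggests that some code executed by the embedded interpreter still expects Python 2. That is an inference from the message, not something confirmed in this thread. A small diagnostic sketch to run with the same Python binary passed to spark.pyspark.python (describe_interpreter is a hypothetical helper, not part of BigDL):

```python
import sys
import types


def describe_interpreter() -> dict:
    """Report the interpreter in use and whether the Python 2-only
    attribute types.ClassType is available."""
    return {
        "executable": sys.executable,
        "major_version": sys.version_info[0],
        "has_ClassType": hasattr(types, "ClassType"),
    }


info = describe_interpreter()
print(info)
```

If has_ClassType is False, any library that references types.ClassType fails in this interpreter regardless of how Spark is configured.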

ManfeiBai commented 2 years ago

image_segmentation.py

command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client--basic_text_classification \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/inception/inception.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py \
  --cluster_mode "spark-submit"

Exception

Traceback (most recent call last):
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py", line 223, in <module>
    args.non_interactive)
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py", line 169, in main
    epochs=max_epoch)
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/estimator.py", line 871, in fit
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/estimator.py", line 397, in to_dataset
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/utils.py", line 54, in xshards_to_tf_dataset
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 381, in from_rdd
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 1162, in from_rdd
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 1083, in __init__
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 153, in __init__
ValueError: batch_size should be a multiple of total core number, but got batch_size: 8 where total core number is 64
Build step 'Execute shell' marked build as failure
Finished: FAILURE
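The ValueError above is raised because batch_size (8) is not a multiple of the total core number (64). A minimal sketch of choosing a valid value, assuming the total core count is known before the estimator is created; round_up_batch_size is a hypothetical helper, not a BigDL function:

```python
def round_up_batch_size(batch_size: int, total_cores: int) -> int:
    """Return the smallest multiple of total_cores that is >= batch_size."""
    if batch_size <= 0 or total_cores <= 0:
        raise ValueError("batch_size and total_cores must be positive")
    # Ceiling division, then scale back up to a multiple of total_cores.
    return -(-batch_size // total_cores) * total_cores


# Values from the log above: batch_size 8, total core number 64.
adjusted = round_up_batch_size(8, 64)
print(adjusted)
```

Lowering the total executor cores so that they divide the batch size would satisfy the same check without changing the batch size.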

piaolaidelangman commented 2 years ago

pytorch/cifar10.py

Client Command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client--cifar10 \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/pytorch/cifar10/cifar10.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/pytorch/cifar10/cifar10.py \
  --cluster_mode "spark-submit"

Client Exception

creating: createEveryEpoch
creating: createMaxEpoch
2021-11-05 02:18:09 ERROR TaskSetManager:73 - Task 2 in stage 1.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/pytorch/cifar10/cifar10.py", line 152, in <module>
    checkpoint_trigger=EveryEpoch())
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/pytorch/estimator.py", line 398, in fit
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/pytorch/estimator.py", line 324, in _handle_data_loader
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/feature/common.py", line 389, in pytorch_dataloader
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o61.createFeatureSetFromPyTorch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 16) (172.30.27.4 executor 1): 
jep.JepException: jep.JepException: <class 'ModuleNotFoundError'>: No module named 'bigdl.orca'
        at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
        ............

The examples super_resolution.py and fashion_mnist.py in this module fail with the same exception.
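The ModuleNotFoundError above is raised inside the Python interpreter that jep embeds in the executor, so what matters is the import path on the executor side, not on the driver. A diagnostic sketch, assuming it is run with the executor's Python binary; missing_modules is a hypothetical helper and the module names are only examples:

```python
import importlib.util


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # find_spec raises when a parent package (for example 'bigdl'
            # for 'bigdl.orca') is absent or a module has no usable spec.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


print(missing_modules(["bigdl.orca", "pyspark"]))
```

An empty list means the modules can be located by that interpreter; any name that is printed is not on its import path.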

ManfeiBai commented 2 years ago

tf2/resnet/resnet-50-imagenet.py

command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client--basic_text_classification \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf2/resnet/resnet-50-imagenet.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf2/resnet/resnet-50-imagenet.py \
  --cluster_mode standalone --worker_num 8 --cores 17 \
  --data_dir /tmp/imagenettfrecord/tfrecord --use_bf16 \
  --enable_numa_binding \
  --cluster_mode "spark-submit"

Exception

Traceback (most recent call last):
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf2/resnet/resnet-50-imagenet.py", line 371, in <module>
    enable_numa_binding=args.enable_numa_binding)
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/common.py", line 268, in init_orca_context
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/ray/raycontext.py", line 540, in init
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/ray/raycontext.py", line 568, in _start_cluster
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 949, in collect
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(1, 0) finished unsuccessfully.
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 586, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/__init__.py", line 21, in <module>
    prepare_env()
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 171, in prepare_env
    __prepare_analytics_zoo_env()
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 74, in __prepare_analytics_zoo_env
    analytics_zoo_classpath = get_analytics_zoo_classpath()
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 116, in get_analytics_zoo_classpath
    raise ValueError("Path {} specified BIGDL_CLASSPATH does not exist.".format(path))
ValueError: Path /opt/bigdl-0.14.0-SNAPSHOT/jars/*.jar specified BIGDL_CLASSPATH does not exist.

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:635)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
    at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
    at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
    at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
    at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
    at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1968)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2442)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
    at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
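The ValueError in the trace above reports that the wildcard path /opt/bigdl-0.14.0-SNAPSHOT/jars/*.jar does not exist. One possible reading, not confirmed in this thread, is that the path is checked literally, in which case a wildcard entry never matches. The sketch below shows the difference between a literal existence check and a glob expansion; expand_classpath is a hypothetical helper and the temporary directory only stands in for the jars directory:

```python
import glob
import os
import tempfile


def expand_classpath(classpath: str) -> list:
    """Split a ':'-separated classpath and expand wildcard entries.

    Entries without a match are dropped, so an empty result means that
    nothing on this machine satisfies the classpath.
    """
    resolved = []
    for entry in classpath.split(":"):
        # glob.glob returns the entry itself for an existing literal path
        # and the matching files for a wildcard pattern.
        resolved.extend(sorted(glob.glob(entry)))
    return resolved


# Demonstration on a throwaway directory standing in for the jars directory.
with tempfile.TemporaryDirectory() as jars_dir:
    for name in ("a.jar", "b.jar"):
        open(os.path.join(jars_dir, name), "w").close()
    pattern = os.path.join(jars_dir, "*.jar")
    literal_exists = os.path.exists(pattern)  # wildcard is not expanded
    expanded = expand_classpath(pattern)      # wildcard is expanded

print(literal_exists, len(expanded))
```

Running the same expansion on the executor would also show whether the jars are present there at all.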

piaolaidelangman commented 2 years ago

tf/transfer_learning.py

Client Command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client-transfer_learning \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/transfer_learning/transfer_learning.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/transfer_learning/transfer_learning.py \
  --cluster_mode "spark-submit"

Client Exception

creating: createMaxEpoch
creating: createEveryEpoch
2021-11-05 02:54:41 INFO  DistriOptimizer$:824 - caching training rdd ...
2021-11-05 02:54:41 INFO  DistriOptimizer$:650 - Cache thread models...
2021-11-05 02:54:43 INFO  DistriOptimizer$:652 - Cache thread models... done
2021-11-05 02:54:43 INFO  DistriOptimizer$:162 - Count dataset
2021-11-05 02:54:44 ERROR TaskSetManager:73 - Task 1 in stage 9.0 failed 4 times; aborting job
2021-11-05 02:54:44 ERROR DistriOptimizer$:1293 - Error: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethod(KerasUtils.scala:302)
        at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethodWithEv(KerasUtils.scala:329)
        at com.intel.analytics.bigdl.dllib.keras.models.InternalOptimizerUtil$.optimizeModels(Topology.scala:1068)
        at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1268)
        at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1481)
        at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1151)
        at com.intel.analytics.bigdl.dllib.estimator.Estimator.train(Estimator.scala:191)
        at com.intel.analytics.bigdl.dllib.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:119)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 9.0 failed 4 times, most recent failure: Lost task 1.3 in stage 9.0 (TID 48) (172.30.39.4 executor 1): org.tensorflow.TensorFlowException: {{function_node __inference_Dataset_map__load_example_102}} 
**./datasets/cats_and_dogs_filtered/train/dogs/dog.807.jpg; No such file or directory**
         [[{{node ReadFile}}]]
         [[IteratorGetNext]]
        at org.tensorflow.Session.run(Native Method)
        at org.tensorflow.Session.access$100(Session.java:48)
        ................

However, running ll ./datasets/cats_and_dogs_filtered/train/dogs/dog.807.jpg prints -rw-r--r-- 1 root root 20189 Nov 5 02:53 ./datasets/cats_and_dogs_filtered/train/dogs/dog.807.jpg, so the file does exist.

Solution

The dataset path needs to be the NFS path /bigdl2.0/data/datasets.
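The failing task ran on an executor (172.30.39.4 executor 1), which probably cannot see a relative path that exists only where the driver runs, while the volume mounted at /bigdl2.0/data is visible to both driver and executors. A minimal sketch of mapping a relative dataset path onto that mount; to_shared_path is a hypothetical helper and assumes the dataset has been copied under the mount with the same layout:

```python
import posixpath

# Mount path of the persistent volume claim in the commands above.
SHARED_MOUNT = "/bigdl2.0/data"


def to_shared_path(relative_path: str, mount: str = SHARED_MOUNT) -> str:
    """Map a path relative to the working directory onto the shared mount."""
    cleaned = posixpath.normpath(relative_path)
    if posixpath.isabs(cleaned) or cleaned.startswith(".."):
        raise ValueError("expected a path relative to the dataset root")
    return posixpath.join(mount, cleaned)


shared = to_shared_path("./datasets/cats_and_dogs_filtered/train/dogs/dog.807.jpg")
print(shared)
```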

piaolaidelangman commented 2 years ago

torchmodel/imagenet

Client Command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=172.16.0.200 \
  --conf spark.driver.port=54321 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client-torchmodel-imagenet \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py \
  /bigdl2.0/data/imagenet

Client Exception

creating: createTorchLoss
creating: createEstimator
2021-11-05 03:04:14 ERROR Executor:94 - Exception in task 0.0 in stage 1.0 (TID 1)
jep.JepException: jep.JepException: <class 'ModuleNotFoundError'>: No module named 'pyspark'
        at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
        at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.exec(PythonInterpreter.scala:108)
        at com.intel.analytics.bigdl.orca.net.PythonFeatureSet$.$anonfun$loadPythonSet$1(PythonFeatureSet.scala:90)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: jep.JepException: <class 'ModuleNotFoundError'>: No module named 'pyspark'
        at <string>.<module>(<string>:2)
        at jep.Jep.exec(Native Method)
        at jep.Jep.exec(Jep.java:478)
        at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.$anonfun$exec$1(PythonInterpreter.scala:106)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
        at scala.util.Success.$anonfun$map$1(Try.scala:255)
        at scala.util.Success.map(Try.scala:213)
        at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
        at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
        at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more
2021-11-05 03:04:14 ERROR TaskSetManager:73 - Task 0 in stage 1.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py", line 153, in <module>
    main()
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py", line 145, in main
    train_featureSet = FeatureSet.pytorch_dataloader(train_loader)
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/feature/common.py", line 389, in pytorch_dataloader
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.createFeatureSetFromPyTorch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (Almaren-Node-200 executor driver): 
jep.JepException: jep.JepException: <class 'ModuleNotFoundError'>: No module named 'pyspark'
        at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
          ..............

The torchmodel/mnist example fails with the same exception.

Solution

source activate pytf1
export PYTHONHOME=/usr/local/envs/pytf1

piaolaidelangman commented 2 years ago

torchmodel/resnet_finetune.py

Client Command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=172.16.0.200 \
  --conf spark.driver.port=54321 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client-torchmodel-resnet_finetune \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/resnet_finetune/resnet_finetune.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/resnet_finetune/resnet_finetune.py \
  /bigdl2.0/data/dogscats

Client Exception

Traceback (most recent call last):
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/resnet_finetune/resnet_finetune.py", line 104, in <module>
    catdogModel = classifier.fit(trainingDF)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 161, in fit
  File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 335, in _fit
  File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 332, in _fit_java
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o136.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 7) (Almaren-Node-200 executor driver): 
jep.JepException: jep.JepException: <class 'ModuleNotFoundError'>: No module named 'bigdl'

Solution

source activate pytf1
export PYTHONHOME=/usr/local/envs/pytf1
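
Both failures above are ModuleNotFoundError ('pyspark', then 'bigdl') raised from the embedded interpreter. A quick way to confirm that the activated environment can locate those modules is a check like the one below. This is only a diagnostic sketch: the helper name `missing_modules` is hypothetical, and it inspects the interpreter that runs the script, which is not necessarily the one JEP embeds.

```python
import importlib.util
import os
import sys


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Show which interpreter and PYTHONHOME are in effect after activation.
    print("interpreter:", sys.executable)
    print("PYTHONHOME:", os.environ.get("PYTHONHOME"))
    # The two modules named in the exceptions above.
    print("missing:", missing_modules(["pyspark", "bigdl"]))
```

Running it with /usr/local/envs/pytf1/bin/python before and after the two commands above shows whether the environment change affects module lookup.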

ManfeiBai commented 2 years ago

tf2/yolov3/yoloV3.py

command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client--basic_text_classification \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf2/yolov3/yoloV3.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf2/yolov3/yoloV3.py \
  --data_dir /bigdl2.0/data/yolov3 \
  --weights /bigdl2.0/data/yolov3/yolov3.weights \
  --class_num 2 \
  --names /bigdl2.0/data/yolov3/voc2012.names \
  --cluster_mode "spark-submit"

Exception

2021-11-05 04:45:57 ERROR TaskSetManager:73 - Task 0 in stage 0.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf2/yolov3/yoloV3.py", line 695, in <module>
    main()
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf2/yolov3/yoloV3.py", line 638, in main
    splits_names=[(options.data_year, options.split_name_train)], classes=class_map)
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/data/image/parquet_dataset.py", line 337, in write_parquet
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/data/image/parquet_dataset.py", line 318, in write_voc
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/data/image/parquet_dataset.py", line 74, in write
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 675, in createDataFrame
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 698, in _create_dataframe
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 486, in _createFromRDD
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 460, in _inferSchema
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1586, in first
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1566, in take
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 1233, in runJob
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (172.30.39.4 executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 586, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/__init__.py", line 21, in <module>
    prepare_env()
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 171, in prepare_env
    __prepare_analytics_zoo_env()
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 74, in __prepare_analytics_zoo_env
    analytics_zoo_classpath = get_analytics_zoo_classpath()
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 116, in get_analytics_zoo_classpath
    raise ValueError("Path {} specified BIGDL_CLASSPATH does not exist.".format(path))
ValueError: Path /opt/bigdl-0.14.0-SNAPSHOT/jars/*.jar specified BIGDL_CLASSPATH does not exist.

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:635)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
    at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
    at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
    at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
    at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
    at org.apache.spark.api.python.PythonRDD$.$anonfun$runJob$1(PythonRDD.scala:166)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
    at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 586, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/__init__.py", line 21, in <module>
    prepare_env()
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 171, in prepare_env
    __prepare_analytics_zoo_env()
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 74, in __prepare_analytics_zoo_env
    analytics_zoo_classpath = get_analytics_zoo_classpath()
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 116, in get_analytics_zoo_classpath
    raise ValueError("Path {} specified BIGDL_CLASSPATH does not exist.".format(path))
ValueError: Path /opt/bigdl-0.14.0-SNAPSHOT/jars/*.jar specified BIGDL_CLASSPATH does not exist.

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:635)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
    at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
    at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
    at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
    at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
    at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
    at org.apache.spark.api.python.PythonRDD$.$anonfun$runJob$1(PythonRDD.scala:166)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
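
The ValueError above reports that the path /opt/bigdl-0.14.0-SNAPSHOT/jars/*.jar does not exist on the executor. To see how such an entry resolves on a given node, a small check like the following can be run inside the executor container. It is a sketch under assumptions: the helper name `describe_classpath_entry` is hypothetical, and this is not the check BigDL itself performs; it only reports whether the literal string exists as a path and which files the wildcard matches.

```python
import glob
import os


def describe_classpath_entry(entry):
    """Report how a single classpath entry resolves on the local filesystem."""
    return {
        "entry": entry,
        # True only if a file or directory with exactly this name exists.
        "exists_literally": os.path.exists(entry),
        # Files matched when the entry is treated as a shell-style pattern.
        "glob_matches": sorted(glob.glob(entry)),
    }


if __name__ == "__main__":
    # BIGDL_CLASSPATH is the variable named in the error; it may hold several entries.
    for entry in os.environ.get("BIGDL_CLASSPATH", "").split(os.pathsep):
        if entry:
            print(describe_classpath_entry(entry))
```

An empty `glob_matches` list would mean there are no jars at that location in the container. A non-empty list combined with `exists_literally` being False would mean the jars are present and only the literal wildcard string fails an existence test.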

Le-Zheng commented 2 years ago

image_segmentation.py

command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client--basic_text_classification \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/inception/inception.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py \
  --cluster_mode "spark-submit"

Exception

Traceback (most recent call last):
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py", line 223, in <module>
    args.non_interactive)
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py", line 169, in main
    epochs=max_epoch)
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/estimator.py", line 871, in fit
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/estimator.py", line 397, in to_dataset
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/utils.py", line 54, in xshards_to_tf_dataset
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 381, in from_rdd
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 1162, in from_rdd
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 1083, in __init__
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 153, in __init__
ValueError: batch_size should be a multiple of total core number, but got batch_size: 8 where total core number is 64
Build step 'Execute shell' marked build as failure
Finished: FAILURE

Solution

Set batch_size to a multiple of the total core number (64, or 64*n), or change the total core number so that batch_size is a multiple of it.
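
The constraint in the error message (batch_size must be a multiple of the total core number) can be met by rounding the requested batch size up to the next multiple. The helper below is an illustrative sketch; `round_up_to_multiple` is a hypothetical name and is not part of BigDL.

```python
def round_up_to_multiple(batch_size, total_cores):
    """Return the smallest multiple of total_cores that is >= batch_size."""
    if batch_size <= 0 or total_cores <= 0:
        raise ValueError("batch_size and total_cores must be positive")
    # Ceiling division, then scale back up to a multiple of total_cores.
    return -(-batch_size // total_cores) * total_cores


if __name__ == "__main__":
    # Values from the error above: batch_size 8 with 64 total cores.
    print(round_up_to_multiple(8, 64))    # 64
    print(round_up_to_multiple(100, 64))  # 128
```

Rounding up changes the effective batch size, so hyperparameters tuned for the original value may need another look; lowering the total core number is the alternative when the original batch size has to be kept.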

ManfeiBai commented 2 years ago

image_segemnetation.py

command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client--basic_text_classification \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/inception/inception.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py \
  --cluster_mode "spark-submit"

Exception

Traceback (most recent call last):
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py", line 223, in <module>
    args.non_interactive)
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py", line 169, in main
    epochs=max_epoch)
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/estimator.py", line 871, in fit
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/estimator.py", line 397, in to_dataset
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/utils.py", line 54, in xshards_to_tf_dataset
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 381, in from_rdd
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 1162, in from_rdd
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 1083, in __init__
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 153, in __init__
ValueError: batch_size should be a multiple of total core number, but got batch_size: 8 where total core number is 64
Build step 'Execute shell' marked build as failure
Finished: FAILURE

Solution

Set batch_size to a multiple of 64 (64 or 64*n), or change the total core number so that it divides batch_size.
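
As a minimal sketch of the rule stated in the error message, the helper below rounds a requested batch size up to the nearest multiple of the total core number. `round_up_batch_size` is a hypothetical name used for illustration only; it is not part of BigDL.

```python
def round_up_batch_size(batch_size: int, total_cores: int) -> int:
    """Return the smallest multiple of total_cores that is >= batch_size.

    Hypothetical helper (not a BigDL API): it only mirrors the constraint
    "batch_size should be a multiple of total core number".
    """
    if batch_size <= 0 or total_cores <= 0:
        raise ValueError("batch_size and total_cores must be positive")
    # Ceiling division, then scale back up to a multiple of total_cores.
    return ((batch_size + total_cores - 1) // total_cores) * total_cores


if __name__ == "__main__":
    # Values taken from the failure above: batch_size 8, 64 total cores.
    print(round_up_batch_size(8, 64))
```

With the values from the failing run (batch_size 8, 64 total cores) this yields 64, which matches the `--batch_size 64` used in the follow-up command.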

That problem is fixed. The example now fails with a different error:

Command

${SPARK_HOME}/bin/spark-submit \
  --master ${RUNTIME_SPARK_MASTER} \
  --deploy-mode client \
  --conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
  --conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
  --name test-bigdl2-client--basic_text_classification \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
  --conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
  --conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
  --executor-cores ${RUNTIME_EXECUTOR_CORES} \
  --executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
  --total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
  --driver-cores ${RUNTIME_DRIVER_CORES} \
  --driver-memory ${RUNTIME_DRIVER_MEMORY} \
  --properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
  --py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/inception/inception.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
  --conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
  local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py \
  --batch_size 64 \
  --file_path /bigdl2.0/data/carvana \
  --cluster_mode "spark-submit"

Exception

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 586, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
    return self.loads(obj)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
    return pickle.loads(obj, encoding=encoding)
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/__init__.py", line 21, in <module>
    prepare_env()
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 171, in prepare_env
    __prepare_analytics_zoo_env()
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 74, in __prepare_analytics_zoo_env
    analytics_zoo_classpath = get_analytics_zoo_classpath()
  File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 116, in get_analytics_zoo_classpath
    raise ValueError("Path {} specified BIGDL_CLASSPATH does not exist.".format(path))
ValueError: Path /opt/bigdl-0.14.0-SNAPSHOT/jars/* specified BIGDL_CLASSPATH does not exist.
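
One possible reading of this error, assuming the check inside `get_analytics_zoo_classpath` is a literal existence test (the source of that function is not shown in this thread): the `*` in `/opt/bigdl-0.14.0-SNAPSHOT/jars/*` is treated as part of a file name instead of being expanded. The sketch below only illustrates the difference between a literal check and glob expansion; the temporary directory and the jar name are made up.

```python
import glob
import os
import tempfile


def literal_exists(path: str) -> bool:
    # A literal check: "*" is just a character in the file name here.
    return os.path.exists(path)


def expand_classpath(pattern: str) -> list:
    # Glob expansion: "*" matches every entry in the directory.
    return sorted(glob.glob(pattern))


with tempfile.TemporaryDirectory() as jars_dir:
    # Create one empty placeholder jar (hypothetical name).
    jar_path = os.path.join(jars_dir, "example.jar")
    open(jar_path, "w").close()

    pattern = os.path.join(jars_dir, "*")
    literal_result = literal_exists(pattern)
    expanded = expand_classpath(pattern)

    print("literal check:", literal_result)
    print("glob expansion finds", len(expanded), "file(s)")
```

If this reading is right, pointing BIGDL_CLASSPATH at concrete jar files rather than a wildcard may avoid the error; this has not been verified against the test image.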

ManfeiBai commented 2 years ago

tf and tf2 tests in cluster mode: success. http://10.112.231.51:18888/view/BigDL-2.0-NB/job/BigDL2.0-K8s-ExampleTests-Part3-Cluster/1/console

ManfeiBai commented 2 years ago

tf and tf2 tests in client mode (without resnet): http://10.112.231.51:18888/view/BigDL-2.0-NB/job/BigDL2.0-K8s-ExampleTests-Part3/