Closed piaolaidelangman closed 2 years ago
Module | Example | Client Mode |
---|---|---|
nnframes | ImageInferenceExample.py | Succeed |
nnframes | ImageTransferLearningExample.py | Succeed |
pytorch | learn/pytorch/cifar10/cifar10.py | Succeed |
pytorch | learn/pytorch/fashion_mnist/fashion_mnist.py | Succeed |
pytorch | learn/pytorch/super_resolution/super_resolution.py | Succeed |
tf | learn/tf/basic_text_classification/basic_text_classification.py | Succeed |
tf | learn/tf/transfer_learning/transfer_learning.py | Succeed |
tf | learn/tf/inception/inception.py | Succeed |
tf | learn/tf/image_segmentation/image_segmentation.py | Succeed |
tf2 | learn/tf2/yolov3/yoloV3.py | Succeed |
torchmodel | torchmodel/train/imagenet/main.py | Succeed |
torchmodel | torchmodel/train/mnist/main.py | Succeed |
torchmodel | torchmodel/train/resnet_finetune/resnet_finetune.py | Succeed |
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name test-bigdl2-client--autoestimator_pytorch \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/automl/autoestimator/autoestimator_pytorch.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/automl/autoestimator/autoestimator_pytorch.py \
--cluster_mode "spark-submit"
(raylet, ip=172.30.39.4) Traceback (most recent call last):
(raylet, ip=172.30.39.4) File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 22, in <module>
(raylet, ip=172.30.39.4) import ray.new_dashboard.utils as dashboard_utils
(raylet, ip=172.30.39.4) File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/new_dashboard/utils.py", line 20, in <module>
(raylet, ip=172.30.39.4) import aiohttp.signals
(raylet, ip=172.30.39.4) ModuleNotFoundError: No module named 'aiohttp.signals'
In the pytf1 virtual environment, pip list shows:
aiohttp 3.7.0
aiohttp-cors 0.7.0
aioredis 1.1.0
aiosignal 1.2.0
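The traceback above comes from the raylet on an executor, so it can help to confirm what that interpreter can actually import. Below is a minimal diagnostic sketch, not part of BigDL or Ray. The helper name `module_available` is hypothetical, and the script assumes it is run with the same `/usr/local/envs/pytf1/bin/python` that the executor uses.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if `dotted_name` can be located by the current interpreter."""
    try:
        # For a dotted name, find_spec imports the parent package first.
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # Raised when a parent package (for example `aiohttp` itself) is missing.
        return False


if __name__ == "__main__":
    # Modules taken from the traceback and the pip list above.
    for name in ("aiohttp", "aiohttp.signals", "aiosignal"):
        status = "found" if module_available(name) else "MISSING"
        print(f"{name}: {status}")
```

If `aiohttp` is found but `aiohttp.signals` is not, the installed aiohttp release likely does not ship the submodule that this Ray dashboard agent imports.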
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name test-bigdl2-client--autoxgboost-classifier \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/automl/autoxgboost/AutoXGBoostClassifier.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/automl/autoxgboost/AutoXGBoostClassifier.py \
--path /bigdl2.0/data/airline_14col.data \
--cluster_mode "spark-submit"
Number of trials: 1/4 (1 RUNNING)
(raylet, ip=172.30.27.4) Traceback (most recent call last):
(raylet, ip=172.30.27.4) File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 22, in <module>
(raylet, ip=172.30.27.4) import ray.new_dashboard.utils as dashboard_utils
(raylet, ip=172.30.27.4) File "/usr/local/envs/pytf1/lib/python3.7/site-packages/ray/new_dashboard/utils.py", line 20, in <module>
(raylet, ip=172.30.27.4) import aiohttp.signals
(raylet, ip=172.30.27.4) ModuleNotFoundError: No module named 'aiohttp.signals'
(pid=235, ip=172.30.27.4) [0] validation_0-error:0.15600
(pid=235, ip=172.30.27.4) [1] validation_0-error:0.15600
(pid=235, ip=172.30.27.4) /usr/local/envs/pytf1/lib/python3.7/site-packages/xgboost/sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
(pid=235, ip=172.30.27.4) warnings.warn(label_encoder_deprecation_msg, UserWarning)
(pid=235, ip=172.30.27.4) /usr/local/envs/pytf1/lib/python3.7/site-packages/sklearn/preprocessing/_label.py:98: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
(pid=235, ip=172.30.27.4) y = column_or_1d(y, warn=True)
(pid=235, ip=172.30.27.4) /usr/local/envs/pytf1/lib/python3.7/site-packages/sklearn/preprocessing/_label.py:133: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
(pid=235, ip=172.30.27.4) y = column_or_1d(y, warn=True)
(pid=235, ip=172.30.27.4) [2] validation_0-error:0.15600
(pid=235, ip=172.30.27.4) [3] validation_0-error:0.15600
AutoXGBoostRegressor.py fails with the same error.
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name test-bigdl2-client--super_resolution \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/pytorch/super_resolution/super_resolution.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/pytorch/super_resolution/super_resolution.py \
--cluster_mode "spark-submit"
creating: createMaxEpoch
2021-11-03 08:30:34 ERROR TaskSetManager:73 - Task 1 in stage 1.0 failed 4 times; aborting job
Traceback (most recent call last):
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/pytorch/cifar10/cifar10.py", line 151, in <module>
checkpoint_trigger=EveryEpoch())
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/pytorch/estimator.py", line 398, in fit
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/pytorch/estimator.py", line 324, in _handle_data_loader
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/feature/common.py", line 389, in pytorch_dataloader
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o61.createFeatureSetFromPyTorch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 16) (172.30.39.4 executor 1):
java.lang.RuntimeException: PYTHONHOME is unset, please set PYTHONHOME first.
Run:
echo $PYTHONHOME
Output:
/usr/local/envs/pytf1
The cifar10.py and fashion_mnist.py examples throw the same exception.
Fix: add the following line to the spark-submit command:
--conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
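The PYTHONHOME value matches the environment root of the interpreter passed to `spark.pyspark.python`. As a rough sketch, the extra `--conf` pair could be derived from that interpreter path rather than typed by hand. The helper name `pythonhome_conf` is hypothetical, and the code assumes a conda-style layout where the interpreter sits at `<env>/bin/python`.

```python
from pathlib import PurePosixPath


def pythonhome_conf(python_path: str) -> list:
    """Build the --conf pair that sets PYTHONHOME on the executors.

    Assumes the interpreter lives at <env>/bin/python, so PYTHONHOME is <env>.
    """
    path = PurePosixPath(python_path)
    if path.parent.name != "bin":
        raise ValueError(f"expected <env>/bin/python, got {python_path}")
    env_root = path.parent.parent
    return ["--conf", f"spark.executorEnv.PYTHONHOME={env_root}"]


if __name__ == "__main__":
    # Interpreter path taken from the spark-submit commands in this issue.
    print(" ".join(pythonhome_conf("/usr/local/envs/pytf1/bin/python")))
```

Deriving the value from one path keeps `spark.pyspark.python` and PYTHONHOME from drifting apart when the environment is renamed.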
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name test-bigdl2-client--transfer_learning \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/transfer_learning/transfer_learning.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/transfer_learning/transfer_learning.py \
--cluster_mode "spark-submit"
BigDLBasePickler registering: bigdl.dllib.utils.common JActivity
Total training cat images: 1000
Total training dog images: 1000
Total validation cat images: 500
Total validation dog images: 500
Traceback (most recent call last):
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/transfer_learning/transfer_learning.py", line 99, in <module>
builder = tfds.ImageFolder(base_dir)
AttributeError: module 'tensorflow_datasets' has no attribute 'ImageFolder'
Stopping orca context
In the pytf1 virtual environment, pip list shows:
tensorflow-datasets 2.0.0
Requires tensorflow-datasets==3.2.0 and h5py < 3.0.0, which have been updated in the Dockerfile.
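A small check like the one below could catch this kind of version mismatch before a job is submitted. It is only a sketch: the function names are hypothetical, the version parser handles plain dotted integers only, and the `installed` mapping would have to be filled from the real `pip list` output of the image.

```python
def parse_version(text: str) -> tuple:
    """Turn '3.2.0' into (3, 2, 0); only plain dotted integers are handled."""
    return tuple(int(part) for part in text.split("."))


def check_pins(installed: dict) -> list:
    """Return human-readable problems; an empty list means both pins hold."""
    problems = []
    tfds = installed.get("tensorflow-datasets")
    if tfds is None or parse_version(tfds) != (3, 2, 0):
        problems.append(f"tensorflow-datasets=={tfds}, need ==3.2.0")
    h5py = installed.get("h5py")
    if h5py is None or parse_version(h5py) >= (3, 0, 0):
        problems.append(f"h5py=={h5py}, need <3.0.0")
    return problems


if __name__ == "__main__":
    # tensorflow-datasets 2.0.0 is from the pip list above;
    # the h5py version here is a made-up placeholder.
    for problem in check_pins({"tensorflow-datasets": "2.0.0", "h5py": "2.10.0"}):
        print(problem)
```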
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name test-bigdl2-client--basic_text_classification \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/inception/inception.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/inception/inception.py \
--folder /tmp/imagenet_to_tfrecord \
--imagenet /tmp/imagenettfrecord/tfrecord \
--cluster_mode yarn --worker_num 4 \
--cores 54 --memory 175G --batchSize 1792 \
--maxIteration 62000 --maxEpoch 100 --learningRate 0.0896 \
--checkpoint /tmp/models/inception \
--cluster_mode "spark-submit"
2021-11-04 01:22:09 INFO DistriOptimizer$:162 - Count dataset
2021-11-04 01:22:10 ERROR TaskSetManager:73 - Task 0 in stage 7.0 failed 4 times; aborting job
2021-11-04 01:22:10 ERROR DistriOptimizer$:1293 - Error: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethod(KerasUtils.scala:302)
at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethodWithEv(KerasUtils.scala:329)
at com.intel.analytics.bigdl.dllib.keras.models.InternalOptimizerUtil$.optimizeModels(Topology.scala:1068)
at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1268)
at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1481)
at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1151)
at com.intel.analytics.bigdl.dllib.estimator.Estimator.train(Estimator.scala:191)
at com.intel.analytics.bigdl.dllib.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 41) (172.30.27.4 executor 1): org.tensorflow.TensorFlowException: /tmp/imagenettfrecord/tfrecord/train/train-00000-of-01024; No such file or directory
[[{{node IteratorGetNext}}]]
at org.tensorflow.Session.run(Native Method)
at org.tensorflow.Session.access$100(Session.java:48)
at org.tensorflow.Session$Runner.runHelper(Session.java:326)
at org.tensorflow.Session$Runner.run(Session.java:276)
at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.$anonfun$run$5(GraphRunner.scala:133)
at com.intel.analytics.bigdl.dllib.common.zooUtils$.timeIt(zooUtils.scala:42)
at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.$anonfun$run$1(GraphRunner.scala:133)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.intel.analytics.bigdl.dllib.common.zooUtils$.timeIt(zooUtils.scala:42)
at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.run(GraphRunner.scala:113)
at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.runOutputs(GraphRunner.scala:102)
at com.intel.analytics.bigdl.orca.tfpark.TFDataFeatureSet$$anon$2.getNext(TFDataFeatureSet.scala:233)
at com.intel.analytics.bigdl.orca.tfpark.TFDataFeatureSet$$anon$2.hasNext(TFDataFeatureSet.scala:221)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.rdd.RDD.$anonfun$reduce$2(RDD.scala:1105)
at org.apache.spark.SparkContext.$anonfun$runJob$6(SparkContext.scala:2290)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2291)
at org.apache.spark.rdd.RDD.$anonfun$reduce$1(RDD.scala:1120)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1102)
at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:164)
... 23 more
Caused by: org.tensorflow.TensorFlowException: /tmp/imagenettfrecord/tfrecord/train/train-00000-of-01024; No such file or directory
[[{{node IteratorGetNext}}]]
at org.tensorflow.Session.run(Native Method)
at org.tensorflow.Session.access$100(Session.java:48)
at org.tensorflow.Session$Runner.runHelper(Session.java:326)
at org.tensorflow.Session$Runner.run(Session.java:276)
at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.$anonfun$run$5(GraphRunner.scala:133)
at com.intel.analytics.bigdl.dllib.common.zooUtils$.timeIt(zooUtils.scala:42)
at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.$anonfun$run$1(GraphRunner.scala:133)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.intel.analytics.bigdl.dllib.common.zooUtils$.timeIt(zooUtils.scala:42)
at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.run(GraphRunner.scala:113)
at com.intel.analytics.bigdl.orca.tfpark.GraphRunner.runOutputs(GraphRunner.scala:102)
at com.intel.analytics.bigdl.orca.tfpark.TFDataFeatureSet$$anon$2.getNext(TFDataFeatureSet.scala:233)
at com.intel.analytics.bigdl.orca.tfpark.TFDataFeatureSet$$anon$2.hasNext(TFDataFeatureSet.scala:221)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.rdd.RDD.$anonfun$reduce$2(RDD.scala:1105)
at org.apache.spark.SparkContext.$anonfun$runJob$6(SparkContext.scala:2290)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
2021-11-04 01:22:10 INFO DistriOptimizer$:1307 - Retrying 1 times
Traceback (most recent call last):
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/inception/inception.py", line 282, in <module>
checkpoint_trigger=checkpoint_trigger)
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/estimator.py", line 593, in fit
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_optimizer.py", line 776, in optimize
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/estimator/estimator.py", line 167, in train_minibatch
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o75.estimatorTrainMiniBatch.
: java.lang.NullPointerException
at com.intel.analytics.bigdl.dllib.optim.AbstractOptimizer.clearState(AbstractOptimizer.scala:241)
at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer.clearState(DistriOptimizer.scala:757)
at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1311)
at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1481)
at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1151)
at com.intel.analytics.bigdl.dllib.estimator.Estimator.train(Estimator.scala:191)
at com.intel.analytics.bigdl.dllib.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Stopping orca context
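The root cause in the log is that an executor could not open `/tmp/imagenettfrecord/tfrecord/train/train-00000-of-01024`. On Kubernetes each pod has its own `/tmp` unless a volume is mounted there, so a pre-flight check run inside an executor pod may help. The sketch below is an assumption-laden illustration: the helper name `missing_shards` is hypothetical, and the shard naming pattern is inferred from the single file name in the log.

```python
import os


def missing_shards(directory: str, prefix: str, total: int) -> list:
    """List expected shard names (<prefix>-00000-of-01024 style) that are absent."""
    expected = [f"{prefix}-{i:05d}-of-{total:05d}" for i in range(total)]
    return [name for name in expected
            if not os.path.isfile(os.path.join(directory, name))]


if __name__ == "__main__":
    # Directory and shard count taken from the failing path in the log.
    absent = missing_shards("/tmp/imagenettfrecord/tfrecord/train", "train", 1024)
    print(f"{len(absent)} of 1024 train shards missing")
```

If every shard is reported missing on the executor, the data probably needs to sit on a path that is visible to all pods, such as the persistent volume mounted at `/bigdl2.0/data` in the commands above.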
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=172.16.0.200 \
--conf spark.driver.port=54321 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name analytics-zoo-autoestimator \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py \
/bigdl2.0/data/imagenet
Traceback (most recent call last):
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py", line 153, in <module>
main()
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py", line 149, in main
validation_method=[Accuracy(), Top5Accuracy()])
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/estimator/estimator.py", line 167, in train_minibatch
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o76.estimatorTrainMiniBatch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 8) (Almaren-Node-200 executor driver):
jep.JepException: jep.JepException: <class 'AttributeError'>: module 'types' has no attribute 'ClassType'
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
The torchmodel/resnet_finetune and torchmodel/mnist examples fail with the same exception.
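For context on the `AttributeError` above: `types.ClassType` was the type of old-style classes in Python 2 and does not exist in Python 3, so the message suggests that some code run by the embedded interpreter still expects Python 2. The log does not show which code that is. The following is only a minimal sketch of the usual compatibility guard; `class_types`, `is_class` and `Example` are hypothetical names for illustration, not part of BigDL or jep.

```python
import types

# Python 2 had new-style classes (type) and old-style classes
# (types.ClassType). Python 3 only has `type`, so look the attribute up
# defensively instead of referencing types.ClassType directly.
class_types = (type,)
if hasattr(types, "ClassType"):  # only true on Python 2
    class_types += (types.ClassType,)


def is_class(obj):
    """Return True if obj is a class on either Python 2 or Python 3."""
    return isinstance(obj, class_types)


class Example:
    pass


print(is_class(Example))    # a class object
print(is_class(Example()))  # an instance, not a class
```

On Python 3 the `hasattr` check is false, so the guard never touches the missing attribute.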
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name test-bigdl2-client--basic_text_classification \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/inception/inception.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py \
--cluster_mode "spark-submit"
Traceback (most recent call last):
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py", line 223, in <module>
args.non_interactive)
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py", line 169, in main
epochs=max_epoch)
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/estimator.py", line 871, in fit
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/estimator.py", line 397, in to_dataset
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/utils.py", line 54, in xshards_to_tf_dataset
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 381, in from_rdd
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 1162, in from_rdd
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 1083, in __init__
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 153, in __init__
ValueError: batch_size should be a multiple of total core number, but got batch_size: 8 where total core number is 64
Build step 'Execute shell' marked build as failure
Finished: FAILURE
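The `ValueError` above states the constraint directly: the batch size (8) has to be a multiple of the total core number (64). A small sketch of how a batch size could be rounded up before it is passed to the example is shown below; `round_up_to_multiple` is a hypothetical helper, not a BigDL function, and whether the example script exposes a batch-size option is not shown in this log.

```python
def round_up_to_multiple(batch_size, total_cores):
    """Return the smallest multiple of total_cores that is >= batch_size."""
    if batch_size <= 0 or total_cores <= 0:
        raise ValueError("batch_size and total_cores must be positive")
    # Ceiling division written with integer arithmetic only.
    return -(-batch_size // total_cores) * total_cores


# Values taken from the error message: batch_size 8, 64 total cores.
print(round_up_to_multiple(8, 64))   # 64
print(round_up_to_multiple(65, 64))  # 128
```

Reducing the total core count so that it divides the batch size would satisfy the same check.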
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name test-bigdl2-client--cifar10 \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/pytorch/cifar10/cifar10.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/pytorch/cifar10/cifar10.py \
--cluster_mode "spark-submit"
creating: createEveryEpoch
creating: createMaxEpoch
2021-11-05 02:18:09 ERROR TaskSetManager:73 - Task 2 in stage 1.0 failed 4 times; aborting job
Traceback (most recent call last):
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/pytorch/cifar10/cifar10.py", line 152, in <module>
checkpoint_trigger=EveryEpoch())
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/pytorch/estimator.py", line 398, in fit
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/pytorch/estimator.py", line 324, in _handle_data_loader
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/feature/common.py", line 389, in pytorch_dataloader
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o61.createFeatureSetFromPyTorch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 16) (172.30.27.4 executor 1):
jep.JepException: jep.JepException: <class 'ModuleNotFoundError'>: No module named 'bigdl.orca'
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
............
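The failing task above ran on an executor (172.30.27.4 executor 1), so `bigdl.orca` was not importable in the Python environment used there. One way to narrow this down is to check importability with the same interpreter the executor uses. The sketch below is only a diagnostic under that assumption; `missing_modules` is a hypothetical helper, and the `sys.path` of the interpreter embedded by jep may differ from that of a standalone Python process.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            # find_spec imports parent packages, so a missing parent
            # (e.g. no `bigdl` at all) raises instead of returning None.
            found = importlib.util.find_spec(name) is not None
        except ImportError:
            found = False
        if not found:
            missing.append(name)
    return missing


# Module name taken from the error message above.
print(missing_modules(["bigdl.orca"]))
```

An empty list means the module can be located by that interpreter. A non-empty list points at the Python path of that environment rather than at the example script.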
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name test-bigdl2-client--basic_text_classification \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf2/resnet/resnet-50-imagenet.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf2/resnet/resnet-50-imagenet.py \
--cluster_mode standalone --worker_num 8 --cores 17 \
--data_dir /tmp/imagenettfrecord/tfrecord --use_bf16 \
--enable_numa_binding \
--cluster_mode "spark-submit"
Traceback (most recent call last):
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf2/resnet/resnet-50-imagenet.py", line 371, in <module>
enable_numa_binding=args.enable_numa_binding)
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/common.py", line 268, in init_orca_context
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/ray/raycontext.py", line 540, in init
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/ray/raycontext.py", line 568, in _start_cluster
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 949, in collect
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(1, 0) finished unsuccessfully.
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 586, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
return self.loads(obj)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
return pickle.loads(obj, encoding=encoding)
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/__init__.py", line 21, in <module>
prepare_env()
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 171, in prepare_env
__prepare_analytics_zoo_env()
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 74, in __prepare_analytics_zoo_env
analytics_zoo_classpath = get_analytics_zoo_classpath()
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 116, in get_analytics_zoo_classpath
raise ValueError("Path {} specified BIGDL_CLASSPATH does not exist.".format(path))
ValueError: Path /opt/bigdl-0.14.0-SNAPSHOT/jars/*.jar specified BIGDL_CLASSPATH does not exist.
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:635)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1968)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2442)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
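The root error in the trace above is the Python `ValueError` reporting that `/opt/bigdl-0.14.0-SNAPSHOT/jars/*.jar`, wildcard included, does not exist. A plain existence check treats `*` as an ordinary character, so a wildcard entry fails it even when the directory contains jars. The log does not establish whether that is what happened here, or whether the jars are simply absent on the worker. The sketch below shows a wildcard-aware check; `missing_classpath_entries` is a hypothetical helper, not the BigDL implementation.

```python
import glob
import os


def missing_classpath_entries(classpath):
    """Return the entries of a classpath string that match no file."""
    missing = []
    for entry in classpath.split(os.pathsep):
        if not entry:
            continue
        # glob.glob expands *, ? and [] patterns. For a plain path it
        # returns [path] when the path exists and [] otherwise.
        if not glob.glob(entry):
            missing.append(entry)
    return missing


# Path taken from the error message above.
print(missing_classpath_entries("/opt/bigdl-0.14.0-SNAPSHOT/jars/*.jar"))
```

Run on the worker, an empty result would mean the jars are present and only the literal-path check fails.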
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name test-bigdl2-client-transfer_learning \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/transfer_learning/transfer_learning.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/transfer_learning/transfer_learning.py \
--cluster_mode "spark-submit"
creating: createMaxEpoch
creating: createEveryEpoch
2021-11-05 02:54:41 INFO DistriOptimizer$:824 - caching training rdd ...
2021-11-05 02:54:41 INFO DistriOptimizer$:650 - Cache thread models...
2021-11-05 02:54:43 INFO DistriOptimizer$:652 - Cache thread models... done
2021-11-05 02:54:43 INFO DistriOptimizer$:162 - Count dataset
2021-11-05 02:54:44 ERROR TaskSetManager:73 - Task 1 in stage 9.0 failed 4 times; aborting job
2021-11-05 02:54:44 ERROR DistriOptimizer$:1293 - Error: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethod(KerasUtils.scala:302)
at com.intel.analytics.bigdl.dllib.keras.layers.utils.KerasUtils$.invokeMethodWithEv(KerasUtils.scala:329)
at com.intel.analytics.bigdl.dllib.keras.models.InternalOptimizerUtil$.optimizeModels(Topology.scala:1068)
at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1268)
at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1481)
at com.intel.analytics.bigdl.dllib.keras.models.InternalDistriOptimizer.train(Topology.scala:1151)
at com.intel.analytics.bigdl.dllib.estimator.Estimator.train(Estimator.scala:191)
at com.intel.analytics.bigdl.dllib.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 9.0 failed 4 times, most recent failure: Lost task 1.3 in stage 9.0 (TID 48) (172.30.39.4 executor 1): org.tensorflow.TensorFlowException: {{function_node __inference_Dataset_map__load_example_102}}
**./datasets/cats_and_dogs_filtered/train/dogs/dog.807.jpg; No such file or directory**
[[{{node ReadFile}}]]
[[IteratorGetNext]]
at org.tensorflow.Session.run(Native Method)
at org.tensorflow.Session.access$100(Session.java:48)
................
However, running: ll ./datasets/cats_and_dogs_filtered/train/dogs/dog.807.jpg
returns: -rw-r--r-- 1 root root 20189 Nov 5 02:53 ./datasets/cats_and_dogs_filtered/train/dogs/dog.807.jpg
So the file does exist.
The file path needs to be the NFS path /bigdl2.0/data/datasets.
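The failing task ran on an executor (172.30.39.4 executor 1), where a relative path such as `./datasets/...` resolves against the executor's own working directory, so a file that is visible where the job was submitted can still be missing there. The spark-submit command mounts the persistent volume at /bigdl2.0/data on both driver and executors. The sketch below rebases a relative dataset path onto that mount; `to_shared_path` is a hypothetical helper, and it assumes the dataset has been copied under /bigdl2.0/data/datasets.

```python
import posixpath

# mount.path value from the spark-submit command in this thread.
SHARED_MOUNT = "/bigdl2.0/data"


def to_shared_path(relative_path, mount=SHARED_MOUNT):
    """Rebase a relative dataset path onto the shared volume mount."""
    if posixpath.isabs(relative_path):
        raise ValueError("expected a relative path, got: " + relative_path)
    return posixpath.normpath(posixpath.join(mount, relative_path))


print(to_shared_path("./datasets/cats_and_dogs_filtered/train/dogs/dog.807.jpg"))
# /bigdl2.0/data/datasets/cats_and_dogs_filtered/train/dogs/dog.807.jpg
```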
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=172.16.0.200 \
--conf spark.driver.port=54321 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name test-bigdl2-client-torchmodel-imagenet \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py \
/bigdl2.0/data/imagenet
creating: createTorchLoss
creating: createEstimator
2021-11-05 03:04:14 ERROR Executor:94 - Exception in task 0.0 in stage 1.0 (TID 1)
jep.JepException: jep.JepException: <class 'ModuleNotFoundError'>: No module named 'pyspark'
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.exec(PythonInterpreter.scala:108)
at com.intel.analytics.bigdl.orca.net.PythonFeatureSet$.$anonfun$loadPythonSet$1(PythonFeatureSet.scala:90)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: jep.JepException: <class 'ModuleNotFoundError'>: No module named 'pyspark'
at <string>.<module>(<string>:2)
at jep.Jep.exec(Native Method)
at jep.Jep.exec(Jep.java:478)
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.$anonfun$exec$1(PythonInterpreter.scala:106)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
2021-11-05 03:04:14 ERROR TaskSetManager:73 - Task 0 in stage 1.0 failed 1 times; aborting job
Traceback (most recent call last):
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py", line 153, in <module>
main()
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/imagenet/main.py", line 145, in main
train_featureSet = FeatureSet.pytorch_dataloader(train_loader)
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/feature/common.py", line 389, in pytorch_dataloader
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.createFeatureSetFromPyTorch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (Almaren-Node-200 executor driver):
jep.JepException: jep.JepException: <class 'ModuleNotFoundError'>: No module named 'pyspark'
at com.intel.analytics.bigdl.orca.utils.PythonInterpreter$.threadExecute(PythonInterpreter.scala:98)
..............
source activate pytf1
export PYTHONHOME=/usr/local/envs/pytf1
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=172.16.0.200 \
--conf spark.driver.port=54321 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name test-bigdl2-client-torchmodel-resnet_finetune \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/resnet_finetune/resnet_finetune.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/resnet_finetune/resnet_finetune.py \
/bigdl2.0/data/dogscats
Traceback (most recent call last):
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/torchmodel/train/resnet_finetune/resnet_finetune.py", line 104, in <module>
catdogModel = classifier.fit(trainingDF)
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 161, in fit
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 335, in _fit
File "/opt/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 332, in _fit_java
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o136.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 7) (Almaren-Node-200 executor driver):
jep.JepException: jep.JepException: <class 'ModuleNotFoundError'>: No module named 'bigdl'
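Both failures above are `ModuleNotFoundError`s raised inside the Python interpreter that jep embeds. A minimal diagnostic sketch, assuming it is run with the same Python binary the job is configured to use (presumably `/usr/local/envs/pytf1/bin/python`, per the `spark.pyspark.python` setting); the helper name `missing_modules` is hypothetical and not part of BigDL:

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be found on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # 'pyspark' and 'bigdl' are the two modules named in the tracebacks above.
    print("interpreter:", sys.executable)
    print("missing:", missing_modules(["pyspark", "bigdl"]))
```

If either name is reported as missing, the interpreter's search path (for example `PYTHONPATH`) probably does not include the Spark Python libraries or the BigDL python-api zip. This only checks the plain interpreter; the jep-embedded one may see a different path.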
source activate pytf1
export PYTHONHOME=/usr/local/envs/pytf1
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name test-bigdl2-client--basic_text_classification \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf2/yolov3/yoloV3.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf2/yolov3/yoloV3.py \
--data_dir /bigdl2.0/data/yolov3 \
--weights /bigdl2.0/data/yolov3/yolov3.weights \
--class_num 2 \
--names /bigdl2.0/data/yolov3/voc2012.names \
--cluster_mode "spark-submit"
2021-11-05 04:45:57 ERROR TaskSetManager:73 - Task 0 in stage 0.0 failed 4 times; aborting job
Traceback (most recent call last):
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf2/yolov3/yoloV3.py", line 695, in <module>
main()
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf2/yolov3/yoloV3.py", line 638, in main
splits_names=[(options.data_year, options.split_name_train)], classes=class_map)
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/data/image/parquet_dataset.py", line 337, in write_parquet
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/data/image/parquet_dataset.py", line 318, in write_voc
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/data/image/parquet_dataset.py", line 74, in write
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 675, in createDataFrame
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 698, in _create_dataframe
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 486, in _createFromRDD
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 460, in _inferSchema
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1586, in first
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1566, in take
File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 1233, in runJob
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (172.30.39.4 executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 586, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
return self.loads(obj)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
return pickle.loads(obj, encoding=encoding)
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/__init__.py", line 21, in <module>
prepare_env()
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 171, in prepare_env
__prepare_analytics_zoo_env()
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 74, in __prepare_analytics_zoo_env
analytics_zoo_classpath = get_analytics_zoo_classpath()
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 116, in get_analytics_zoo_classpath
raise ValueError("Path {} specified BIGDL_CLASSPATH does not exist.".format(path))
ValueError: Path /opt/bigdl-0.14.0-SNAPSHOT/jars/*.jar specified BIGDL_CLASSPATH does not exist.
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:635)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.$anonfun$runJob$1(PythonRDD.scala:166)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 586, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
return self.loads(obj)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
return pickle.loads(obj, encoding=encoding)
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/__init__.py", line 21, in <module>
prepare_env()
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 171, in prepare_env
__prepare_analytics_zoo_env()
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 74, in __prepare_analytics_zoo_env
analytics_zoo_classpath = get_analytics_zoo_classpath()
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 116, in get_analytics_zoo_classpath
raise ValueError("Path {} specified BIGDL_CLASSPATH does not exist.".format(path))
ValueError: Path /opt/bigdl-0.14.0-SNAPSHOT/jars/*.jar specified BIGDL_CLASSPATH does not exist.
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:635)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:470)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.$anonfun$runJob$1(PythonRDD.scala:166)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2236)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
image_segmentation.py
Command
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name test-bigdl2-client--basic_text_classification \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/inception/inception.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py \
--cluster_mode "spark-submit"
Exception
Traceback (most recent call last):
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py", line 223, in <module>
args.non_interactive)
File "/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py", line 169, in main
epochs=max_epoch)
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/estimator.py", line 871, in fit
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/estimator.py", line 397, in to_dataset
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/learn/tf/utils.py", line 54, in xshards_to_tf_dataset
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 381, in from_rdd
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 1162, in from_rdd
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 1083, in __init__
File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/tfpark/tf_dataset.py", line 153, in __init__
ValueError: batch_size should be a multiple of total core number, but got batch_size: 8 where total core number is 64
Build step 'Execute shell' marked build as failure
Finished: FAILURE
Solution
Set batch_size to a multiple of the total core number (64, or 64*n), or change the total core number.
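The constraint in the error message (batch_size must be a multiple of the total core number) can be checked before submitting. A minimal sketch that rounds a requested batch size up to the nearest valid value; `round_up_to_multiple` is a hypothetical helper, not a BigDL function:

```python
def round_up_to_multiple(batch_size, total_cores):
    """Return the smallest multiple of total_cores that is >= batch_size."""
    if batch_size <= 0 or total_cores <= 0:
        raise ValueError("batch_size and total_cores must be positive")
    # Ceiling division using integer arithmetic only.
    return -(-batch_size // total_cores) * total_cores


# Values from the error message: batch_size 8, total core number 64.
print(round_up_to_multiple(8, 64))    # 64
print(round_up_to_multiple(100, 64))  # 128
```

The result can then be passed to the example through its `--batch_size` argument.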
This problem is fixed. With the command below, the example now fails with a different error:
${SPARK_HOME}/bin/spark-submit \
--master ${RUNTIME_SPARK_MASTER} \
--deploy-mode client \
--conf spark.driver.host=${RUNTIME_DRIVER_HOST} \
--conf spark.driver.port=${RUNTIME_DRIVER_PORT} \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=${RUNTIME_K8S_SERVICE_ACCOUNT} \
--name test-bigdl2-client--basic_text_classification \
--conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
--conf spark.executor.instances=${RUNTIME_EXECUTOR_INSTANCES} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.options.claimName=${RUNTIME_PERSISTENT_VOLUME_CLAIM} \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.${RUNTIME_PERSISTENT_VOLUME_CLAIM}.mount.path=/bigdl2.0/data \
--conf spark.kubernetes.driver.label.az=true \
--conf spark.kubernetes.executor.label.az=true \
--conf spark.kubernetes.node.selector.spark=true \
--conf spark.kubernetes.driverEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.driverEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.executorEnv.http_proxy=${http_proxy} \
--conf spark.kubernetes.executorEnv.https_proxy=${https_proxy} \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.pyspark.driver.python=/usr/local/envs/pytf1/bin/python \
--conf spark.pyspark.python=/usr/local/envs/pytf1/bin/python \
--conf spark.executorEnv.PYTHONHOME=/usr/local/envs/pytf1 \
--executor-cores ${RUNTIME_EXECUTOR_CORES} \
--executor-memory ${RUNTIME_EXECUTOR_MEMORY} \
--total-executor-cores ${RUNTIME_TOTAL_EXECUTOR_CORES} \
--driver-cores ${RUNTIME_DRIVER_CORES} \
--driver-memory ${RUNTIME_DRIVER_MEMORY} \
--properties-file ${BIGDL_HOME}/conf/spark-bigdl.conf \
--py-files ${BIGDL_HOME}/python/bigdl-orca-spark_${SPARK_VERSION}-${BIGDL_VERSION}-python-api.zip,/opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/inception/inception.py \
--conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
--conf spark.sql.catalogImplementation='in-memory' \
--conf spark.driver.extraClassPath=local://${BIGDL_HOME}/jars/* \
--conf spark.executor.extraClassPath=local://${BIGDL_HOME}/jars/* \
local:///opt/bigdl-0.14.0-SNAPSHOT/examples/orca/learn/tf/image_segmentation/image_segmentation.py \
--batch_size 64 \
--file_path /bigdl2.0/data/carvana \
--cluster_mode "spark-submit"
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 586, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
return self.loads(obj)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
return pickle.loads(obj, encoding=encoding)
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 668, in _load_unlocked
File "<frozen importlib._bootstrap>", line 638, in _load_backward_compatible
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/orca/__init__.py", line 21, in <module>
prepare_env()
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 171, in prepare_env
__prepare_analytics_zoo_env()
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 74, in __prepare_analytics_zoo_env
analytics_zoo_classpath = get_analytics_zoo_classpath()
File "./bigdl-orca-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/zoo_engine.py", line 116, in get_analytics_zoo_classpath
raise ValueError("Path {} specified BIGDL_CLASSPATH does not exist.".format(path))
ValueError: Path /opt/bigdl-0.14.0-SNAPSHOT/jars/* specified BIGDL_CLASSPATH does not exist.
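The message names a `BIGDL_CLASSPATH` entry ending in a wildcard (`jars/*`). One possible explanation, stated here as an assumption since the check in `zoo_engine.py` is not shown in this thread, is that each entry is tested literally for existence, so a wildcard string fails even when the directory contains jars. A second possibility is that the directory is simply absent on the executor. A diagnostic sketch that can tell the two cases apart when run inside an executor container (`check_classpath` is a hypothetical helper):

```python
import glob
import os


def check_classpath(classpath):
    """Map each ':'-separated entry to (exists_literally, sorted_glob_matches)."""
    report = {}
    for entry in classpath.split(":"):
        if entry:
            report[entry] = (os.path.exists(entry), sorted(glob.glob(entry)))
    return report


if __name__ == "__main__":
    value = os.environ.get("BIGDL_CLASSPATH", "")
    for entry, (literal, matches) in check_classpath(value).items():
        print(entry, "| exists literally:", literal, "| glob matches:", len(matches))
```

An entry that reports `exists literally: False` but a nonzero number of glob matches points to the wildcard being taken literally. Zero matches means the jars are not present at that path on that machine.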
tf and tf2 tests in cluster mode: success: http://10.112.231.51:18888/view/BigDL-2.0-NB/job/BigDL2.0-K8s-ExampleTests-Part3-Cluster/1/console
tf and tf2 tests in client mode (without resnet): http://10.112.231.51:18888/view/BigDL-2.0-NB/job/BigDL2.0-K8s-ExampleTests-Part3/