intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

submit-examples-on-k8s: two issues about the tfpark estimator_inception example #265

Open zzti-bsj opened 3 years ago

zzti-bsj commented 3 years ago

tfpark estimator_inception

This example has two parts that need to be improved.

Command

/opt/spark/bin/spark-submit \
  --master k8s://https://127.0.0.1:8443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --name analytics-zoo \
  --conf spark.kubernetes.container.image=${RUNTIME_K8S_SPARK_IMAGE} \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/zoo \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName=nfsvolumeclaim \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path=/zoo \
  --conf spark.kubernetes.driver.label.az=true \
  --conf spark.kubernetes.executor.label.az=true \
  --conf spark.kubernetes.node.selector.spark=true \
  --executor-cores 16 \
  --executor-memory 20g \
  --total-executor-cores 64 \
  --driver-cores 4 \
  --driver-memory 50g \
  --properties-file /opt/analytics-zoo-0.11.0-SNAPSHOT/conf/spark-analytics-zoo.conf \
  --py-files /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip,/opt/analytics-zoo-examples/python/tensorflow/tfpark/estimator/estimator_inception.py \
  --conf spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp \
  --conf spark.sql.catalogImplementation='in-memory' \
  --conf spark.driver.extraClassPath=/opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-jar-with-dependencies.jar \
  --conf spark.executor.extraClassPath=/opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-jar-with-dependencies.jar \
  file:///opt/analytics-zoo-examples/python/tensorflow/tfpark/estimator/estimator_inception.py \
  --image-path /zoo/data2/cat_dog \
  --num-classes 2

Issue 1 - ValueError: batch_size should be a multiple of total core number, but got batch_size: 16 where total core number is 64

https://github.com/intel-analytics/analytics-zoo/blob/master/docker/hyperzoo/submit-examples-on-k8s.md

The tfpark estimator_inception example does not provide a --batchSize option, which causes an exception when the example is executed:

ValueError: batch_size should be a multiple of total core number, but got batch_size: 16 where total core number is 64

Recommendation: add a --batchSize option to the main function of estimator_inception.py (see the sketch below).
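
A minimal sketch of what this could look like, assuming the example parses its arguments with optparse like other tfpark examples do (the option and variable names here are illustrative, not the actual patch):

```python
# Sketch only (not the actual patch): expose the batch size on the command line
# instead of hard-coding it, so it can be set to a multiple of the total core
# number reported by the error (64 on this cluster).
from optparse import OptionParser

if __name__ == "__main__":
    parser = OptionParser()
    parser.add_option("--image-path", dest="image_path",
                      help="directory containing the training images")
    parser.add_option("--num-classes", dest="num_classes", type="int", default=2,
                      help="number of output classes")
    parser.add_option("--batchSize", dest="batch_size", type="int", default=16,
                      help="batch size; must be a multiple of the total core number")
    (options, args) = parser.parse_args()
    print("image_path=%s num_classes=%d batch_size=%d"
          % (options.image_path, options.num_classes, options.batch_size))
    # main(options) in the real example would then pass options.batch_size down
    # to the TFDataset instead of the hard-coded value.
```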

Issue 2 - org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 11.0 failed 4 times, most recent failure: Lost task 2.3 in stage 11.0 (TID 34, 172.30.14.5, executor 1): java.lang.ArithmeticException: / by zero

When I manually changed the batch_size to 64 and re-ran the example, another exception occurred (the manual change is sketched after the log below):

+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@" $PYSPARK_PRIMARY $PYSPARK_ARGS)
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.30.14.4 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner file:///opt/analytics-zoo-examples/python/tensorflow/tfpark/estimator/estimator_inception.py --image-path /zoo/data2/cat_dog --num-classes 2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2021-05-19 08:37:30 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARNING:tensorflow:From /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/zoo_optimizer.py:73: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

pyspark_submit_args is:  --driver-class-path /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-jar-with-dependencies.jar pyspark-shell
2021-05-19 08:37:33 INFO  SparkContext:54 - Running Spark version 2.4.3
2021-05-19 08:37:33 INFO  SparkContext:54 - Submitted application: analytics-zoo
2021-05-19 08:37:33 INFO  SecurityManager:54 - Changing view acls to: root
2021-05-19 08:37:33 INFO  SecurityManager:54 - Changing modify acls to: root
2021-05-19 08:37:33 INFO  SecurityManager:54 - Changing view acls groups to:
2021-05-19 08:37:33 INFO  SecurityManager:54 - Changing modify acls groups to:
2021-05-19 08:37:33 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2021-05-19 08:37:33 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 7078.
2021-05-19 08:37:33 INFO  SparkEnv:54 - Registering MapOutputTracker
2021-05-19 08:37:33 INFO  SparkEnv:54 - Registering BlockManagerMaster
2021-05-19 08:37:33 INFO  BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2021-05-19 08:37:33 INFO  BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2021-05-19 08:37:33 INFO  DiskBlockManager:54 - Created local directory at /var/data/spark-6cdf4fe1-1f5c-4492-93f3-f89dde2927da/blockmgr-4007db47-e163-435e-9f75-752e22cf05ec
2021-05-19 08:37:33 INFO  MemoryStore:54 - MemoryStore started with capacity 26.5 GB
2021-05-19 08:37:33 INFO  SparkEnv:54 - Registering OutputCommitCoordinator
2021-05-19 08:37:33 INFO  log:192 - Logging initialized @4451ms
2021-05-19 08:37:33 INFO  Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2021-05-19 08:37:33 INFO  Server:419 - Started @4513ms
2021-05-19 08:37:33 INFO  AbstractConnector:278 - Started ServerConnector@1f61fe06{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2021-05-19 08:37:33 INFO  Utils:54 - Successfully started service 'SparkUI' on port 4040.
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5360b238{/jobs,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2ab44f4c{/jobs/json,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@68208be7{/jobs/job,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@109bad0c{/jobs/job/json,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@185d5e7c{/stages,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@74ce8123{/stages/json,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4afaec4d{/stages/stage,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@755f8bf0{/stages/stage/json,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@68da8992{/stages/pool,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4961dd09{/stages/pool/json,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7a9510df{/storage,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@739c1476{/storage/json,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4f36958e{/storage/rdd,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@8f880f4{/storage/rdd/json,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5a397adc{/environment,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4995fa49{/environment/json,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@32f48c10{/executors,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7ec7459a{/executors/json,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6f393f04{/executors/threadDump,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@998c718{/executors/threadDump/json,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1f1a4e37{/static,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@478777cf{/,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@dd9164{/api,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@31017e69{/jobs/job/kill,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6c4cc6b3{/stages/stage/kill,null,AVAILABLE,@Spark}
2021-05-19 08:37:33 INFO  SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://analytics-zoo-1621413464769-driver-svc.default.svc:4040
2021-05-19 08:37:34 INFO  SparkContext:54 - Added file file:///opt/analytics-zoo-examples/python/tensorflow/tfpark/estimator/estimator_inception.py at spark://analytics-zoo-1621413464769-driver-svc.default.svc:7078/files/estimator_inception.py with timestamp 1621413454008
2021-05-19 08:37:34 INFO  Utils:54 - Copying /opt/analytics-zoo-examples/python/tensorflow/tfpark/estimator/estimator_inception.py to /var/data/spark-6cdf4fe1-1f5c-4492-93f3-f89dde2927da/spark-a6a8ffe7-c463-4e1d-9ef8-9ffbcc8503a2/userFiles-d5bea555-ee96-4821-a1fd-ea86103a1864/estimator_inception.py
2021-05-19 08:37:34 INFO  SparkContext:54 - Added file file:///opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip at spark://analytics-zoo-1621413464769-driver-svc.default.svc:7078/files/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip with timestamp 1621413454021
2021-05-19 08:37:34 INFO  Utils:54 - Copying /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip to /var/data/spark-6cdf4fe1-1f5c-4492-93f3-f89dde2927da/spark-a6a8ffe7-c463-4e1d-9ef8-9ffbcc8503a2/userFiles-d5bea555-ee96-4821-a1fd-ea86103a1864/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip
2021-05-19 08:37:34 WARN  SparkContext:66 - The path file:///opt/analytics-zoo-examples/python/tensorflow/tfpark/estimator/estimator_inception.py has been added already. Overwriting of added paths is not supported in the current version.
2021-05-19 08:37:35 INFO  ExecutorPodsAllocator:54 - Going to request 1 executors from Kubernetes.
2021-05-19 08:37:35 INFO  Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
2021-05-19 08:37:35 INFO  NettyBlockTransferService:54 - Server created on analytics-zoo-1621413464769-driver-svc.default.svc:7079
2021-05-19 08:37:35 INFO  BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2021-05-19 08:37:35 INFO  BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, analytics-zoo-1621413464769-driver-svc.default.svc, 7079, None)
2021-05-19 08:37:35 INFO  BlockManagerMasterEndpoint:54 - Registering block manager analytics-zoo-1621413464769-driver-svc.default.svc:7079 with 26.5 GB RAM, BlockManagerId(driver, analytics-zoo-1621413464769-driver-svc.default.svc, 7079, None)
2021-05-19 08:37:35 INFO  BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, analytics-zoo-1621413464769-driver-svc.default.svc, 7079, None)
2021-05-19 08:37:35 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, analytics-zoo-1621413464769-driver-svc.default.svc, 7079, None)
2021-05-19 08:37:35 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6b21f285{/metrics/json,null,AVAILABLE,@Spark}
2021-05-19 08:37:39 INFO  KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.30.14.5:37420) with ID 1
2021-05-19 08:37:39 INFO  KubernetesClusterSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 1.0
cls.getname: com.intel.analytics.bigdl.python.api.Sample
BigDLBasePickler registering: bigdl.util.common  Sample
cls.getname: com.intel.analytics.bigdl.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.util.common  EvaluatedResult
cls.getname: com.intel.analytics.bigdl.python.api.JTensor
BigDLBasePickler registering: bigdl.util.common  JTensor
cls.getname: com.intel.analytics.bigdl.python.api.JActivity
BigDLBasePickler registering: bigdl.util.common  JActivity
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmpd114ds5v
WARNING:tensorflow:From /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/estimator.py:140: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/estimator.py:141: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

creating: createImageBytesToMat
creating: createImageResize
creating: createImageRandomCrop
creating: createImageHFlip
creating: createImageRandomPreprocessing
creating: createImageChannelNormalize
creating: createImageMatToTensor
creating: createImageSetToSample
creating: createChainedPreprocessing
creating: createImageFeatureToSample
WARNING:tensorflow:From /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/tf_dataset.py:211: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/tf_dataset.py:212: The name tf.add_to_collection is deprecated. Please use tf.compat.v1.add_to_collection instead.

WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /root/miniconda3/lib/python3.7/site-packages/tf_slim/layers/layers.py:1089: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /opt/analytics-zoo-examples/python/tensorflow/tfpark/estimator/estimator_inception.py:69: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.

WARNING:tensorflow:From /root/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/losses/losses_impl.py:121: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /opt/analytics-zoo-examples/python/tensorflow/tfpark/estimator/estimator_inception.py:70: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/estimator.py:153: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2021-05-19 08:37:46.929729: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-05-19 08:37:46.929777: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2021-05-19 08:37:46.929809: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (analytics-zoo-1621413464769-driver): /proc/driver/nvidia/version does not exist
2021-05-19 08:37:46.930141: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-05-19 08:37:46.953129: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2194760000 Hz
2021-05-19 08:37:46.965457: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55bbb202fc20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-05-19 08:37:46.965487: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:From /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/estimator.py:154: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/estimator.py:158: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

creating: createFakeOptimMethod
WARNING:tensorflow:From /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/tf_optimizer.py:299: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
WARNING:tensorflow:From /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/tf_optimizer.py:191: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/tf_optimizer.py:206: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

WARNING:tensorflow:From /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/tf_optimizer.py:280: The name tf.tables_initializer is deprecated. Please use tf.compat.v1.tables_initializer instead.

WARNING:tensorflow:From /opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/tf_optimizer.py:289: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

creating: createTFTrainingHelper
linux-x86_64/libiomp5.so
linux-x86_64/libmklml_intel.so
linux-x86_64/libtensorflow_framework-zoo.so
linux-x86_64/libtensorflow_jni.so
2021-05-19 08:37:51.997625: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2021-05-19 08:37:52.004683: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2194760000 Hz
2021-05-19 08:37:52.004920: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7faba991a030 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-05-19 08:37:52.004952: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
creating: createIdentityCriterion
creating: createMergeFeatureLabelFeatureTransformer
creating: createSampleToMiniBatch
creating: createEstimator
creating: createMaxIteration
creating: createEveryEpoch
2021-05-19 08:37:55 INFO  DistriOptimizer$:818 - caching training rdd ...
2021-05-19 08:37:57 INFO  DistriOptimizer$:649 - Cache thread models...
2021-05-19 08:38:03 INFO  DistriOptimizer$:651 - Cache thread models... done
2021-05-19 08:38:03 INFO  DistriOptimizer$:161 - Count dataset
2021-05-19 08:38:03 INFO  DistriOptimizer$:165 - Count dataset complete. Time elapsed: 0.159493995s
2021-05-19 08:38:03 INFO  DistriOptimizer$:173 - config  {
        computeThresholdbatchSize: 100
        maxDropPercentage: 0.0
        warmupIterationNum: 200
        isLayerwiseScaled: false
        dropPercentage: 0.0
 }
2021-05-19 08:38:03 INFO  DistriOptimizer$:177 - Shuffle data
2021-05-19 08:38:03 INFO  DistriOptimizer$:180 - Shuffle data complete. Takes 0.018781056s
2021-05-19 08:38:04 ERROR TaskSetManager:70 - Task 2 in stage 11.0 failed 4 times; aborting job
2021-05-19 08:38:04 ERROR DistriOptimizer$:1287 - Error: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.intel.analytics.zoo.pipeline.api.keras.layers.utils.KerasUtils$.invokeMethod(KerasUtils.scala:302)
        at com.intel.analytics.zoo.pipeline.api.keras.layers.utils.KerasUtils$.invokeMethodWithEv(KerasUtils.scala:329)
        at com.intel.analytics.zoo.pipeline.api.keras.models.InternalOptimizerUtil$.optimizeModels(Topology.scala:1063)
        at com.intel.analytics.zoo.pipeline.api.keras.models.InternalDistriOptimizer.train(Topology.scala:1262)
        at com.intel.analytics.zoo.pipeline.api.keras.models.InternalDistriOptimizer.train(Topology.scala:1475)
        at com.intel.analytics.zoo.pipeline.api.keras.models.InternalDistriOptimizer.train(Topology.scala:1145)
        at com.intel.analytics.zoo.pipeline.estimator.Estimator.train(Estimator.scala:190)
        at com.intel.analytics.zoo.pipeline.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:117)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 11.0 failed 4 times, most recent failure: Lost task 2.3 in stage 11.0 (TID 34, 172.30.14.5, executor 1): java.lang.ArithmeticException: / by zero
        at com.intel.analytics.zoo.feature.CachedDistributedFeatureSet$$anonfun$data$2$$anon$2.next(FeatureSet.scala:282)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:331)
        at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:323)
        at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:228)
        at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:218)
        at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
        at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1035)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.reduce(RDD.scala:1017)
        at com.intel.analytics.bigdl.optim.DistriOptimizer$.optimize(DistriOptimizer.scala:353)
        ... 23 more
Caused by: java.lang.ArithmeticException: / by zero
        at com.intel.analytics.zoo.feature.CachedDistributedFeatureSet$$anonfun$data$2$$anon$2.next(FeatureSet.scala:282)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:331)
        at com.intel.analytics.bigdl.dataset.SampleToMiniBatch$$anon$2.next(Transformer.scala:323)
        at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:228)
        at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$4.apply(DistriOptimizer.scala:218)
        at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        ... 1 more

2021-05-19 08:38:04 INFO  DistriOptimizer$:1301 - Retrying 1 times
Traceback (most recent call last):
  File "/opt/analytics-zoo-examples/python/tensorflow/tfpark/estimator/estimator_inception.py", line 90, in <module>
    main(options)
  File "/opt/analytics-zoo-examples/python/tensorflow/tfpark/estimator/estimator_inception.py", line 80, in main
    estimator.train(input_fn, steps=100)
  File "/opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/estimator.py", line 170, in train
  File "/opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/tfpark/tf_optimizer.py", line 780, in optimize
  File "/opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/pipeline/estimator/estimator.py", line 168, in train_minibatch
  File "/opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/common/utils.py", line 135, in callZooFunc
  File "/opt/analytics-zoo-0.11.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.11.0-SNAPSHOT-python-api.zip/zoo/common/utils.py", line 129, in callZooFunc
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o81.estimatorTrainMiniBatch.
: java.lang.NullPointerException
        at com.intel.analytics.bigdl.optim.AbstractOptimizer.clearState(AbstractOptimizer.scala:240)
        at com.intel.analytics.bigdl.optim.DistriOptimizer.clearState(DistriOptimizer.scala:751)
        at com.intel.analytics.zoo.pipeline.api.keras.models.InternalDistriOptimizer.train(Topology.scala:1305)
        at com.intel.analytics.zoo.pipeline.api.keras.models.InternalDistriOptimizer.train(Topology.scala:1475)
        at com.intel.analytics.zoo.pipeline.api.keras.models.InternalDistriOptimizer.train(Topology.scala:1145)
        at com.intel.analytics.zoo.pipeline.estimator.Estimator.train(Estimator.scala:190)
        at com.intel.analytics.zoo.pipeline.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:117)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
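
For reference, the manual change mentioned above was made where the example builds its dataset in input_fn, roughly as in the sketch below (this assumes the example uses TFDataset.from_image_set like the other tfpark image examples; the imports, preprocessing chain, and exact signatures are abbreviated and may differ from the real code):

```python
# Sketch of the manual test change only, not the example's exact code:
# the hard-coded batch size in input_fn was raised from 16 to 64 so that it is
# a multiple of the reported total core number.
import tensorflow as tf
from zoo.feature.image import ImageSet   # import path assumed, may differ
from zoo.tfpark import TFDataset

def make_input_fn(sc, image_path, batch_size=64):  # was effectively 16
    def input_fn(mode, params):
        image_set = ImageSet.read(image_path, sc=sc)
        # ... same label extraction and preprocessing chain as the example
        # (resize, random crop, flip, channel normalize, to tensor) ...
        return TFDataset.from_image_set(image_set,
                                        image=(tf.float32, [224, 224, 3]),
                                        label=(tf.int32, [1]),
                                        batch_size=batch_size)
    return input_fn
```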
glorysdj commented 3 years ago

@zzti-bsj Please help to address the first one: add --batch_size to python/tensorflow/tfpark/estimator/estimator_inception.py.