When setting num_executors=2, the job runs fine. However, raising the number of executors triggers the error below.
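For context, here is a minimal sketch of how the job is presumably launched, reconstructed from the traceback. The hadoop_conf path, the resource settings, and the model_fn/input_fn definitions in dogs-cats-test.py are assumed placeholders, not the actual code; only the estimator.train(input_fn, steps=10) call is taken verbatim from the traceback.

from zoo import init_spark_on_yarn
from zoo.tfpark.estimator import TFEstimator

# Works with num_executors=2; raising the executor count reproduces the failure.
sc = init_spark_on_yarn(
    hadoop_conf="/path/to/hadoop/conf",  # placeholder path
    conda_name="langchao",               # conda env name seen in the traceback
    num_executors=8,                     # larger values trigger the error
    executor_cores=4,                    # placeholder resource settings
    executor_memory="10g",
    driver_memory="2g")

estimator = TFEstimator.from_model_fn(model_fn)  # model_fn defined elsewhere
estimator.train(input_fn, steps=10)              # line 163 of dogs-cats-test.py

With the larger executor count, the key failure is: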
py4j.protocol.Py4JJavaError: An error occurred while calling o71.estimatorTrainMiniBatch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 20, Almaren-Node-040, executor 8): java.io.InvalidClassException:
com.intel.analytics.bigdl.transform.vision.image.opencv.OpenCVMat; unable to create instance
Full error information:
2020-09-25 16:54:26.543211: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
creating: createIdentityCriterion
creating: createMergeFeatureLabelFeatureTransformer
creating: createSampleToMiniBatch
creating: createEstimator
creating: createMaxIteration
creating: createEveryEpoch
2020-09-25 16:54:30 INFO DistriOptimizer$:808 - caching training rdd ...
[Stage 1:> (0 + 8) / 8]2020-09-25 16:57:14 ERROR TaskSetManager:70 - Task 1 in stage 1.0 failed 4 times; aborting job
Traceback (most recent call last):
File "dogs-cats-test.py", line 163, in <module>
estimator.train(input_fn, steps=10)
File "/opt/work/client/anaconda3/envs/langchao/lib/python3.7/site-packages/zoo/tfpark/estimator.py", line 169, in train
opt.optimize(MaxIteration(steps))
File "/opt/work/client/anaconda3/envs/langchao/lib/python3.7/site-packages/zoo/tfpark/tf_optimizer.py", line 744, in optimize
end_trigger=end_trigger)
File "/opt/work/client/anaconda3/envs/langchao/lib/python3.7/site-packages/zoo/pipeline/estimator/estimator.py", line 168, in train_minibatch
validation_method)
File "/opt/work/client/anaconda3/envs/langchao/lib/python3.7/site-packages/zoo/common/utils.py", line 133, in callZooFunc
raise e
File "/opt/work/client/anaconda3/envs/langchao/lib/python3.7/site-packages/zoo/common/utils.py", line 127, in callZooFunc
java_result = api(*args)
File "/opt/work/client/anaconda3/envs/langchao/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/work/client/anaconda3/envs/langchao/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o71.estimatorTrainMiniBatch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 20, Almaren-Node-040, executor 8): java.io.InvalidClassException:
com.intel.analytics.bigdl.transform.vision.image.opencv.OpenCVMat; unable to create instance
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1788)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at scala.collection.mutable.HashMap$$anonfun$readObject$1.apply(HashMap.scala:174)
at scala.collection.mutable.HashMap$$anonfun$readObject$1.apply(HashMap.scala:174)
at scala.collection.mutable.HashTable$class.init(HashTable.scala:109)
at scala.collection.mutable.HashMap.init(HashMap.scala:40)
at scala.collection.mutable.HashMap.readObject(HashMap.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:158)
at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:188)
at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:185)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
at com.intel.analytics.zoo.feature.DRAMFeatureSet$$anonfun$7.apply(FeatureSet.scala:637)
at com.intel.analytics.zoo.feature.DRAMFeatureSet$$anonfun$7.apply(FeatureSet.scala:636)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)