intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

Dogs_vs_Cats example OpenCVMat issue #622

Closed Le-Zheng closed 4 years ago

Le-Zheng commented 4 years ago

When setting num_executors=2, it works well. However, with a larger number of executors, the job fails with the following error:
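For reference, a minimal sketch of how the executor count might be varied when launching the Dogs-vs-Cats script with spark-submit (the paths, memory sizes, and master URL here are illustrative assumptions, not the reporter's actual configuration):

```shell
# Hypothetical launch command for the Dogs-vs-Cats example.
# Works with --num-executors 2; the failure below reportedly appears
# when --num-executors is raised (e.g. to 8, matching the "(0 + 8) / 8"
# stage output in the log).
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 8 \
  --executor-cores 4 \
  --executor-memory 10g \
  --driver-memory 10g \
  dogs-cats-test.py
```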

py4j.protocol.Py4JJavaError: An error occurred while calling o71.estimatorTrainMiniBatch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 20, Almaren-Node-040, executor 8): java.io.InvalidClassException: 
com.intel.analytics.bigdl.transform.vision.image.opencv.OpenCVMat; unable to create instance

Full error information:

2020-09-25 16:54:26.543211: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
creating: createIdentityCriterion
creating: createMergeFeatureLabelFeatureTransformer
creating: createSampleToMiniBatch
creating: createEstimator
creating: createMaxIteration
creating: createEveryEpoch
2020-09-25 16:54:30 INFO  DistriOptimizer$:808 - caching training rdd ...
[Stage 1:>                                                          (0 + 8) / 8]2020-09-25 16:57:14 ERROR TaskSetManager:70 - Task 1 in stage 1.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "dogs-cats-test.py", line 163, in <module>
    estimator.train(input_fn, steps=10)
  File "/opt/work/client/anaconda3/envs/langchao/lib/python3.7/site-packages/zoo/tfpark/estimator.py", line 169, in train
    opt.optimize(MaxIteration(steps))
  File "/opt/work/client/anaconda3/envs/langchao/lib/python3.7/site-packages/zoo/tfpark/tf_optimizer.py", line 744, in optimize
    end_trigger=end_trigger)
  File "/opt/work/client/anaconda3/envs/langchao/lib/python3.7/site-packages/zoo/pipeline/estimator/estimator.py", line 168, in train_minibatch
    validation_method)
  File "/opt/work/client/anaconda3/envs/langchao/lib/python3.7/site-packages/zoo/common/utils.py", line 133, in callZooFunc
    raise e
  File "/opt/work/client/anaconda3/envs/langchao/lib/python3.7/site-packages/zoo/common/utils.py", line 127, in callZooFunc
    java_result = api(*args)
  File "/opt/work/client/anaconda3/envs/langchao/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/work/client/anaconda3/envs/langchao/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o71.estimatorTrainMiniBatch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 20, Almaren-Node-040, executor 8): java.io.InvalidClassException: 
com.intel.analytics.bigdl.transform.vision.image.opencv.OpenCVMat; unable to create instance
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1788)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
        at scala.collection.mutable.HashMap$$anonfun$readObject$1.apply(HashMap.scala:174)
        at scala.collection.mutable.HashMap$$anonfun$readObject$1.apply(HashMap.scala:174)
        at scala.collection.mutable.HashTable$class.init(HashTable.scala:109)
        at scala.collection.mutable.HashMap.init(HashMap.scala:40)
        at scala.collection.mutable.HashMap.readObject(HashMap.scala:174)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
        at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:158)
        at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:188)
        at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:185)
        at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
        at scala.collection.AbstractIterator.to(Iterator.scala:1334)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
        at com.intel.analytics.zoo.feature.DRAMFeatureSet$$anonfun$7.apply(FeatureSet.scala:637)
        at com.intel.analytics.zoo.feature.DRAMFeatureSet$$anonfun$7.apply(FeatureSet.scala:636)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
Le-Zheng commented 4 years ago

related to https://github.com/intel-analytics/analytics-zoo-internal/issues/616