glorysdj opened this issue 3 years ago
I hit a similar problem when running customloss.py. After changing the local optimizer to DistriOptimizer, it fails with a new error: `Task 3 in stage 3.0 failed 4 times, most recent failure: Lost task 3.3 in stage 3.0 (TID 26) (172.30.27.4 executor 1): java.lang.IllegalArgumentException: requirement failed: firstIndex(3) out of range [0, 3)`
Full log:

```
2021-10-25 04:51:41 INFO DistriOptimizer$:824 - caching training rdd ...
2021-10-25 04:51:48 INFO DistriOptimizer$:650 - Cache thread models...
2021-10-25 04:51:49 ERROR TaskSetManager:73 - Task 3 in stage 3.0 failed 4 times; aborting job
2021-10-25 04:51:49 ERROR TaskSetManager:73 - Task 3 in stage 3.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "/opt/bigdl-0.14.0-SNAPSHOT/examples/dllib/autograd/custom.py", line 59, in <module>
    distributed=True)
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-friesian-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/keras/engine/topology.py", line 239, in fit
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-friesian-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 164, in callZooFunc
  File "/opt/bigdl-0.14.0-SNAPSHOT/python/bigdl-friesian-spark_3.1.2-0.14.0-SNAPSHOT-python-api.zip/bigdl/dllib/utils/file_utils.py", line 158, in callZooFunc
  File "/opt/work/spark-3.1.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/opt/work/spark-3.1.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o61.zooFit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 3.0 failed 4 times, most recent failure: Lost task 3.3 in stage 3.0 (TID 26) (172.30.27.4 executor 1): java.lang.IllegalArgumentException: requirement failed: firstIndex(3) out of range [0, 3)
	at scala.Predef$.require(Predef.scala:281)
	at com.intel.analytics.bigdl.dllib.tensor.DenseTensor$.narrow(DenseTensor.scala:2618)
	at com.intel.analytics.bigdl.dllib.tensor.DenseTensor.narrow(DenseTensor.scala:444)
	at com.intel.analytics.bigdl.dllib.optim.parameters.AllReduceParameter.init(AllReduceParameter.scala:164)
	at com.intel.analytics.bigdl.dllib.optim.DistriOptimizer$.$anonfun$initThreadModels$2(DistriOptimizer.scala:635)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
```
Try increasing the batch size. BigDL's DistriOptimizer requires the global batch size to be divisible by the total number of executor cores (num-executors × executor-cores), and a batch size that is too small for the partition count appears to trigger this out-of-range failure when `AllReduceParameter.init` narrows the parameter tensor.
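As a minimal sketch of that constraint, the helper below rounds a desired batch size up to the nearest multiple of the total core count. The executor/core numbers are hypothetical placeholders, not values from this job:

```python
def min_valid_batch_size(desired, num_executors, cores_per_executor):
    """Round a desired global batch size up to a multiple of the total
    executor core count, which BigDL's DistriOptimizer requires so that
    each mini-batch splits evenly across cores."""
    total_cores = num_executors * cores_per_executor
    # Ceiling-divide, then scale back up to the next valid multiple.
    return ((desired + total_cores - 1) // total_cores) * total_cores

# Example: with 4 executors of 4 cores each, a desired batch of 100
# rounds up to 112 (the next multiple of 16).
print(min_valid_batch_size(100, 4, 4))  # -> 112
```

You would then pass the adjusted value as `batch_size` to `fit(..., distributed=True)`.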