intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

ResNet-50 training failed with engine mklblas on DCPMM #1034

Closed: ZeweiChen11 closed this issue 4 years ago

ZeweiChen11 commented 5 years ago

When trying to train the ResNet-50 example on Places365 with the mklblas engine, the Spark executors exit with an error like:

Job aborted due to stage failure: Task 5 in stage 11.1 failed 4 times, most recent failure: Lost task 5.3 in stage 11.1 (TID 15119, 192.168.75.3, executor 5): java.lang.AssertionError: assertion failed: Not enough memory to allocate: 197256216576 bytes!
    at scala.Predef$.assert(Predef.scala:170)
    at com.intel.analytics.zoo.feature.pmem.NativeArray.<init>(NativeArray.scala:69)
    at com.intel.analytics.zoo.feature.pmem.NativeVarLenArray.<init>(NativeVarLenArray.scala:26)
    at com.intel.analytics.zoo.feature.pmem.VarLenBytesArray.<init>(NativeByteArray.scala:22)
    at com.intel.analytics.zoo.feature.pmem.ImageFeatureConverter.toArray(FeatureSet.scala:129)
    at com.intel.analytics.zoo.feature.pmem.PmemFeatureSet$$anonfun$2.apply(FeatureSet.scala:192)
    at com.intel.analytics.zoo.feature.pmem.PmemFeatureSet$$anonfun$2.apply(FeatureSet.scala:187)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
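
For context, the assertion is thrown while PmemFeatureSet caches the training partitions into pmem NativeArray buffers (the com.intel.analytics.zoo.feature.pmem frames above). The following is only a minimal sketch of the kind of size guard that produces this message, not the actual NativeArray code; the availablePmemBytes bookkeeping is an assumption for illustration:

// Illustrative sketch only: a capacity guard run before a native
// (persistent-memory) allocation. availablePmemBytes is hypothetical.
class NativeArraySketch(recordSizes: Array[Int], availablePmemBytes: Long) {
  // Total native memory the partition's records would need.
  val totalBytes: Long = recordSizes.map(_.toLong).sum

  // The failing check: the whole partition must fit into the remaining
  // persistent-memory budget before the native allocation is attempted.
  assert(totalBytes <= availablePmemBytes,
    s"Not enough memory to allocate: $totalBytes bytes!")
}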

Driver log:

2019-07-09 21:56:36 INFO  DistriOptimizer$:784 - caching training rdd ...
2019-07-09 22:14:00 INFO  DistriOptimizer$:624 - Cache thread models...
2019-07-09 22:14:14 INFO  DistriOptimizer$:626 - Cache thread models... done
2019-07-09 22:14:14 INFO  DistriOptimizer$:154 - Count dataset
2019-07-09 23:41:45 INFO  DistriOptimizer$:158 - Count dataset complete. Time elapsed: 5250.809709913s
2019-07-09 23:41:45 INFO  DistriOptimizer$:166 - config  {
        computeThresholdbatchSize: 100
        maxDropPercentage: 0.0
        warmupIterationNum: 200
        isLayerwiseScaled: false
        dropPercentage: 0.0
 }
2019-07-09 23:41:45 INFO  DistriOptimizer$:170 - Shuffle data
2019-07-09 23:41:45 INFO  DistriOptimizer$:173 - Shuffle data complete. Takes 0.021088714s
2019-07-09 23:42:39 ERROR TaskSchedulerImpl:70 - Lost executor 1 on 192.168.75.2: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
2019-07-09 23:42:40 ERROR TaskSchedulerImpl:70 - Lost executor 7 on 192.168.75.1: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
2019-07-09 23:43:25 ERROR TaskSchedulerImpl:70 - Lost executor 3 on 192.168.75.4: worker lost
2019-07-09 23:43:25 ERROR TaskSchedulerImpl:70 - Lost executor 2 on 192.168.75.4: worker lost

Script:

spark-submit \
 --master spark://clx-node5:7077 \
 --executor-cores 18 \
 --total-executor-cores 144 \
 --executor-memory 90G \
 --driver-memory 30G \
 --conf spark.network.timeout=10000000 \
 --conf spark.executor.extraJavaOptions="-Dbigdl.engineType=mklblas" \
 --conf spark.driver.extraJavaOptions="-Dbigdl.engineType=mklblas" \
 --driver-class-path /root/analytics-zoo/dist/lib/analytics-zoo-bigdl_0.8.0-spark_2.1.0-0.6.0-SNAPSHOT-jar-with-dependencies.jar \
 --class com.intel.analytics.zoo.examples.resnet.TrainImageNet \
 /root/analytics-zoo/dist/lib/analytics-zoo-bigdl_0.8.0-spark_2.1.0-0.6.0-SNAPSHOT-jar-with-dependencies.jar \
 --batchSize 1440 \
 --nEpochs 1 --learningRate 0.1 --warmupEpoch 5 --maxLr 3.2 \
 --depth 50 --classes 365 \
 --cache ./models \
 --memoryType PMEM \
 -f  hdfs://clx-node5:8020/user/root/places365_challenge_seq
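
Reading the submit flags above, roughly (my arithmetic, assuming Spark packs the 144 total cores into 18-core executors):

// How the submit flags translate into per-executor work (rough arithmetic).
object SubmitMath extends App {
  val totalCores   = 144                        // --total-executor-cores
  val coresPerExec = 18                         // --executor-cores
  val executors    = totalCores / coresPerExec  // 8 executors
  val batchSize    = 1440                       // --batchSize
  val perExecutor  = batchSize / executors      // 180 samples per executor per iteration
  val perCore      = perExecutor / coresPerExec // 10 samples per core
  println(s"$executors executors, $perExecutor samples/executor, $perCore samples/core")
}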

Here there are 2 executors per node and 4 worker nodes in the cluster. Memory configuration per node: 192GB DRAM + 512GB DCPMM. We did not hit this error when using the mkldnn engine:

 --conf spark.executor.extraJavaOptions="-Dbigdl.engineType=mkldnn" \
 --conf spark.driver.extraJavaOptions="-Dbigdl.engineType=mkldnn" \
qiuxin2012 commented 5 years ago

Looks like losing the executors is caused by an OOM (java heap space); please check your executor logs.

ZeweiChen11 commented 5 years ago

The same scenario with ImageNet also reports an error in the executor:

“# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 8388608 bytes for committing reserved memory.” 

The executors are lost because of the error above. But why is the same error trace not reported regardless of the dataset?
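
For reference, converting the two failed allocation sizes (my own rough arithmetic, not from the logs): the ImageNet message is a native mmap failure rather than a Java heap OOM, and the request that fails there is tiny compared with the PMEM one above.

// Quick conversion of the two failed allocations, for comparison only.
object FailureSizes extends App {
  val pmemRequest = 197256216576L  // Places365 case: pmem NativeArray allocation
  val mmapRequest = 8388608L       // ImageNet case: JVM mmap while committing reserved memory
  println(s"PMEM request: ${pmemRequest / (1L << 30)} GiB")  // ~183 GiB
  println(s"mmap request: ${mmapRequest / (1L << 20)} MiB")  // 8 MiB
}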