intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

Only half of the total cores were utilized when training resnet-50 with MKL-DNN #1046

Closed nwang2 closed 3 years ago

nwang2 commented 5 years ago

When training resnet-50 with MKL-DNN, we find that only half of the total cores are used, no matter how many executor cores are assigned. This hurts performance on an HT-off system. On an HT-off system with 36 cores, when running the command below, we see that only 18 cores are utilized on each worker node.

spark-submit \
  --master spark://bdw:7077 \
  --executor-cores 36 \
  --total-executor-cores 144 \
  --executor-memory 180G \
  --driver-memory 30G \
  --conf spark.network.timeout=10000000 \
  --conf spark.executor.extraJavaOptions="-Dbigdl.engineType=mkldnn" \
  --conf spark.driver.extraJavaOptions="-Dbigdl.engineType=mkldnn" \
  --driver-class-path /root/analytics-zoo/dist/lib/analytics-zoo-bigdl_0.8.0-spark_2.1.0-0.6.0-SNAPSHOT-jar-with-dependencies.jar \
  --class com.intel.analytics.zoo.examples.resnet.TrainImageNet \
  /root/analytics-zoo/dist/lib/analytics-zoo-bigdl_0.8.0-spark_2.1.0-0.6.0-SNAPSHOT-jar-with-dependencies.jar \
  --batchSize 1440 \
  --nEpochs 5 --learningRate 0.1 --warmupEpoch 5 --maxLr 3.2 \
  --depth 50 --classes 365 \
  --cache ./models \
  --memoryType PMEM \
  -f hdfs://bdw:8020/user/root/places365_challenge_seq

jason-dai commented 5 years ago

@jenniew please take a look

qiuxin2012 commented 5 years ago

reference is typo...

qiyuangong commented 5 years ago

It seems the MKL engine is initialized with (coreNum / 2) as the total core number.

nwang2 commented 5 years ago

This issue only happens with the MKL-DNN engine. With MKL (inceptionv1 training), all cores can be used when HT is off.

jenniew commented 5 years ago

Yes, this issue occurs only with MKL-DNN. For the MKL-DNN engine type, the engine initialization sets mklNumThread to coreNumber/2. The MKL BLAS engine does not do this.
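To make the effect concrete, here is a minimal sketch of the halving described above, applied to the numbers from the reported command. It is a model of the behavior discussed in this thread, not BigDL's actual code: the function name `mkl_threads` is hypothetical, and the engine type string `mklblas` is an assumption (only `mkldnn` appears in the command above).

```python
def mkl_threads(core_number: int, engine_type: str) -> int:
    """Hypothetical model of the thread count described in this thread.

    Assumption: with the MKL-DNN engine the thread count is coreNumber / 2;
    with any other engine type all cores are used.
    """
    if engine_type == "mkldnn":
        return max(1, core_number // 2)
    return core_number


# Values taken from the spark-submit command in the report.
executor_cores = 36
total_executor_cores = 144
executors = total_executor_cores // executor_cores

for engine in ("mkldnn", "mklblas"):
    per_executor = mkl_threads(executor_cores, engine)
    print(engine, per_executor, per_executor * executors)
```

Under these assumptions, each 36-core executor runs 18 threads with MKL-DNN, so the four executors use 72 of the 144 requested cores. That matches the 18 busy cores per worker node reported on the HT-off machines.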

qiyuangong commented 4 years ago

This issue needs to be fixed in BigDL. It's a core-number-related misconfiguration on HT-enabled servers.
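As an illustration of why a fixed division by two only fits HT-enabled servers, the sketch below derives the physical core count from `lscpu`-style output by dividing the logical CPU count by the threads per core. This is not the BigDL fix. The helper `physical_cores` and both sample outputs are hypothetical, and only the `CPU(s)` and `Thread(s) per core` fields are assumed to be present.

```python
def physical_cores(lscpu_text: str) -> int:
    """Return logical CPUs divided by threads per core, parsed from lscpu-style text."""
    fields = {}
    for line in lscpu_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    logical = int(fields["CPU(s)"])
    threads_per_core = int(fields["Thread(s) per core"])
    return logical // threads_per_core


# Hypothetical sample outputs for the same 36-core machine.
HT_OFF = """CPU(s):              36
Thread(s) per core:  1
"""
HT_ON = """CPU(s):              72
Thread(s) per core:  2
"""

print(physical_cores(HT_OFF), physical_cores(HT_ON))
```

Both samples give 36 physical cores. Halving the logical count gives the same answer only in the HT-on case (72 / 2 = 36), while in the HT-off case it gives 18.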

helenlly commented 3 years ago

@nwang2 we'll close the issue since there have been no updates for a long time. If you have any questions, please feel free to re-open it. Thanks.