Closed nwang2 closed 3 years ago
@jenniew please take a look
The reference is a typo...
It seems the MKL engine initializes with (coreNum / 2) as the total core count.
This issue only happens on the MKL-DNN engine. For MKL (InceptionV1 training), all cores can be used if HT is off.
Yes, this issue is only on MKL-DNN. For the MKL-DNN engine type, it sets mklNumThread to coreNumber/2. MKL BLAS does not set this.
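The halving described above can be sketched as follows. This is a hypothetical illustration, not BigDL's actual code: the names `coreNumber`, `engineType`, and `mklNumThread` are assumptions. It shows why the MKL-DNN path leaves half the cores idle on an HT-off machine: dividing by two only recovers the physical core count when each core exposes two hardware threads.

```java
public class EngineSketch {
    // Hypothetical sketch of the engine-init logic (names are assumptions,
    // not BigDL's exact identifiers).
    static int mklNumThread(int coreNumber, String engineType) {
        if ("mkldnn".equals(engineType)) {
            // Implicitly assumes HT is on, i.e. coreNumber counts logical
            // CPUs at 2 threads per physical core.
            return coreNumber / 2;
        }
        // MKL BLAS path: no halving, all cores are used.
        return coreNumber;
    }
}
```

On the 36-core HT-off machine from the report, `mklNumThread(36, "mkldnn")` yields 18, matching the observed utilization.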
This issue needs to be fixed in BigDL. It is a core-count misconfiguration on HT-enabled servers.
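One possible shape for such a fix, sketched under assumptions: instead of always dividing by two, divide by the actual number of hardware threads per core (which would have to be probed from the OS, e.g. via lscpu on Linux; here it is passed in as a parameter for illustration). The name `mklNumThreadFixed` is hypothetical.

```java
public class HtAwareSketch {
    // Hypothetical fix sketch: only discount hyper-threads when they exist.
    // threadsPerCore = 1 on an HT-off system, 2 with HT enabled.
    static int mklNumThreadFixed(int logicalCores, int threadsPerCore) {
        // Guard against a bad probe result of 0.
        return logicalCores / Math.max(threadsPerCore, 1);
    }
}
```

With this logic, an HT-off 36-core node keeps all 36 threads, while an HT-on node with 72 logical CPUs still gets 36, preserving the current behavior where it is correct.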
@nwang2 we'll close the issue since there have been no updates for a long time. If you have any questions, please feel free to re-open it, thanks.
When training ResNet-50 with MKL-DNN, we found that only half of the total cores were used, no matter how many executor cores were assigned. This impacts performance on an HT-off system. On an HT-off system with 36 cores, when running the command line below, we see only 18 cores utilized on each worker node.
```shell
spark-submit \
  --master spark://bdw:7077 \
  --executor-cores 36 \
  --total-executor-cores 144 \
  --executor-memory 180G \
  --driver-memory 30G \
  --conf spark.network.timeout=10000000 \
  --conf spark.executor.extraJavaOptions="-Dbigdl.engineType=mkldnn" \
  --conf spark.driver.extraJavaOptions="-Dbigdl.engineType=mkldnn" \
  --driver-class-path /root/analytics-zoo/dist/lib/analytics-zoo-bigdl_0.8.0-spark_2.1.0-0.6.0-SNAPSHOT-jar-with-dependencies.jar \
  --class com.intel.analytics.zoo.examples.resnet.TrainImageNet \
  /root/analytics-zoo/dist/lib/analytics-zoo-bigdl_0.8.0-spark_2.1.0-0.6.0-SNAPSHOT-jar-with-dependencies.jar \
  --batchSize 1440 \
  --nEpochs 5 --learningRate 0.1 --warmupEpoch 5 --maxLr 3.2 \
  --depth 50 --classes 365 \
  --cache ./models \
  --memoryType PMEM \
  -f hdfs://bdw:8020/user/root/places365_challenge_seq
```