PMEM-SPILL of OAP-1.1 has performance regression compared with OAP-1.1.1

haojinIntel commented 3 years ago

We ran the K-means and SVM workloads with the same configuration. The cluster contains 3 workers, each with 1 TB of PMEM. We see a 12.9% performance regression when running SVM at the 1.2 TB scale and a 28.6% regression when running K-means at 500 GB. The Spark configuration used for the SVM run is shown below:

spark.memory.pmem.extension.enabled true
hibench.streambench.spark.checkpointPath /var/tmp
spark.storage.unrollMemoryThreshold 1048576
hibench.streambench.spark.receiverNumber 4
spark.yarn.historyServer.address vsr219:18080
spark.memory.pmem.initial.size 450GB
hibench.yarn.executor.cores 45
spark.executor.memory 90g
hibench.streambench.spark.useDirectMode true
spark.eventLog.dir hdfs://vsr219:9000/spark-history-server
spark.driver.memory 10g
spark.eventLog.enabled true
spark.memory.spill.pmem.enabled false
spark.driver.extraClassPath /opt/Beaver/OAP/oap_jar/pmem-rdd-cache-1.1.1-with-spark-3.1.1.jar:/opt/Beaver/OAP/oap_jar/pmem-common-1.1.1-with-spark-3.1.1.jar
spark.kryo.unsafe true
hibench.yarn.executor.num 6
spark.history.fs.logDirectory hdfs://vsr219:9000/spark-history-server
spark.files /opt/Beaver/OAP/oap_jar/pmem-rdd-cache-1.1.1-with-spark-3.1.1.jar,/opt/Beaver/OAP/oap_jar/pmem-common-1.1.1-with-spark-3.1.1.jar
spark.executor.extraClassPath ./pmem-rdd-cache-1.1.1-with-spark-3.1.1.jar:./pmem-common-1.1.1-with-spark-3.1.1.jar
spark.history.fs.cleaner.enabled true
spark.default.parallelism ${hibench.default.map.parallelism}
spark.serializer.bufferedInputStreamSize 4096
hibench.streambench.spark.storageLevel 2
hibench.streambench.spark.batchInterval 100
hibench.spark.master yarn
spark.sql.shuffle.partitions 200
spark.history.ui.port 18080
hibench.spark.home /opt/Beaver/spark
spark.sql.warehouse.dir hdfs://vsr219:9000/spark-warehouse
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.memory.pmem.initial.path /mnt/pmem0,/mnt/pmem1
hibench.streambench.spark.enableWAL false
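
To make the resource footprint of the configuration above easier to read, here is a minimal Python sketch that parses a few of the quoted settings and derives the aggregate executor resources. The helper names (`parse_conf`, `size_in_gb`, `summarize`) are hypothetical and not part of OAP or HiBench, and the even spread of executors across the 3 workers is an assumption, since YARN decides the actual placement.

```python
import re

# Subset of the settings quoted in the issue, one "key value" pair per line.
CONF_TEXT = """\
hibench.yarn.executor.num 6
hibench.yarn.executor.cores 45
spark.executor.memory 90g
"""

WORKERS = 3  # stated in the issue description


def parse_conf(text):
    """Parse whitespace-separated 'key value' lines into a dict."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        conf[key] = value.strip()
    return conf


def size_in_gb(value):
    """Convert sizes such as '90g' or '450GB' to an integer number of GB."""
    match = re.fullmatch(r"(\d+)\s*gb?", value.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unsupported size: {value!r}")
    return int(match.group(1))


def summarize(conf, workers):
    """Derive aggregate executor resources from the parsed settings."""
    executors = int(conf["hibench.yarn.executor.num"])
    cores = int(conf["hibench.yarn.executor.cores"])
    memory_gb = size_in_gb(conf["spark.executor.memory"])
    return {
        "executors": executors,
        "total_cores": executors * cores,
        "total_executor_memory_gb": executors * memory_gb,
        # Assumption: executors are spread evenly across the workers.
        "executors_per_worker": executors / workers,
    }


summary = summarize(parse_conf(CONF_TEXT), WORKERS)
print(summary)
```

Under these assumptions the run uses 6 executors with 45 cores each, that is, two large executors per worker.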

haojinIntel commented 3 years ago

@zhixingheyi-tian @yma11 @winningsix @yeyuqiang Please help track down this performance issue.

winningsix commented 3 years ago

@haojinIntel Thanks for opening this ticket. @yma11, could this be related to your code refactor? I haven't come up with any ideas about what is causing it.

yeyuqiang commented 3 years ago

This needs to be run with multiple executors to avoid slow task scheduling in Spark 3.1.1.
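
The thread does not say which executor layout was used after this suggestion. As a rough illustration only, the Python sketch below enumerates ways to split the same per-worker budget into more, smaller executors. The per-worker budget of 90 cores and 180 GB is an assumption derived from the posted settings (6 executors of 45 cores and 90g over 3 workers, spread evenly). The function name `executor_layouts` is hypothetical, and memory overhead is ignored.

```python
def executor_layouts(cores_per_worker, memory_gb_per_worker, workers):
    """List layouts that split a worker's cores and memory evenly.

    Only executor counts that divide both the core and memory budgets
    are kept, so every layout uses the same total resources.
    """
    layouts = []
    for per_worker in range(1, cores_per_worker + 1):
        if cores_per_worker % per_worker or memory_gb_per_worker % per_worker:
            continue
        layouts.append({
            "executors_per_worker": per_worker,
            "total_executors": per_worker * workers,
            "cores_per_executor": cores_per_worker // per_worker,
            "memory_gb_per_executor": memory_gb_per_worker // per_worker,
        })
    return layouts


# Assumed budget: 2 executors x 45 cores and 2 x 90 GB per worker, 3 workers.
layouts = executor_layouts(cores_per_worker=90, memory_gb_per_worker=180, workers=3)

for layout in layouts:
    print(layout)
```

Each printed row maps onto `hibench.yarn.executor.num`, `hibench.yarn.executor.cores`, and `spark.executor.memory`. Which row actually avoids the slow scheduling would have to be measured on the cluster.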

oap-project / pmem-spill #43