oap-project / gazelle_plugin

Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations.
Apache License 2.0

TPC-DS q23a.sql fails in the next few rounds when we run the power test for 20 rounds with the master branch. #412

Open haojinIntel opened 3 years ago

haojinIntel commented 3 years ago

We ran the TPC-DS power test for 20 rounds to verify the stability of gazelle_plugin. The cluster contains 3 workers, each with 512 GB DRAM, and the data scale is 1.5 TB. q23a.sql fails in the next few rounds, which causes the Thrift server to fail. The error message is shown below:


. . . . . . . . . . . . . . > Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 5112 (run at AccessController.java:0) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: /disk/1/data/nm/usercache/root/appcache/application_1626231515702_0002/blockmgr-01d2a6f0-a284-495d-8ae8-a7e1e619cf41/09/shuffle_1368_867103_0.index    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:770)        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:685)     at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)      at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)       at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)       at com.intel.oap.expression.ColumnarSorter$$anon$1.hasNext(ColumnarSorter.scala:143)    at com.intel.oap.vectorized.CloseableColumnBatchIterator.hasNext(CloseableColumnBatchIterator.scala:47)         at com.intel.oap.execution.ColumnarWholeStageCodegenExec.$anonfun$doExecuteColumnar$5(ColumnarWholeStageCodegenExec.scala:408)  at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:106)    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)   at org.apache.spark.scheduler.Task.run(Task.scala:131)  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)      at java.lang.Thread.run(Thread.java:745) Caused by: java.nio.file.NoSuchFileException: /disk/1/data/nm/usercache/root/appcache/application_1626231515702_0002/blockmgr-01d2a6f0-a284-495d-8ae8-a7e1e619cf41/09/shuffle_1368_867103_0.index      at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)       at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)    at java.nio.file.Files.newByteChannel(Files.java:361)   at java.nio.file.Files.newByteChannel(Files.java:407)   at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:351)         at org.apache.spark.storage.BlockManager.getHostLocalShuffleData(BlockManager.scala:629)        at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchHostLocalBlock(ShuffleBlockFetcherIterator.scala:450)      at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$2(ShuffleBlockFetcherIterator.scala:529)  at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$2$adapted(ShuffleBlockFetcherIterator.scala:528)  at scala.collection.LinearSeqOptimized.forall(LinearSeqOptimized.scala:85)      at scala.collection.LinearSeqOptimized.forall$(LinearSeqOptimized.scala:82)     at scala.collection.immutable.List.forall(List.scala:89)        at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$1(ShuffleBlockFetcherIterator.scala:528)  at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$1$adapted(ShuffleBlockFetcherIterator.scala:527)  at scala.collection.Iterator.forall(Iterator.scala:953)         at scala.collection.Iterator.forall$(Iterator.scala:951)        at scala.collection.AbstractIterator.forall(Iterator.scala:1429)        at scala.collection.IterableLike.forall(IterableLike.scala:77)  at scala.collection.IterableLike.forall$(IterableLike.scala:76)         at scala.collection.AbstractIterable.forall(Iterable.scala:56)  at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchMultipleHostLocalBlocks(ShuffleBlockFetcherIterator.scala:527)     at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchHostLocalBlocks(ShuffleBlockFetcherIterator.scala:516)     at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$initialize$4(ShuffleBlockFetcherIterator.scala:562)    at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$initialize$4$adapted(ShuffleBlockFetcherIterator.scala:562)    at scala.Option.foreach(Option.scala:407)       at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:562)       at org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:171)   at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)      at org.apache.spark.sql.execution.ShuffledColumnarBatchRDD.compute(ShuffledColumnarBatchRDD.scala:128)  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     ... 28 more
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:361)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:263)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
        at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:43)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:263)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:258)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:272)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 5112 (run at AccessController.java:0) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: /disk/1/data/nm/usercache/root/appcache/application_1626231515702_0002/blockmgr-01d2a6f0-a284-495d-8ae8-a7e1e619cf41/09/shuffle_1368_867103_0.index         at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:770)        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:685)     at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)      at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)   at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)       at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)       at com.intel.oap.expression.ColumnarSorter$$anon$1.hasNext(ColumnarSorter.scala:143)    at com.intel.oap.vectorized.CloseableColumnBatchIterator.hasNext(CloseableColumnBatchIterator.scala:47)         at com.intel.oap.execution.ColumnarWholeStageCodegenExec.$anonfun$doExecuteColumnar$5(ColumnarWholeStageCodegenExec.scala:408)  at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:106)    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)   at org.apache.spark.scheduler.Task.run(Task.scala:131)  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)      at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)      at java.lang.Thread.run(Thread.java:745) Caused by: java.nio.file.NoSuchFileException: /disk/1/data/nm/usercache/root/appcache/application_1626231515702_0002/blockmgr-01d2a6f0-a284-495d-8ae8-a7e1e619cf41/09/shuffle_1368_867103_0.index      at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)       at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)    at java.nio.file.Files.newByteChannel(Files.java:361)   at java.nio.file.Files.newByteChannel(Files.java:407)   at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:351)         at org.apache.spark.storage.BlockManager.getHostLocalShuffleData(BlockManager.scala:629)        at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchHostLocalBlock(ShuffleBlockFetcherIterator.scala:450)      at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$2(ShuffleBlockFetcherIterator.scala:529)  at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$2$adapted(ShuffleBlockFetcherIterator.scala:528)  at scala.collection.LinearSeqOptimized.forall(LinearSeqOptimized.scala:85)      at scala.collection.LinearSeqOptimized.forall$(LinearSeqOptimized.scala:82)     at scala.collection.immutable.List.forall(List.scala:89)        at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$1(ShuffleBlockFetcherIterator.scala:528)  at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$fetchMultipleHostLocalBlocks$1$adapted(ShuffleBlockFetcherIterator.scala:527)  at scala.collection.Iterator.forall(Iterator.scala:953)         at scala.collection.Iterator.forall$(Iterator.scala:951)        at scala.collection.AbstractIterator.forall(Iterator.scala:1429)        at scala.collection.IterableLike.forall(IterableLike.scala:77)  at scala.collection.IterableLike.forall$(IterableLike.scala:76)         at scala.collection.AbstractIterable.forall(Iterable.scala:56)  at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchMultipleHostLocalBlocks(ShuffleBlockFetcherIterator.scala:527)     at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchHostLocalBlocks(ShuffleBlockFetcherIterator.scala:516)     at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$initialize$4(ShuffleBlockFetcherIterator.scala:562)    at org.apache.spark.storage.ShuffleBlockFetcherIterator.$anonfun$initialize$4$adapted(ShuffleBlockFetcherIterator.scala:562)    at scala.Option.foreach(Option.scala:407)       at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:562)       at org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:171)   at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)      at org.apache.spark.sql.execution.ShuffledColumnarBatchRDD.compute(ShuffledColumnarBatchRDD.scala:128)  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)      at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:337)     ... 28 more
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1763)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2437)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) (state=,code=0)
Closing: 0: jdbc:hive2://vsr215:10000

Our test passes when the latest commit is "e4782f0aaa6cee899bed5a3cb1d9ba2b2c219461".
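
For context, the failing fetch in the trace goes through Spark's host-local shuffle read path (BlockManager.getHostLocalShuffleData), and the stage is aborted after hitting the retry limit ("failed the maximum allowable number of times: 4"). A minimal sketch of the standard Spark settings that govern that path and that limit is below; the values are illustrative assumptions, not a confirmed workaround for this issue.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only -- placeholder values, not a verified fix for this report.
// spark.shuffle.readHostLocalDisk controls the host-local fetch path seen in
// the trace; setting it to false makes those blocks go over the network
// instead of being read directly from local disk.
// spark.stage.maxConsecutiveAttempts is the "maximum allowable number of
// times: 4" limit referenced in the stage failure message.
val spark = SparkSession.builder()
  .appName("tpcds-power-test") // hypothetical app name
  .config("spark.shuffle.readHostLocalDisk", "false")
  .config("spark.stage.maxConsecutiveAttempts", "8")
  .getOrCreate()
```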

haojinIntel commented 3 years ago

@zhouyuan @zhixingheyi-tian Please help track this issue. Thanks!

zhixingheyi-tian commented 3 years ago

cc @weiting-chen

haojinIntel commented 3 years ago

If we use another cluster with 1 master and 3 workers, each worker with 384 GB DRAM, q23a.sql fails in the 1st round. This is the tpcds-kit version we used: https://github.com/davies/tpcds-kit.git.

zhouyuan commented 3 years ago

@haojinIntel Thanks for reporting. The log shows some containers got killed during the tests, probably due to a memory issue.

This is due to the large memory footprint and the lack of spill support. We have two pending PRs which should fix this:
https://github.com/oap-project/gazelle_plugin/pull/387
https://github.com/oap-project/gazelle_plugin/pull/369
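
Until spill support lands, the off-heap footprint is bounded only by the memory settings passed to Spark. A rough sketch of the standard knobs involved is below; the plugin keeps its columnar (Arrow) buffers off-heap, and the sizes here are placeholders to be tuned per worker, not the configuration used in this report.

```scala
import org.apache.spark.sql.SparkSession

// Sketch with assumed placeholder sizes -- size these for the actual
// per-worker DRAM (512 GB in the original cluster, 384 GB in the second).
val spark = SparkSession.builder()
  .appName("tpcds-power-test")                    // hypothetical app name
  .config("spark.memory.offHeap.enabled", "true") // columnar buffers are allocated off-heap
  .config("spark.memory.offHeap.size", "80g")     // placeholder off-heap pool size
  .config("spark.executor.memoryOverhead", "20g") // placeholder overhead for native allocations
  .getOrCreate()
```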