prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.06k stars 5.38k forks source link

Flaky Test: TestHiveHistoryBasedStatsTracking.testHistoryBasedStatsCalculator #22339

Open ZacBlanco opened 7 months ago

ZacBlanco commented 7 months ago

Test failed in an unrelated PR: TestHiveHistoryBasedStatsTracking.testHistoryBasedStatsCalculator

Test Stacktrace

``` 2024-03-26T17:16:51.9977164Z [ERROR] Tests run: 2955, Failures: 1, Errors: 0, Skipped: 90, Time elapsed: 2,672.363 s <<< FAILURE! - in TestSuite 2024-03-26T17:16:51.9979666Z [ERROR] com.facebook.presto.hive.TestHiveHistoryBasedStatsTracking.testHistoryBasedStatsCalculator Time elapsed: 0.35 s <<< FAILURE! 2024-03-26T17:16:51.9981369Z java.lang.AssertionError: 2024-03-26T17:16:51.9981984Z Plan does not match, expected [ 2024-03-26T17:16:51.9982468Z 2024-03-26T17:16:51.9982757Z - anyTree 2024-03-26T17:16:51.9983280Z - node(ProjectNode) 2024-03-26T17:16:51.9983869Z expectedOutputRowCount(48.0) 2024-03-26T17:16:51.9984551Z - node 2024-03-26T17:16:51.9984897Z 2024-03-26T17:16:51.9985142Z ] but found [ 2024-03-26T17:16:51.9985460Z 2024-03-26T17:16:51.9987228Z - Output[PlanNodeId 6][orderkey, orderpriority, comment, custkey, ds, ts, _col6] => [orderkey:bigint, orderpriority:varchar(15), comment:varchar(79), custkey:bigint, ds:varchar(10), ts:varchar(5), expr_5:integer] 2024-03-26T17:16:51.9989485Z Estimates: {source: HistoryBasedSourceInfo, rows: 48 (5.32kB), cpu: ?, memory: ?, network: ?} 2024-03-26T17:16:51.9990544Z _col6 := expr_5 (1:11) 2024-03-26T17:16:51.9992398Z - RemoteStreamingExchange[PlanNodeId 241][GATHER] => [orderkey:bigint, orderpriority:varchar(15), comment:varchar(79), custkey:bigint, ds:varchar(10), ts:varchar(5), expr_5:integer] 2024-03-26T17:16:51.9994478Z Estimates: {source: CostBasedSourceInfo, rows: 223 (33.49kB), cpu: ?, memory: ?, network: ?} 2024-03-26T17:16:51.9999317Z - ScanProject[PlanNodeId 0,162][table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpch, tableName=test_orders, analyzePartitionValues=Optional.empty}', layout='Optional[tpch.test_orders{filter=(substr(orderpriority, BIGINT'1', BIGINT'1')) = (VARCHAR'1')}]'}, projectLocality = LOCAL] => [orderkey:bigint, orderpriority:varchar(15), comment:varchar(79), custkey:bigint, ds:varchar(10), ts:varchar(5), expr_5:integer] 2024-03-26T17:16:52.0003820Z Estimates: {source: CostBasedSourceInfo, rows: 223 (33.49kB), cpu: ?, memory: ?, network: ?}/{source: CostBasedSourceInfo, rows: 223 (33.49kB), cpu: ?, memory: ?, network: ?} 2024-03-26T17:16:52.0005475Z ds := VARCHAR'2020-09-02' 2024-03-26T17:16:52.0006266Z expr_5 := INTEGER'2' 2024-03-26T17:16:52.0007783Z LAYOUT: tpch.test_orders{filter=(substr(orderpriority, BIGINT'1', BIGINT'1')) = (VARCHAR'1')} 2024-03-26T17:16:52.0009075Z orderpriority := orderpriority:varchar(15):1:REGULAR (1:18) 2024-03-26T17:16:52.0010041Z orderkey := orderkey:bigint:0:REGULAR (1:18) 2024-03-26T17:16:52.0010875Z custkey := custkey:bigint:3:REGULAR (1:18) 2024-03-26T17:16:52.0011798Z ts := ts:varchar(5):-14:PARTITION_KEY (1:18) 2024-03-26T17:16:52.0012557Z :: [["00:02"]] 2024-03-26T17:16:52.0013251Z comment := comment:varchar(79):2:REGULAR (1:18) 2024-03-26T17:16:52.0014130Z ds:varchar(10):-13:PARTITION_KEY 2024-03-26T17:16:52.0014910Z :: [["2020-09-02"]] 2024-03-26T17:16:52.0015375Z 2024-03-26T17:16:52.0015593Z ] 2024-03-26T17:16:52.0016524Z at com.facebook.presto.sql.planner.assertions.PlanAssert.assertPlan(PlanAssert.java:57) 2024-03-26T17:16:52.0018140Z at com.facebook.presto.sql.planner.assertions.PlanAssert.assertPlan(PlanAssert.java:41) 2024-03-26T17:16:52.0020234Z at com.facebook.presto.tests.AbstractTestQueryFramework.lambda$assertPlan$7(AbstractTestQueryFramework.java:461) 2024-03-26T17:16:52.0022044Z at com.facebook.presto.transaction.TransactionBuilder.execute(TransactionBuilder.java:151) 2024-03-26T17:16:52.0023840Z at com.facebook.presto.tests.AbstractTestQueryFramework.assertPlan(AbstractTestQueryFramework.java:459) 2024-03-26T17:16:52.0025830Z at com.facebook.presto.tests.AbstractTestQueryFramework.assertPlan(AbstractTestQueryFramework.java:451) 2024-03-26T17:16:52.0027986Z at com.facebook.presto.hive.TestHiveHistoryBasedStatsTracking.assertPlan(TestHiveHistoryBasedStatsTracking.java:264) 2024-03-26T17:16:52.0030452Z at com.facebook.presto.hive.TestHiveHistoryBasedStatsTracking.testHistoryBasedStatsCalculator(TestHiveHistoryBasedStatsTracking.java:106) 2024-03-26T17:16:52.0032409Z at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2024-03-26T17:16:52.0033725Z at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 2024-03-26T17:16:52.0035309Z at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2024-03-26T17:16:52.0036578Z at java.lang.reflect.Method.invoke(Method.java:498) 2024-03-26T17:16:52.0037964Z at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:135) 2024-03-26T17:16:52.0039592Z at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:673) 2024-03-26T17:16:52.0041072Z at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:220) 2024-03-26T17:16:52.0042577Z at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50) 2024-03-26T17:16:52.0044125Z at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:945) 2024-03-26T17:16:52.0045717Z at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:193) 2024-03-26T17:16:52.0047582Z at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146) 2024-03-26T17:16:52.0049185Z at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128) 2024-03-26T17:16:52.0050578Z at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 2024-03-26T17:16:52.0051994Z at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 2024-03-26T17:16:52.0053352Z at java.lang.Thread.run(Thread.java:750) 2024-03-26T17:16:52.0053887Z 2024-03-26T17:16:52.4747139Z [INFO] 2024-03-26T17:16:52.4752396Z [INFO] Results: 2024-03-26T17:16:52.4752991Z [INFO] 2024-03-26T17:16:52.4753363Z [ERROR] Failures: 2024-03-26T17:16:52.4756218Z [ERROR] TestHiveHistoryBasedStatsTracking.testHistoryBasedStatsCalculator:106->assertPlan:264->AbstractTestQueryFramework.assertPlan:451->AbstractTestQueryFramework.assertPlan:459->AbstractTestQueryFramework.lambda$assertPlan$7:461 Plan does not match, expected [ 2024-03-26T17:16:52.4758515Z 2024-03-26T17:16:52.4758711Z - anyTree 2024-03-26T17:16:52.4759151Z - node(ProjectNode) 2024-03-26T17:16:52.4759680Z expectedOutputRowCount(48.0) 2024-03-26T17:16:52.4760268Z - node 2024-03-26T17:16:52.4760512Z 2024-03-26T17:16:52.4760669Z ] but found [ 2024-03-26T17:16:52.4760896Z 2024-03-26T17:16:52.4762559Z - Output[PlanNodeId 6][orderkey, orderpriority, comment, custkey, ds, ts, _col6] => [orderkey:bigint, orderpriority:varchar(15), comment:varchar(79), custkey:bigint, ds:varchar(10), ts:varchar(5), expr_5:integer] 2024-03-26T17:16:52.4764742Z Estimates: {source: HistoryBasedSourceInfo, rows: 48 (5.32kB), cpu: ?, memory: ?, network: ?} 2024-03-26T17:16:52.4765718Z _col6 := expr_5 (1:11) 2024-03-26T17:16:52.4771154Z - RemoteStreamingExchange[PlanNodeId 241][GATHER] => [orderkey:bigint, orderpriority:varchar(15), comment:varchar(79), custkey:bigint, ds:varchar(10), ts:varchar(5), expr_5:integer] 2024-03-26T17:16:52.4773139Z Estimates: {source: CostBasedSourceInfo, rows: 223 (33.49kB), cpu: ?, memory: ?, network: ?} 2024-03-26T17:16:52.4777946Z - ScanProject[PlanNodeId 0,162][table = TableHandle {connectorId='hive', connectorHandle='HiveTableHandle{schemaName=tpch, tableName=test_orders, analyzePartitionValues=Optional.empty}', layout='Optional[tpch.test_orders{filter=(substr(orderpriority, BIGINT'1', BIGINT'1')) = (VARCHAR'1')}]'}, projectLocality = LOCAL] => [orderkey:bigint, orderpriority:varchar(15), comment:varchar(79), custkey:bigint, ds:varchar(10), ts:varchar(5), expr_5:integer] 2024-03-26T17:16:52.4782420Z Estimates: {source: CostBasedSourceInfo, rows: 223 (33.49kB), cpu: ?, memory: ?, network: ?}/{source: CostBasedSourceInfo, rows: 223 (33.49kB), cpu: ?, memory: ?, network: ?} 2024-03-26T17:16:52.4784088Z ds := VARCHAR'2020-09-02' 2024-03-26T17:16:52.4784823Z expr_5 := INTEGER'2' 2024-03-26T17:16:52.4785598Z LAYOUT: tpch.test_orders{filter=(substr(orderpriority, BIGINT'1', BIGINT'1')) = (VARCHAR'1')} 2024-03-26T17:16:52.4786815Z orderpriority := orderpriority:varchar(15):1:REGULAR (1:18) 2024-03-26T17:16:52.4787678Z orderkey := orderkey:bigint:0:REGULAR (1:18) 2024-03-26T17:16:52.4788415Z custkey := custkey:bigint:3:REGULAR (1:18) 2024-03-26T17:16:52.4789229Z ts := ts:varchar(5):-14:PARTITION_KEY (1:18) 2024-03-26T17:16:52.4789872Z :: [["00:02"]] 2024-03-26T17:16:52.4790466Z comment := comment:varchar(79):2:REGULAR (1:18) 2024-03-26T17:16:52.4791266Z ds:varchar(10):-13:PARTITION_KEY 2024-03-26T17:16:52.4791964Z :: [["2020-09-02"]] 2024-03-26T17:16:52.4792341Z 2024-03-26T17:16:52.4792478Z ] 2024-03-26T17:16:52.4792793Z [INFO] 2024-03-26T17:16:52.4793305Z [ERROR] Tests run: 2955, Failures: 1, Errors: 0, Skipped: 90 2024-03-26T17:16:52.4793995Z [INFO] 2024-03-26T17:16:52.4829455Z [INFO] ------------------------------------------------------------------------ 2024-03-26T17:16:52.4830220Z [INFO] BUILD FAILURE 2024-03-26T17:16:52.4830977Z [INFO] ------------------------------------------------------------------------ 2024-03-26T17:16:52.4839590Z [INFO] Total time: 44:37 min 2024-03-26T17:16:52.4845763Z [INFO] Finished at: 2024-03-26T17:16:52Z 2024-03-26T17:16:52.4846595Z [INFO] ------------------------------------------------------------------------ ```

jaystarshot commented 7 months ago

Just FYI: We have a similar test which uses is based on the same test case https://github.com/prestodb/presto/blob/master/redis-hbo-provider/src/test/java/com/facebook/presto/statistic/TestHistoryBasedRedisStatisticsTracking.java but this is not flaky. I think this is because we use a test redis client for this test instead of the inmemory handling done. If needed we can combine these 2 and move to presto-tests

rschlussel commented 7 months ago

I looked into https://github.com/prestodb/presto/issues/22204 and https://github.com/prestodb/presto/issues/22118 during the hackathon, and had a very difficult time reproducing the issue (managed once out of >30k runs).

my suspicion is actually that it's a very slightly flaky feature rather than a flaky test, though i would expect this to affect redis just as much unless it's run in a different test job that happens to encounter this issue less. Basically, I suspect that sometimes the runtime stats from certain nodes from certain operators don't get reported to the coordinator. I only have the test logs to go on, but you can see in this stack trace in this issue that the output node is using hbo stats with the correct output row count, but the test is testing the fliter row count and here the tablescan and filter nodes are using cbo stats. In the one time i was able to reproduce the issue locally https://github.com/prestodb/presto/issues/22204#issuecomment-2012554002, we had cbo stats for the scan, and hbo for the filter and output. the hbo stats were 0 for the filter node, but the output node had correct hbo stats.