pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0
937 stars 409 forks source link

Random window function incorrect results in endless tests #9109

Closed yibin87 closed 1 month ago

yibin87 commented 1 month ago

Test run log: [2024/05/23 16:29:56.352 +08:00] [INFO] [window_function_test.go:511] ["exec sql"] [sql="set @@session.tidb_allow_mpp=on;set @@session.tidb_enforce_mpp=1;set @@tiflash_fine_grained_shuffle_stream_count = 0;set @@tiflash_fine_grained_shuffle_batch_size = 1;select lag(r) over w1, count(s) from window_function_0_part group by r having r > 10 window w1 as (partition by i, s, i order by id, i, r, s, dt) order by 1, 2 limit 71 offset 1;"] [2024/05/23 16:29:56.423 +08:00] [ERROR] [window_function_test.go:165] ["windowFunctionTest error"] [error="result inconsistent tso 449957854275436545, difference: row number: 1, row1: 21474836462250, row2: 21474836472250, tikv: [21474836462250 21474836472250], tiflash: [21474836472250 NULL2250], onlyKV: [], onlyFlash: []"] [stack="github.com/pingcap/endless/testcase/tiflash/window_function.checkErr\n\t/home/jenkins/agent/workspace/endless-master-build/testcase/tiflash/window_function/window_function_test.go:165\ngithub.com/pingcap/endless/testcase/tiflash/window_function.glob..func1.2.2\n\t/home/jenkins/agent/workspace/endless-master-build/testcase/tiflash/window_function/window_function_test.go:70\ngithub.com/onsi/ginkgo/internal/leafnodes.(runner).runSync\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/leafnodes/runner.go:113\ngithub.com/onsi/ginkgo/internal/leafnodes.(runner).run\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/leafnodes/runner.go:64\ngithub.com/onsi/ginkgo/internal/leafnodes.(ItNode).Run\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/leafnodes/it_node.go:26\ngithub.com/onsi/ginkgo/internal/spec.(Spec).runSample\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/spec/spec.go:215\ngithub.com/onsi/ginkgo/internal/spec.(Spec).Run\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/spec/spec.go:138\ngithub.com/onsi/ginkgo/internal/specrunner.(SpecRunner).runSpec\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/specrunner/spec_runner.go:200\ngithub.com/onsi/ginkgo/internal/specrunner.(SpecRunner).runSpecs\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/specrunner/spec_runner.go:170\ngithub.com/onsi/ginkgo/internal/specrunner.(SpecRunner).Run\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/specrunner/spec_runner.go:66\ngithub.com/onsi/ginkgo/internal/suite.(*Suite).Run\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/suite/suite.go:79\ngithub.com/onsi/ginkgo.runSpecsWithCustomReporters\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/ginkgo_dsl.go:245\ngithub.com/onsi/ginkgo.RunSpecs\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/ginkgo_dsl.go:220\ngithub.com/pingcap/endless/testcase/tiflash/window_function.TestWindowFunction\n\t/home/jenkins/agent/workspace/endless-master-build/testcase/tiflash/window_function/window_function_suite_test.go:25\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1446"]

yibin87 commented 1 month ago

Full test log: window_incorrect.log

zanmato1984 commented 1 month ago

It's a bit corner and hard to reproduce. Lower to major.

windtalker commented 1 month ago

This is not a bug, the root cause is the test result is not stable The error query can be simplified as

select lag(r) over w1 as xx, count(s) from window_function_0_part group by r having r > 10 window w1 as (partition by i, s, i order by id, i, r, s, dt);

The plan is

mysql> explain select lag(r) over w1 as xx, count(s) from window_function_0_part group by r having r > 10 window w1 as (partition by i, s, i order by id, i, r, s, dt);
+----------------------------------------------+---------+--------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                                           | estRows | task         | access object                | operator info                                                                                                                                                                                                                                                                                                                                                                                                   |
+----------------------------------------------+---------+--------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| TableReader_64                               | 2.81    | root         | partition:all                | MppVersion: 2, data:ExchangeSender_63                                                                                                                                                                                                                                                                                                                                                                           |
| └─ExchangeSender_63                          | 2.81    | mpp[tiflash] |                              | ExchangeType: PassThrough                                                                                                                                                                                                                                                                                                                                                                                       |
|   └─Projection_10                            | 2.81    | mpp[tiflash] |                              | Column#8->Column#9, Column#6->Column#10, stream_count: 8                                                                                                                                                                                                                                                                                                                                                        |
|     └─Window_62                              | 2.81    | mpp[tiflash] |                              | lag(test.window_function_0_part.r)->Column#8 over(partition by test.window_function_0_part.i, test.window_function_0_part.s, test.window_function_0_part.i order by test.window_function_0_part.id, test.window_function_0_part.i, test.window_function_0_part.r, test.window_function_0_part.s, test.window_function_0_part.dt), stream_count: 8                                                               |
|       └─Sort_31                              | 2.81    | mpp[tiflash] |                              | test.window_function_0_part.i, test.window_function_0_part.s, test.window_function_0_part.i, test.window_function_0_part.id, test.window_function_0_part.i, test.window_function_0_part.r, test.window_function_0_part.s, test.window_function_0_part.dt, stream_count: 8                                                                                                                                       |
|         └─ExchangeReceiver_30                | 2.81    | mpp[tiflash] |                              | stream_count: 8                                                                                                                                                                                                                                                                                                                                                                                                 |
|           └─ExchangeSender_29                | 2.81    | mpp[tiflash] |                              | ExchangeType: HashPartition, Compression: FAST, Hash Cols: [name: test.window_function_0_part.i, collate: binary], [name: test.window_function_0_part.s, collate: utf8mb4_bin], [name: test.window_function_0_part.i, collate: binary], stream_count: 8                                                                                                                                                         |
|             └─Projection_25                  | 2.81    | mpp[tiflash] |                              | Column#6, test.window_function_0_part.id, test.window_function_0_part.i, test.window_function_0_part.r, test.window_function_0_part.s, test.window_function_0_part.dt                                                                                                                                                                                                                                           |
|               └─HashAgg_26                   | 2.81    | mpp[tiflash] |                              | group by:test.window_function_0_part.r, funcs:sum(Column#11)->Column#6, funcs:firstrow(Column#12)->test.window_function_0_part.id, funcs:firstrow(Column#13)->test.window_function_0_part.i, funcs:firstrow(test.window_function_0_part.r)->test.window_function_0_part.r, funcs:firstrow(Column#15)->test.window_function_0_part.s, funcs:firstrow(Column#16)->test.window_function_0_part.dt, stream_count: 8 |
|                 └─ExchangeReceiver_28        | 2.81    | mpp[tiflash] |                              | stream_count: 8                                                                                                                                                                                                                                                                                                                                                                                                 |
|                   └─ExchangeSender_27        | 2.81    | mpp[tiflash] |                              | ExchangeType: HashPartition, Compression: FAST, Hash Cols: [name: test.window_function_0_part.r, collate: binary], stream_count: 8                                                                                                                                                                                                                                                                              |
|                     └─HashAgg_14             | 2.81    | mpp[tiflash] |                              | group by:test.window_function_0_part.r, funcs:count(test.window_function_0_part.s)->Column#11, funcs:firstrow(test.window_function_0_part.id)->Column#12, funcs:firstrow(test.window_function_0_part.i)->Column#13, funcs:firstrow(test.window_function_0_part.s)->Column#15, funcs:firstrow(test.window_function_0_part.dt)->Column#16                                                                         |
|                       └─TableFullScan_23     | 8100.00 | mpp[tiflash] | table:window_function_0_part | pushed down filter:gt(test.window_function_0_part.r, 10), keep order:false, PartitionTableScan:true                                                                                                                                                                                                                                                                                                             |
+----------------------------------------------+---------+--------------+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
13 rows in set (0.00 sec)

Note that in hashagg_14: group by:test.window_function_0_part.r, funcs:count(test.window_function_0_part.s)->Column#11, funcs:firstrow(test.window_function_0_part.id)->Column#12, funcs:firstrow(test.window_function_0_part.i)->Column#13, funcs:firstrow(test.window_function_0_part.s)->Column#15, funcs:firstrow(test.window_function_0_part.dt)->Column#16 All the columns like id/i/s/dt is not in the group by clause, which means the result of id/i/s/dt is actually random after hashagg_14. The following query shows this randomness clearly.

mysql> select r,count(s), id, i, s, dt from window_function_0_part where r > 10 group by r;
+------------+----------+-----+------+------+---------------------+
| r          | count(s) | id  | i    | s    | dt                  |
+------------+----------+-----+------+------+---------------------+
| 2147483646 |     2250 | 184 | NULL | NULL | 9999-12-03 00:00:00 |
| 123456.789 |     2250 | 364 | NULL | NULL | 9999-12-03 00:00:00 |
| 2147483647 |     2250 | 149 | NULL | NULL | 9999-12-02 23:59:59 |
+------------+----------+-----+------+------+---------------------+
3 rows in set (0.03 sec)

mysql> select r,count(s), id, i, s, dt from window_function_0_part where r > 10 group by r;
+------------+----------+------+-------------+------+---------------------+
| r          | count(s) | id   | i           | s    | dt                  |
+------------+----------+------+-------------+------+---------------------+
| 123456.789 |     2250 | 5548 | -2147483647 | NULL | 9999-12-03 00:00:00 |
| 2147483646 |     2250 | 5367 | -2147483647 | NULL | 1000-01-02 23:59:59 |
| 2147483647 |     2250 | 5332 | -2147483647 | NULL | 9999-12-03 00:00:00 |
+------------+----------+------+-------------+------+---------------------+
3 rows in set (0.03 sec)

Since the output of hashagg_14 is random, the query result is unstable. Close this bug as Not A Bug.