Open Yriuns opened 2 years ago
Hi, this is not a bug. The index is (a, b, c, d), with an IN condition on c and equality conditions on a and b, and you want the data returned in the order of d.
Since the IN condition matches multiple values of c, the data is read in the order of (c, d), not d, so without a Top-N the result would not be sorted by d.
Also, when a request is pushed down to tikv, the basic request unit is the region, not the range. If the two ranges are in the same region, there is only one request, so we cannot simply push the Limit down to TiKV.
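One way to read the point above is that a Limit attached to a single request applies to the request as a whole, not to each range inside it. The sketch below is a toy model in Python, not TiKV code: the `index` list, the `scan_request` helper, and the sample ids are all made up for illustration.

```python
# Toy model (hypothetical data): index entries (a, b, c, id) in index order.
# Range c=2 holds ids 100..119, range c=3 holds ids 0..19.
index = sorted(
    (0, 1, c, i)
    for c, ids in {2: range(100, 120), 3: range(0, 20)}.items()
    for i in ids
)

def scan_request(c_values, limit):
    """One request covering several ranges, with ONE Limit for the whole request."""
    out = []
    for c in c_values:              # ranges are visited in index order
        for row in index:
            if row[2] == c:
                out.append(row)
                if len(out) == limit:
                    return out      # the limit is reached inside the first range
    return out

single_limit = scan_request(c_values=[2, 3], limit=10)
true_top10 = sorted((r for r in index if r[2] in (2, 3)), key=lambda r: r[3])[:10]

print([r[3] for r in single_limit])  # ids taken only from the c=2 range
print([r[3] for r in true_top10])    # the ids the query actually wants
```

In this model the single Limit never reaches the c=3 range, so the rows with the smallest ids are missed.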
I'm afraid you misunderstood my question.
The optimal execution plan is:
1. tidb asks tikv to scan the first 10 records of (a=0, b=1, c=2), with their id fields. (Limit@cop)
2. tidb asks tikv to scan the first 10 records of (a=0, b=1, c=3), with their id fields. (Limit@cop)
3. tidb performs a Top-10 on these 20 records, ordered by id. (TopN@root)
There is no need to let tikv perform the Top-10, right?
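The three steps can be sketched as a small Python simulation. This is only an illustration of the idea, not TiDB or TiKV code: `index`, `cop_limit`, and `root_topn` are hypothetical names, and the data is invented so that the two ranges have interleaved ids.

```python
import heapq

# Hypothetical index entries (a, b, c, id), kept in index order.
# c=2 holds the even ids, c=3 the odd ids, so the ranges interleave by id.
index = sorted((0, 1, c, i) for c in (2, 3) for i in range(c, 2_000, 2))

def cop_limit(c, limit):
    """Limit@cop: the first `limit` entries of the range (a=0, b=1, c=c)."""
    rows = [r for r in index if r[:3] == (0, 1, c)]
    return rows[:limit]             # already ordered by id within one range

def root_topn(rows, n):
    """TopN@root: the n rows with the smallest id."""
    return heapq.nsmallest(n, rows, key=lambda r: r[3])

candidates = cop_limit(2, 10) + cop_limit(3, 10)   # 20 rows in total
plan_result = root_topn(candidates, 10)

# Reference answer: sort every matching row by id and keep the first 10.
reference = sorted(index, key=lambda r: r[3])[:10]
print(plan_result == reference)
```

The per-range Limit is safe here because each range is already sorted by id, so no row outside the first 10 of a range can appear in the global top 10.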
Oh... it seems that this is the root cause. Will it be supported in the future? That is, could the executors handle multiple ranges independently?
I think such queries are quite common. If the Limit can be pushed down, we can save a lot of unnecessary scanning.
@winoros Or is it possible for the planner to generate such an execution plan by itself?
SELECT id, a, b, c, d
FROM (
(SELECT id, a, b, c, d FROM t WHERE a = 0 AND b = 1 AND c = 2 ORDER BY id ASC LIMIT 10)
UNION
(SELECT id, a, b, c, d FROM t WHERE a = 0 AND b = 1 AND c = 3 ORDER BY id ASC LIMIT 10)
) r
ORDER BY id ASC
LIMIT 10;
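A quick way to check that this rewrite returns the same rows as the single-query form is to run both against a small table. The sketch below uses Python's built-in `sqlite3` rather than TiDB, so it only checks result equivalence, not the execution plan. The table definition, the sample data, and the `c IN (2, 3)` form of the original query are assumptions, since the reproduce step is not shown in this thread. SQLite does not accept parenthesized SELECTs inside a UNION, so each branch is wrapped in `SELECT * FROM (...)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: the thread only names the index a_b_c_id.
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INT, b INT, c INT, d INT)")
conn.execute("CREATE INDEX a_b_c_id ON t (a, b, c, id)")
conn.executemany(
    "INSERT INTO t (id, a, b, c, d) VALUES (?, 0, 1, ?, ?)",
    [(i, 2 + i % 3, i * 7 % 11) for i in range(1, 301)],  # c cycles through 2, 3, 4
)

# Assumed original query, inferred from the conditions quoted in the thread.
original = """
SELECT id, a, b, c, d FROM t
WHERE a = 0 AND b = 1 AND c IN (2, 3)
ORDER BY id ASC LIMIT 10
"""

# The UNION rewrite, adapted to SQLite syntax.
rewritten = """
SELECT id, a, b, c, d FROM (
  SELECT * FROM (SELECT id, a, b, c, d FROM t
                 WHERE a = 0 AND b = 1 AND c = 2 ORDER BY id ASC LIMIT 10)
  UNION
  SELECT * FROM (SELECT id, a, b, c, d FROM t
                 WHERE a = 0 AND b = 1 AND c = 3 ORDER BY id ASC LIMIT 10)
) r
ORDER BY id ASC LIMIT 10
"""

original_rows = conn.execute(original).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()
print(original_rows == rewritten_rows)
```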
We cannot guarantee such behavior, since it would increase the number of requests we send and therefore the pressure on the TiKV side.
I know that. I mean the cost-based optimizer should be able to pick the cheaper of the two: the cost of more RPCs or the cost of more scanning.
In our scenario, the number of records that match a = 0 AND b = 1 AND c IN (2, 3) may reach several million, which means a range scan of several million records in tikv. However, if the planner can rewrite the SQL into the UNION form, it only needs to scan 20 records.
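The trade-off between more RPCs and more scanning can be illustrated with a toy cost function. This is not TiDB's cost model: `plan_cost`, the weights `rpc_cost` and `row_cost`, and the row counts are all hypothetical values chosen only to show how the comparison could flip.

```python
def plan_cost(rpc_count, rows_scanned, rpc_cost=1000.0, row_cost=1.0):
    """Hypothetical cost: a fixed price per request plus a price per scanned row."""
    return rpc_count * rpc_cost + rows_scanned * row_cost

def cheapest_plan(matching_rows, ranges=2, limit=10):
    """Compare one request scanning everything with one request per range."""
    plans = {
        "TopN@cop": plan_cost(rpc_count=1, rows_scanned=matching_rows),
        "Limit@cop per range": plan_cost(rpc_count=ranges, rows_scanned=ranges * limit),
    }
    return min(plans, key=plans.get), plans

# Several million matching rows: the extra request is cheap compared to the scan.
large_choice, large_costs = cheapest_plan(matching_rows=3_000_000)

# Only a few dozen matching rows: the extra request is not worth it.
small_choice, small_costs = cheapest_plan(matching_rows=50)

print(large_choice, large_costs)
print(small_choice, small_costs)
```

With these made-up weights the per-range Limit wins for the large case and the single-request TopN wins for the small one, which is the kind of decision a cost-based choice would have to make.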
Hi there, I made a pull request that enables the optimizer to generate an execution plan with the Limit pushed down. Will you take a look? If you agree with the idea of this PR, I will continue to improve it. Otherwise, I will just close it 😢
Let me contact our PM first.
Oh, just to ask: does the RDBMS you currently use support this behavior?
I provided an optimizer hint to enable this, so I think your concern about "behavior" is unnecessary. This is just an identity transformation in relational algebra.
@Yriuns Yes, it's not elegant enough to put into TiDB as a release behavior. A switch usually means that if we don't enable it, most of the users won't use it.
Agreed, this feature will increase the burden of physical plan optimization. It should be used only if the gain exceeds the CPU overhead. Under the current optimizer framework, it seems impossible for the optimizer itself to decide, based on cost, whether to use it.
But some is better than none, right? : )
And what's the official attitude toward optimizer hints? I just noticed that v6.1 added a LEADING hint.
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
2. What did you expect to see? (Required)
The fields in the WHERE and ORDER BY clauses all hit the index a_b_c_id, so the Limit operator should be pushed down to each range in tikv, and then tidb performs a final TopN, something like
3. What did you see instead (Required)
The execution plan chooses to push down TopN rather than Limit, which results in an unnecessary Scan + Sort.
4. What is your TiDB version? (Required)