systay opened 3 months ago
Looks quite good! We don't really care about the chunk size though, right? Would it not always be better to send them all in one go? Also, we could probably optimize this even more to only send the values for a specific shard in that chunk, kind of like how we do it for IN. This way it will always be better than the SingleUnique index, and we don't need to make a decision at runtime.
Similarly, de-duping would go a long way for many queries. The same query for the same row in the right table is currently run once per occurrence on the left. This batching could allow de-duping, at least across the batching window.
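To illustrate what de-duping buys (a hypothetical example, reusing the `user_extra` RHS table from the examples in this thread): if the LHS yields the ids 1, 1, 2, 1, the RHS query is currently issued four times, three of them identical. With batching, the collected ids can be de-duplicated before the RHS query is sent:

```sql
-- LHS rows yielded ids: 1, 1, 2, 1
-- today: four RHS round-trips, three of them for id = 1
-- with batching + de-duping, one RHS query covers the whole window:
SELECT ue.id, ue.foo, ue.bar
FROM user_extra ue
WHERE ue.id IN (1, 2);
```

The join operator would then fan the fetched rows back out to each matching LHS occurrence.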
I have also thought about similar issues before and considered rewriting RHS equality queries to use IN, just like in your example. The query would become:

```sql
SELECT u.name FROM user AS u WHERE u.bar = 'foo' AND u.id IN (:ue_ids)
```

where `ue_ids` represents all `ue_id` values obtained from the LHS.

I would like to ask about the differences between using the VALUES statement in MySQL and the IN clause.
The IN clause is useful when only retrieving data from the RHS columns. For instance:

```sql
SELECT ue.id, ue.foo, ue.bar
FROM user u
JOIN user_extra ue ON u.id = ue.id
```

Here, you can use the IN clause to optimize the query:

```sql
SELECT ue.id, ue.foo, ue.bar
FROM user_extra ue
WHERE ue.id IN (:ue_ids)
```
However, if multiple columns from the LHS are needed, the IN clause falls short. Consider this query:

```sql
SELECT u.foo, ue.bar
FROM user u
JOIN user_extra ue ON u.id = ue.id
```

In this case, you need both `u.id` and `u.foo` from the LHS. The IN clause can't be used effectively because it doesn't support matching multiple columns together: you need to ensure the pairing of `u.id` and `u.foo` remains consistent across rows. This is where the VALUES statement excels, as it allows you to batch these paired values and maintain their association.
Using the VALUES statement, the query becomes:

```sql
SELECT u.foo, ue.bar
FROM user u
JOIN (VALUES ROW(1, 'foo'), ROW(2, 'bar')) AS ue(id, foo)
  ON u.id = ue.id
```

This approach keeps the paired values together, ensuring data integrity across the join operation.
I wanted to clarify how multi-join situations and complex expressions are handled with the ValuesJoin optimization:

```sql
SELECT u.name, o.order_date, p.product_name, u.loyalty_score + o.order_value AS customer_value
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN products p ON o.product_id = p.id
WHERE u.country = 'USA' AND o.status = 'completed' AND p.category = 'electronics';
```
Query the 'users' shard:

```sql
SELECT u.id, u.name, u.loyalty_score FROM users u WHERE u.country = 'USA';
```

Join with 'orders' using ValuesJoin:

```sql
SELECT u.id, u.name, o.id AS order_id, o.order_date, o.product_id,
       u.loyalty_score + o.order_value AS customer_value
FROM (VALUES ROW(1, 'Alice', 100), ROW(2, 'Bob', 150)) AS u(id, name, loyalty_score)
JOIN orders o ON u.id = o.user_id
WHERE o.status = 'completed';
```
Final join with 'products':

```sql
SELECT v.name, v.order_date, p.product_name, v.customer_value
FROM (VALUES
  ROW(1, 'Alice', 101, '2023-01-01', 201, 250),
  ROW(2, 'Bob', 102, '2023-01-02', 202, 300)
) AS v(user_id, name, order_id, order_date, product_id, customer_value)
JOIN products p ON v.product_id = p.id
WHERE p.category = 'electronics';
```
This shows how intermediate expressions (such as `u.loyalty_score + o.order_value`) can be computed in earlier join stages.

> In scenarios where the LHS yields few rows and the RHS can be optimized using a SingleUnique index, the engine should dynamically decide to revert to row-by-row processing. This would occur at runtime, ensuring that the most efficient execution strategy is chosen based on the data characteristics.
Talking this over with @harshit-gangal, we don't think it actually makes sense to fall back to single rows here. If the RHS is a scatter, it would just mean that we need to send many scatter queries, which is not really preferable. We don't see a situation right now that would require falling back, so currently we are thinking all joins would be ValuesJoins.
@frouioui and I talked about whether there are any situations where it would still make sense to use ApplyJoin:
```go
// Determine the join mode based on MySQL version and query context.
// VALUES as a table constructor requires MySQL 8.0.19+, and ApplyJoin
// can stream the first rows sooner when the caller wants a fast first row.
// (versionAtLeast is a placeholder for the real version check.)
mode := ValuesJoin
if !versionAtLeast(version, 8, 0, 19) || (ctx.wantsFastFirstRow() && !rhs.isGreedy()) {
	mode = ApplyJoin
}
```
Explanation:
Greedy Operator in Query Planning: In query planning, a "greedy" operator is one that consumes all of its input before producing any output. This is in contrast to non-greedy (or "lazy") operators that can produce output incrementally as they process their input.
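As a rough illustration of the distinction (general query-planning behavior, not specific to Vitess):

```sql
-- Lazy pipeline: matching rows can stream out as they are found, so the
-- first row arrives quickly and LIMIT lets execution stop early:
SELECT u.name FROM user u WHERE u.bar = 'foo' LIMIT 10;

-- Greedy pipeline: the sort must consume every input row before it can
-- emit the first output row:
SELECT u.name FROM user u ORDER BY u.name;
```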
Here's a list of MySQL operators or constructs that benefit from fast first row retrieval:
In OLTP mode, all operators are greedy. Today we do not differentiate between plans for these two modes; we probably should start doing that if we introduce a greedy property on operators.
Summary
This proposal suggests an optimization for join operations in Vitess by introducing chunking of rows using the VALUES statement in MySQL. The goal is to reduce the number of network round-trips and improve query performance by batching data transfer.

Background

Currently, Vitess handles joins between sharded tables by using bind variables to fetch rows from the right-hand side (RHS) of the join, based on values obtained from the left-hand side (LHS), a so-called nested loop join. This approach can lead to many network round-trips, especially when the join involves a significant number of rows, impacting overall performance.
Proposal
I propose leveraging the VALUES statement to batch multiple rows together, thus minimizing the number of network round-trips. The engine would generate a query using the VALUES clause for the RHS of the join, allowing multiple rows to be processed in a single request.

The VALUES statement in MySQL is used to construct a set of rows that can be treated as a table. It allows you to define multiple rows of data, where each row is represented as `ROW(value1, value2, ...)`. This virtual table can then be used in various SQL operations, such as joins or unions, making it a useful tool for batch processing data in queries. For example, `VALUES ROW(1, 'foo'), ROW(2, 'bar')` creates a temporary result set with two rows and two columns, which can be aliased and joined with other tables in a query.

Example Change:
The current process uses a query like:
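Something along these lines (illustrative; the schema is assumed from the `user`/`user_extra` examples earlier in this thread):

```sql
SELECT u.name, ue.bar
FROM user u
JOIN user_extra ue ON u.id = ue.user_id;
```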
With bind variables, the RHS query becomes:
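For example (illustrative, using the same assumed `user_extra` table), one RHS query is issued per LHS row:

```sql
SELECT ue.bar FROM user_extra ue WHERE ue.user_id = :u_id
```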
The proposed query with chunking:
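A sketch of the batched RHS query (illustrative; column names are assumptions, with the batched LHS values inlined as a VALUES table):

```sql
SELECT lhs.id, lhs.foo, ue.bar
FROM (VALUES ROW(1, 2), ROW(4, 5)) AS lhs(id, foo)
JOIN user_extra ue ON ue.user_id = lhs.id;
```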
In this case, `ROW(1,2), ROW(4,5)` represent batched data for the join.

Row-by-Row Fallback
In scenarios where the LHS yields few rows and the RHS can be optimized using a SingleUnique index, the engine should dynamically decide to revert to row-by-row processing. This would occur at runtime, ensuring that the most efficient execution strategy is chosen based on the data characteristics.

Considerations

The use of VALUES in this context needs to be validated across different MySQL versions and configurations used in Vitess.

Conclusion
This could give us a nice performance boost in a lot of situations, and it does not sound too hard to implement. I think it could be worth spending some time on.