systay opened 3 months ago
Looks quite good! We don't really care about the chunk size though, right? Would it not always be better to send them all in one go? Also, we could probably optimize this even more to only send the values for a specific shard in that chunk, kind of like how we do it for IN. This way it will always be better than the SingleUnique index, and we don't need to make a decision at runtime.
Similarly, de-duping would go a long way for many queries. The same query for the same row in the right table is currently run once per occurrence on the left. This batching could allow de-duping, at least across the batching window.
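To illustrate what de-duping buys (a hypothetical example, reusing the `user_extra` RHS table from the examples in this thread): if the LHS yields the ids 1, 1, 2, 1, the RHS query is currently issued four times, three of them identical. With batching, the collected ids can be de-duplicated before the RHS query is sent:

```sql
-- LHS rows yielded ids: 1, 1, 2, 1
-- today: four RHS round-trips, three of them for id = 1
-- with batching + de-duping, one RHS query covers the whole window:
SELECT ue.id, ue.foo, ue.bar
FROM user_extra ue
WHERE ue.id IN (1, 2);
```

The join operator would then fan the fetched rows back out to each matching LHS occurrence.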
I have also thought about similar issues before and considered rewriting RHS equality queries to use IN, just like in your example. The query would become:

```sql
SELECT u.name FROM user AS u WHERE u.bar = 'foo' AND u.id IN (:ue_ids)
```

where `ue_ids` represents all `ue_id` values obtained from the LHS.

I would like to ask about the differences between using the VALUES statement in MySQL and the IN clause.
The IN clause is useful when only retrieving data from the RHS columns. For instance:

```sql
SELECT ue.id, ue.foo, ue.bar
FROM user u
JOIN user_extra ue ON u.id = ue.id
```

Here, you can use the IN clause to optimize the query:

```sql
SELECT ue.id, ue.foo, ue.bar
FROM user_extra ue
WHERE ue.id IN (:ue_ids)
```
However, if multiple columns from the LHS are needed, the IN clause falls short. Consider this query:

```sql
SELECT u.foo, ue.bar
FROM user u
JOIN user_extra ue ON u.id = ue.id
```

In this case, you need both `u.id` and `u.foo` from the LHS. The IN clause can't be used effectively because it doesn't support matching multiple columns together: you need to ensure the pairing of `u.id` and `u.foo` remains consistent across rows. This is where the VALUES statement excels, as it allows you to batch these paired values and maintain their association.
Using the VALUES statement, the query becomes:

```sql
SELECT u.foo, ue.bar
FROM user u
JOIN (VALUES ROW(1, 'foo'), ROW(2, 'bar')) AS ue(id, foo)
  ON u.id = ue.id
```

This approach keeps the paired values together, ensuring data integrity across the join operation.
I wanted to clarify how multi-join situations and complex expressions are handled with the ValuesJoin optimization:

```sql
SELECT u.name, o.order_date, p.product_name, u.loyalty_score + o.order_value AS customer_value
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN products p ON o.product_id = p.id
WHERE u.country = 'USA' AND o.status = 'completed' AND p.category = 'electronics';
```
Query the 'users' shard:

```sql
SELECT u.id, u.name, u.loyalty_score FROM users u WHERE u.country = 'USA';
```

Join with 'orders' using ValuesJoin:

```sql
SELECT u.id, u.name, o.id AS order_id, o.order_date, o.product_id,
       u.loyalty_score + o.order_value AS customer_value
FROM (VALUES ROW(1, 'Alice', 100), ROW(2, 'Bob', 150)) AS u(id, name, loyalty_score)
JOIN orders o ON u.id = o.user_id
WHERE o.status = 'completed';
```
Final join with 'products':

```sql
SELECT v.name, v.order_date, p.product_name, v.customer_value
FROM (VALUES
  ROW(1, 'Alice', 101, '2023-01-01', 201, 250),
  ROW(2, 'Bob', 102, '2023-01-02', 202, 300)
) AS v(user_id, name, order_id, order_date, product_id, customer_value)
JOIN products p ON v.product_id = p.id
WHERE p.category = 'electronics';
```
This shows how intermediate expressions (such as `u.loyalty_score + o.order_value`) can be computed in earlier join stages.

> In scenarios where the LHS yields few rows and the RHS can be optimized using a SingleUnique index, the engine should dynamically decide to revert to row-by-row processing. This would occur at runtime, ensuring that the most efficient execution strategy is chosen based on the data characteristics.
Talking this over with @harshit-gangal, we don't think it actually makes sense to fall back to single rows here. If the RHS is a scatter, it would just mean that we need to send many scatter queries, which is not really preferable. We don't see a situation right now that would require falling back, so currently we are thinking all joins would be ValuesJoins.
@frouioui and I talked about whether there are any situations where it would still make sense to use ApplyJoin:
```go
// Determine the join mode based on MySQL version and query context.
// VALUES as a table constructor requires MySQL 8.0.19+, and ApplyJoin
// can stream the first rows sooner when the caller wants a fast first row.
// (versionAtLeast is a placeholder for the real version check.)
mode := ValuesJoin
if !versionAtLeast(version, 8, 0, 19) || (ctx.wantsFastFirstRow() && !rhs.isGreedy()) {
	mode = ApplyJoin
}
```
Explanation:
Greedy Operator in Query Planning: In query planning, a "greedy" operator is one that consumes all of its input before producing any output. This is in contrast to non-greedy (or "lazy") operators that can produce output incrementally as they process their input.
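As a rough illustration of the distinction (general query-planning behavior, not specific to Vitess):

```sql
-- Lazy pipeline: matching rows can stream out as they are found, so the
-- first row arrives quickly and LIMIT lets execution stop early:
SELECT u.name FROM user u WHERE u.bar = 'foo' LIMIT 10;

-- Greedy pipeline: the sort must consume every input row before it can
-- emit the first output row:
SELECT u.name FROM user u ORDER BY u.name;
```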
Here's a list of MySQL operators or constructs that benefit from fast first row retrieval:
In OLTP mode, all operators are greedy. Today we do not differentiate between plans for these two modes; we probably should start doing that if we introduce a greedy property on operators.
Summary
This proposal suggests an optimization for join operations in Vitess by introducing chunking of rows using the VALUES statement in MySQL. The goal is to reduce the number of network round-trips and improve query performance by batching data transfer.

Background

Currently, Vitess handles joins between sharded tables by using bind variables to fetch rows from the right-hand side (RHS) of the join, based on values obtained from the left-hand side (LHS), a so-called nested loop join. This approach can lead to many network round-trips, especially when the join involves a significant number of rows, impacting overall performance.
Proposal
I propose leveraging the VALUES statement to batch multiple rows together, thus minimizing the number of network round-trips. The engine would generate a query using the VALUES clause for the RHS of the join, allowing multiple rows to be processed in a single request.

The VALUES statement in MySQL is used to construct a set of rows that can be treated as a table. It allows you to define multiple rows of data, where each row is represented as `ROW(value1, value2, ...)`. This virtual table can then be used in various SQL operations, such as joins or unions, making it a useful tool for batch processing data in queries. For example, `VALUES ROW(1, 'foo'), ROW(2, 'bar')` creates a temporary result set with two rows and two columns, which can be aliased and joined with other tables in a query.

Example Change:
The current process uses a query like:
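Something along these lines (illustrative; the schema is assumed from the `user`/`user_extra` examples earlier in this thread):

```sql
SELECT u.name, ue.bar
FROM user u
JOIN user_extra ue ON u.id = ue.user_id;
```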
With bind variables, the RHS query becomes:
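For example (illustrative, using the same assumed `user_extra` table), one RHS query is issued per LHS row:

```sql
SELECT ue.bar FROM user_extra ue WHERE ue.user_id = :u_id
```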
The proposed query with chunking:
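A sketch of the batched RHS query (illustrative; column names are assumptions, with the batched LHS values inlined as a VALUES table):

```sql
SELECT lhs.id, lhs.foo, ue.bar
FROM (VALUES ROW(1, 2), ROW(4, 5)) AS lhs(id, foo)
JOIN user_extra ue ON ue.user_id = lhs.id;
```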
In this case, `ROW(1,2), ROW(4,5)` represent batched data for the join.

Row-by-Row Fallback
In scenarios where the LHS yields few rows and the RHS can be optimized using a SingleUnique index, the engine should dynamically decide to revert to row-by-row processing. This would occur at runtime, ensuring that the most efficient execution strategy is chosen based on the data characteristics.

Considerations

The use of VALUES in this context needs to be validated across different MySQL versions and configurations used in Vitess.

Conclusion
This could give us a nice performance boost in a lot of situations, and it does not sound too hard to implement. I think it could be worth spending some time on.