Open arthurhsu opened 9 years ago
I think you can start with of cost function which depends on the number of estimated records in each table and the sequence of these table (and may be other parameters). One approach - simply model it with multiple samples runs (with different order of joined tables, and then process it with neural network or any other machine learning methods to esimate this function coefficients.
I suppose that this function will be different for Lovefield and other databases, because the cost of joining is different for different engines. If you would like we can compare database engine coefficients on the different sets of data.
The problem is interesting, I will try to model this task and send you results.
BTW You are talking about INNER JOINs only, right? Because other JOINs are non-commutative, AFAIK.
I tried to model the situation with four joined tables and compare results in direct and reversed order. Please, see this test file
SELECT *
FROM one
INNER JOIN two ON one.b = two.b
INNER JOIN three ON two.c = three.c
INNER JOIN four ON three.d = four.d;
SELECT *
FROM four
INNER JOIN three ON three.d = four.d
INNER JOIN two ON two.c = three.c
INNER JOIN one ON one.b = two.b;
The hypothesis was: direct order is faster if number of records in first two tables more than next two records. The probability of positive test was about 60%. It does not worth for special optimization (of course, for these paticular kind of joins).
Of course, there are many factors, which affects on the result, and preindexation is the first one.
Join ordering is a pretty heavy concept in query optimization. The available approaches can get very involved and maybe outside the scope of Lovefield.
As an example of different levels of addressing the issue consider the case of a 3 vs 4 table join. A 3 table join graph has always the same chain structure as A -> B-> C, where edges in this graph represent a join condition between (leftTable, rightTable). On the other hand, In a 4 table join case the graph can have different structure (chain, cycle, star, clique etc), and therefore optimizing such graphs is a harder problem.
Need to find the extend to which Lovefield should address the join ordering optimization issue. There are simplifications that can be made, for example considering left-deep only trees.
At the very least, for the 3table join case, there should be no unnecessary cross-product operations, which with our current approach is not guaranteed (depends on the order of the tables as supplied in the query).