Optimize join ordering in N-table joins where N > 2

google / lovefield

Lovefield is a relational database for web apps. Written in JavaScript, works cross-browser. Provides SQL-like APIs that are fast, safe, and easy to use.

https://google.github.io/lovefield/

Apache License 2.0

6.82k stars 367 forks source link

Optimize join ordering in N-table joins where N > 2 #41

Open arthurhsu opened 9 years ago

arthurhsu commented 9 years ago

Join ordering is a pretty heavy concept in query optimization. The available approaches can get very involved and maybe outside the scope of Lovefield.

As an example of different levels of addressing the issue consider the case of a 3 vs 4 table join. A 3 table join graph has always the same chain structure as A -> B-> C, where edges in this graph represent a join condition between (leftTable, rightTable). On the other hand, In a 4 table join case the graph can have different structure (chain, cycle, star, clique etc), and therefore optimizing such graphs is a harder problem.

Need to find the extend to which Lovefield should address the join ordering optimization issue. There are simplifications that can be made, for example considering left-deep only trees.

At the very least, for the 3table join case, there should be no unnecessary cross-product operations, which with our current approach is not guaranteed (depends on the order of the tables as supplied in the query).

agershun commented 9 years ago

I think you can start with of cost function which depends on the number of estimated records in each table and the sequence of these table (and may be other parameters). One approach - simply model it with multiple samples runs (with different order of joined tables, and then process it with neural network or any other machine learning methods to esimate this function coefficients.

I suppose that this function will be different for Lovefield and other databases, because the cost of joining is different for different engines. If you would like we can compare database engine coefficients on the different sets of data.

The problem is interesting, I will try to model this task and send you results.

agershun commented 9 years ago

BTW You are talking about INNER JOINs only, right? Because other JOINs are non-commutative, AFAIK.

agershun commented 9 years ago

I tried to model the situation with four joined tables and compare results in direct and reversed order. Please, see this test file

    SELECT * 
      FROM one
      INNER JOIN two ON one.b = two.b
      INNER JOIN three ON two.c = three.c
      INNER JOIN four ON three.d = four.d;

   SELECT * 
      FROM four 
      INNER JOIN three ON three.d = four.d
      INNER JOIN two ON two.c = three.c
      INNER JOIN one ON one.b = two.b;

The hypothesis was: direct order is faster if number of records in first two tables more than next two records. The probability of positive test was about 60%. It does not worth for special optimization (of course, for these paticular kind of joins).

Of course, there are many factors, which affects on the result, and preindexation is the first one.