uccross / skyhookdm-ceph-cls

Skyhook Data Management: Storage and management of tabular data in Ceph.
https://www.skyhookdm.com
GNU Lesser General Public License v2.1

Add SQL GROUP BY to Skyhook Flatbuffer processing #26

Closed jlefevre closed 4 years ago

jlefevre commented 5 years ago

This project will develop object-class methods to sort/group query result sets. This requires extending the current code (select/project/basic aggregations: min/max/sum/count) to support groupby and/or orderby.

The Skyhook flatbuffer data format is specified here.

Current processing includes select, project, and the simple aggregates min/max/count/sum. The data is a row-oriented layout that uses Google Flatbuffers. Each row (or an aggregate across all rows) is processed in a loop here.

For each row, first all of the predicates are checked for each column (data schema), and if they all pass, the projected columns (query_schema) are encoded into the return flatbuffer based upon their data type here.

Currently, simple aggregates are accumulated during the applyPredicates function into a typed predicate object over all rows that pass (nothing is encoded into the return flatbuffer until the very end) here.
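The following is a hedged, self-contained sketch of this per-row flow (predicate check per column, simple aggregates accumulated across passing rows, then type-switched encoding of the projected columns). All names here are illustrative only; the real code works over the Skyhook flatbuffer schemas linked above.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <variant>
#include <vector>

using Value = std::variant<int64_t, double, std::string>;
using Row = std::vector<Value>;

int main() {
  std::vector<Row> rows = {{int64_t{5}, 1.5, std::string("a")},
                           {int64_t{12}, 2.0, std::string("b")}};

  int64_t agg_sum = 0;  // simple aggregate accumulated over passing rows

  for (const Row& row : rows) {
    // 1. check the predicates against the data-schema columns
    bool pass = std::get<int64_t>(row[0]) > 3;  // e.g., col0 > 3
    if (!pass) continue;

    // 2. accumulate simple aggregates (nothing encoded yet)
    agg_sum += std::get<int64_t>(row[0]);

    // 3. encode the projected (query-schema) columns, switching on data
    //    type; std::visit stands in for the per-type encoding switch
    for (const Value& v : row)
      std::visit([](const auto& x) { std::cout << x << " "; }, v);
    std::cout << "\n";
  }
  std::cout << "sum(col0) = " << agg_sum << "\n";
  return 0;
}
```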

The SQL GROUP BY statement reorganizes the rows returned by the query, as described here / here.

This project requires implementing SQL GROUP BY by grouping the rows of the return flatbuffer by similar column values (such as grouping on the same ShippingDate). One straightforward approach is to build a hashmap keyed on the grouping column's values: for each unique value, store the associated row numbers in a list as that key's value. After all rows have been evaluated, iterate over the map and encode the rows of each group using the existing encoding switch statement (see the sketch below).
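Here is a minimal sketch of that hashmap approach, assuming the grouping column's values have already been decoded to strings and that row numbers index the rows that passed all predicates. The names (groupRowsByColumn, encodeRow) are illustrative, not actual Skyhook identifiers.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// map: unique grouping-column value -> row numbers belonging to that group
std::unordered_map<std::string, std::vector<size_t>>
groupRowsByColumn(const std::vector<std::string>& group_col_values) {
  std::unordered_map<std::string, std::vector<size_t>> groups;
  for (size_t row = 0; row < group_col_values.size(); ++row)
    groups[group_col_values[row]].push_back(row);
  return groups;
}

int main() {
  // e.g., ShippingDate values for the rows that passed the predicates
  std::vector<std::string> ship_dates = {
      "1996-03-13", "1996-04-12", "1996-03-13", "1996-01-29"};

  auto groups = groupRowsByColumn(ship_dates);

  // after all rows are evaluated, iterate the map and encode each group's
  // rows; encodeRow() here stands in for the existing encoding switch
  for (const auto& [value, row_nums] : groups) {
    for (size_t row : row_nums) {
      // encodeRow(row);  // reuse the existing per-type encoding logic
      std::cout << value << " -> row " << row << "\n";
    }
  }
  return 0;
}
```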

This project may require refactoring the encoding switch into a separate function so it can be reused by both the existing code and the new groupby code. Testing can be done using our TPCH Lineitem table data; the schema is specified here.

A very similar method will also need to be implemented for our column-oriented processing function here, which uses the Apache Arrow format.
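Below is a hedged sketch of the analogous grouping for the Arrow (column-oriented) path: instead of iterating rows, scan the grouping column's Array and collect the row indices for each distinct value. It only shows the index-gathering step; encoding the grouped rows back out would still go through the existing Arrow processing code.

```cpp
#include <arrow/api.h>

#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
  // build a small grouping column (e.g., ShippingDate) for illustration
  arrow::StringBuilder builder;
  (void)builder.Append("1996-03-13");
  (void)builder.Append("1996-04-12");
  (void)builder.Append("1996-03-13");
  std::shared_ptr<arrow::Array> col;
  if (!builder.Finish(&col).ok()) return 1;

  auto dates = std::static_pointer_cast<arrow::StringArray>(col);

  // distinct value -> row indices within the column
  std::unordered_map<std::string, std::vector<int64_t>> groups;
  for (int64_t i = 0; i < dates->length(); ++i)
    groups[dates->GetString(i)].push_back(i);

  for (const auto& [value, idxs] : groups)
    std::cout << value << ": " << idxs.size() << " row(s)\n";
  return 0;
}
```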

Once the groupby functionality as described above is correct, the approach should be further optimized as appropriate to minimize memory footprint and/or processing complexity.

jlefevre commented 5 years ago

An extension of this is to sort the groups returned (SQL ORDER BY) and to compute the current simple aggregates (min/max/sum/count) along with the GROUP BY statement. The simple aggregates are already computed here during applyPredicates; one could reuse this existing per-row agg computation, and then use the existing agg encoding to encode each group's simple aggregate once you are ready to encode the groups into the return flatbuffer.
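A minimal sketch of per-group aggregation, assuming the grouping key and the aggregated column have already been decoded: keep one running (min, max, sum, count) per group so each group's aggregate can be encoded once all rows are processed. The struct and variable names are illustrative, not the actual Skyhook types.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Agg {
  int64_t min = std::numeric_limits<int64_t>::max();
  int64_t max = std::numeric_limits<int64_t>::min();
  int64_t sum = 0;
  int64_t count = 0;
};

int main() {
  // (group key, value) pairs for rows that passed the predicates
  std::vector<std::pair<std::string, int64_t>> rows = {
      {"1996-03-13", 17}, {"1996-04-12", 36}, {"1996-03-13", 8}};

  std::map<std::string, Agg> per_group;
  for (const auto& [key, val] : rows) {
    Agg& a = per_group[key];
    a.min = std::min(a.min, val);
    a.max = std::max(a.max, val);
    a.sum += val;
    ++a.count;
  }

  // once the groups are ready to be encoded, emit one aggregate per group
  for (const auto& [key, a] : per_group)
    std::cout << key << ": min=" << a.min << " max=" << a.max
              << " sum=" << a.sum << " count=" << a.count << "\n";
  return 0;
}
```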

Sorting should preferably be done in a general way so that it could be used to sort either the groups or all of the rows in the return flatbuffer.
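One way to keep the sort generic is to sort row (or group) indices with a comparator over the ORDER BY column's values, so the same routine can order group keys or the whole result set. A small sketch with illustrative names:

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
  // values of the ORDER BY column, indexed by row number
  std::vector<std::string> order_col = {"1996-04-12", "1996-03-13",
                                        "1996-01-29"};

  // sort row numbers rather than the rows themselves, so the same approach
  // works for sorting groups (by key) or all returned rows (by any column)
  std::vector<size_t> row_order(order_col.size());
  for (size_t i = 0; i < row_order.size(); ++i) row_order[i] = i;

  std::sort(row_order.begin(), row_order.end(),
            [&](size_t a, size_t b) { return order_col[a] < order_col[b]; });

  for (size_t row : row_order)
    std::cout << "row " << row << " (" << order_col[row] << ")\n";
  return 0;
}
```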