intel / hdk

A low-level execution library for analytic data processing.
Apache License 2.0

Reduce copy for result sets in QueryMemoryInitializer #669

Closed akroviakov closed 1 year ago

akroviakov commented 1 year ago

This PR is another step toward making HDK more fragment transparent. In multifrag mode, GPU execution incurred an overhead in QueryMemoryInitializer() that grows with the fragment count. Example of a bad scenario: a single-threaded loop runs 130 iterations, and each iteration copies col_buffers, column_frag_offsets, and column_frag_sizes. With huge fragments the overhead is barely observable, but lowering the fragment size (e.g., to accommodate CPU parallelism or to split the workload) makes it noticeable. There are cases where the time spent in create QueryExecutionContext is 20% of the entire query (e.g., 80 million rows with only 16 fragments, performing a group-by count on GPU).

This PR eliminates the fragment-count dependent overhead described above. Example: create QueryExecutionContext takes 1 ms for 3000 fragments.

Pitfalls? The fragment metadata pieces targeted by this PR appear to be immutable in ResultSet: they are used only for iteration and for appending another ResultSet (which combines pieces of metadata without modifying either). So why not "share" these metadata pieces across result sets?

This "sharing" begins right before the loop, where we actually make a safe copy of the required metadata. As a result, a ResultSet does not reference the current state of a table, but we still avoid copying for the 2nd, 3rd, ..., Nth ResultSet.
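
A minimal sketch of the sharing idea, not the actual HDK code: FragmentMetadata, ResultSetSketch, and makeResultSets are hypothetical names made up for illustration, and the member types are assumptions. Only col_buffers, column_frag_offsets, and column_frag_sizes come from the description above. The metadata is copied once before the loop, and every result set then holds a read-only shared pointer to that copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the metadata pieces named in this PR.
// The member types are assumptions, not HDK's real declarations.
struct FragmentMetadata {
  std::vector<std::vector<const int8_t*>> col_buffers;
  std::vector<std::vector<int64_t>> column_frag_offsets;
  std::vector<int64_t> column_frag_sizes;
};

// Hypothetical, simplified result set that only reads the metadata.
class ResultSetSketch {
 public:
  explicit ResultSetSketch(std::shared_ptr<const FragmentMetadata> meta)
      : meta_(std::move(meta)) {
    assert(meta_ != nullptr);
  }

  const FragmentMetadata& metadata() const { return *meta_; }

 private:
  // Pointer-to-const: no result set can modify the shared metadata.
  std::shared_ptr<const FragmentMetadata> meta_;
};

// Builds `count` result sets that all share one snapshot of the metadata.
std::vector<ResultSetSketch> makeResultSets(const FragmentMetadata& table_state,
                                            size_t count) {
  // The single "safe copy", taken right before the loop.
  auto snapshot = std::make_shared<const FragmentMetadata>(table_state);

  std::vector<ResultSetSketch> result_sets;
  result_sets.reserve(count);
  for (size_t i = 0; i < count; ++i) {
    // Each iteration copies a pointer, not the three vectors.
    result_sets.emplace_back(snapshot);
  }
  return result_sets;
}
```

In this sketch the per-iteration cost no longer depends on how many fragments the metadata describes. The trade-off is the one noted above: the result sets see the snapshot, not later changes to the table state.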