This PR is another step in making HDK more fragment-transparent. GPU execution in multifrag mode suffered a fragment-count-dependent overhead in `QueryMemoryInitializer()`.
Example of a bad scenario: a single-threaded loop does 130 iterations, and each iteration has to copy `col_buffers`, `column_frag_offsets`, and `column_frag_sizes`. With huge fragments the overhead is barely observable; however, lowering the fragment size (e.g., to accommodate CPU parallelism or to split the workload) makes it noticeable.
There are cases where the time spent creating the `QueryExecutionContext` is 20% of the entire query (e.g., 80 million rows with only 16 fragments, running a group-by count on GPU).
This PR eliminates that fragment-count-dependent overhead. Example: creating the `QueryExecutionContext` takes 1 ms for 3000 fragments.
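For context, here is a minimal C++ sketch of the pattern being fixed. The `FragmentMetadata` and `ResultSetStub` names are hypothetical stand-ins rather than the actual HDK types; the point is that every iteration deep-copies per-fragment vectors, so the setup cost grows with both the iteration count and the fragment count.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the per-fragment metadata mentioned above:
// each vector has one entry per fragment, so its size grows with the
// fragment count.
struct FragmentMetadata {
  std::vector<std::vector<const int8_t*>> col_buffers;
  std::vector<std::vector<uint64_t>> column_frag_offsets;
  std::vector<std::vector<int64_t>> column_frag_sizes;
};

struct ResultSetStub {
  FragmentMetadata meta;  // each result used to hold its own copy
};

// Before: every iteration deep-copies the metadata, so total cost is
// O(num_iterations * num_fragments) even though the data never changes.
std::vector<ResultSetStub> build_results_copying(const FragmentMetadata& meta,
                                                 std::size_t num_iterations) {
  std::vector<ResultSetStub> results;
  results.reserve(num_iterations);
  for (std::size_t i = 0; i < num_iterations; ++i) {
    results.push_back(ResultSetStub{meta});  // per-iteration deep copy
  }
  return results;
}
```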
Pitfalls?
The fragment metadata pieces targeted by this PR appear to be immutable in `ResultSet`: they are only used when appending another `ResultSet` (combining pieces of metadata without modifying either) and during iteration, so why not "share" these metadata pieces across result sets?
This "sharing" starts right before the loop, where we make a single safe copy of the required metadata. It means that a `ResultSet` does not reference the current state of a table, but we still avoid copying for the 2nd, 3rd, ..., Nth `ResultSet`.