This PR introduces asynchronous (batched) data fetching for L0 GPUs. Its purpose is to reduce the end-to-end execution time of a workload.
Why?
We have recursive materializations (from disk to CPU to GPU). Once the CPU has its buffer materialized, we begin a data transfer and wait for its completion, but why should we? We can proceed to materializing the next buffer and only require that all of the data is on the GPU right before kernel execution. This way we overlap the `memcpy` for CPU buffers with the GPU data transfer and lose nothing. Additionally, we hide the latencies caused by the buffer manager, which are constant per fragment, so their total impact grows linearly with the fragment count.
How?
We batch transfers into a command list until it reaches 128MB worth of data, then actually execute the transfer. Right after sending all of the kernel parameters (that is, right before kernel execution) we wait until the data transfers are finished (barrier). Once the transfers are done, we keep and recycle the command lists to avoid the overhead of creating and destroying them, which is again a constant per-fragment overhead that grows linearly with the fragment count.
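The batching policy can be sketched as follows. This is a CPU-only illustration: `std::memcpy` stands in for `zeCommandListAppendMemoryCopy` into a recycled command list, and the `wait()` barrier stands in for synchronizing on the command queue before kernel launch. All names here are hypothetical, not the actual classes in this PR.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Illustrative sketch of the 128MB batching policy. In the real code the
// copies go into an L0 command list that is executed once the batch is full
// and recycled afterwards; here memcpy stands in so the control flow is
// visible without a GPU.
class TransferBatcher {
 public:
  // Batch granularity is configurable here for testing; the PR uses 128MB.
  explicit TransferBatcher(size_t batchBytes = size_t{128} << 20)
      : batchBytes_(batchBytes) {}

  // Queue one chunk; flush once the pending batch reaches the threshold.
  void append(void* dst, const void* src, size_t bytes) {
    pending_.push_back({dst, src, bytes});
    pendingBytes_ += bytes;
    if (pendingBytes_ >= batchBytes_) {
      flush();
    }
  }

  // Barrier: called right before kernel launch, after kernel params are set.
  // The real code would additionally wait on a fence/event for completion.
  void wait() { flush(); }

  size_t flushCount() const { return flushCount_; }

 private:
  struct Copy {
    void* dst;
    const void* src;
    size_t bytes;
  };

  void flush() {
    if (pending_.empty()) return;
    for (const Copy& c : pending_) {
      std::memcpy(c.dst, c.src, c.bytes);  // stand-in for the GPU copy
    }
    ++flushCount_;  // the command list would be reset and reused, not freed
    pending_.clear();
    pendingBytes_ = 0;
  }

  size_t batchBytes_;
  std::vector<Copy> pending_;
  size_t pendingBytes_ = 0;
  size_t flushCount_ = 0;
};
```

Recycling the lists rather than destroying them is what removes the constant per-list overhead mentioned above.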
Why this design?
L0 has something called an "immediate command list" that seemingly should do what we want; however, the documentation says that it "may be synchronous". Indeed, it exhibits synchronous behavior on PVC but asynchronous behavior on Arc GPUs. The proposed solution is asynchronous on both Arc and PVC. The 128MB granularity is arbitrary. This design showed good scalability with fragment count and less overall overhead (measured with `ze_tracer`) compared to the current solution in an isolated L0 benchmark.
Multithreaded fetching
Since we may have many fragments (e.g., many cores, or we are in a heterogeneous mode), we will have more chunks to fetch, so why not perform CPU materializations in parallel and asynchronously send chunks to the GPU? Of course, we won't achieve perfect scaling due to non-data-transfer-related synchronization points (e.g., in the buffer manager), but the effect is still visible. This solution uses `tbb::task_arena limitedArena(16)`; no noticeable benefit beyond this thread count was observed.
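The shape of the parallel fetch is roughly the following. To keep the sketch self-contained, `std::async` with a window of in-flight tasks stands in for the capped `tbb::task_arena`, and "materialization" is simulated by filling a buffer; `fetchChunksParallel` and its parameters are illustrative names, not the PR's actual API.

```cpp
#include <algorithm>
#include <future>
#include <numeric>
#include <vector>

// Sketch: each task materializes one chunk on the CPU and would then hand it
// to the asynchronous transfer batcher (not shown); the barrier before kernel
// launch guarantees all transfers complete. maxThreads plays the role of the
// tbb::task_arena concurrency cap (16 in the PR).
std::vector<int> fetchChunksParallel(int numFragments, int maxThreads) {
  std::vector<int> chunkSums(numFragments, 0);
  std::vector<std::future<void>> tasks;
  for (int start = 0; start < numFragments; start += maxThreads) {
    int end = std::min(start + maxThreads, numFragments);
    for (int i = start; i < end; ++i) {
      tasks.push_back(std::async(std::launch::async, [&chunkSums, i] {
        // "Materialize" the chunk on the CPU (simulated here)...
        std::vector<int> chunk(1024, i);
        // ...then enqueue its async transfer; we only record a checksum.
        chunkSums[i] = std::accumulate(chunk.begin(), chunk.end(), 0);
      }));
    }
    for (auto& t : tasks) t.get();  // cap in-flight tasks, like the arena
    tasks.clear();
  }
  return chunkSums;
}
```

Each task writes to a distinct slot, so no locking is needed here; in the real code the shared buffer manager is exactly where the residual synchronization cost comes from.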
What about fetching data from GPU?
There is not much benefit in reorganizing data transfers from the GPU in an asynchronous fashion, since we do not expect to do as much in between transfers on the CPU side as we do while loading data to the GPU. Maybe someone will correct me.
Measurements
Taxi benchmark, 100 million rows. PVC + 128-core CPU.
Fully on GPU, 256 fragments. Values are speedup multipliers.

| Setup | Q1 fetching | Q2 fetching | Q3 fetching | Q4 fetching | End-to-End |
|---|---|---|---|---|---|
| 1 thread | 2 | 1 | 1.56 | 1.1 | 1.1 |
| `limitedArena(8)` | 2 | 3.3 | 4.68 | 3.8 | 1.32 |
| `limitedArena(16)` | 1.25 | 5 | 6.15 | 5.9 | 1.42 |
| `limitedArena(24)` | 1.25 | 4.23 | 4.38 | 3.65 | 1.4 |
50% on GPU, 50% on CPU, 256 fragments.

| Setup | End-to-End |
|---|---|
| 1 thread | no changes |
| `limitedArena(8)` | 1.31 |
| `limitedArena(16)` | 1.34 |
| `limitedArena(24)` | 1.38 |
Even for the default fragment size for GPU-only mode (30 million rows) we can see a speedup:
Fully on GPU, 4 fragments:

| Setup | Q1 fetching | Q2 fetching | Q3 fetching | Q4 fetching | End-to-End |
|---|---|---|---|---|---|
| 1 thread | 1 | 1.35 | 1.25 | 1.1 | 1.11 |
| `limitedArena(8)` | 1.23 | 2.47 | 2.28 | 2.66 | 1.26 |
Of course, the benefit shrinks the less we have to do on the CPU between data transfers. E.g., for zero-copy columns the best-case speedup was 1.2x. By the way, is there something we could move to after `fetchChunks()` but before `prepareKernelParams()`?
What about CUDA devices?
It is possible; the upper bound is a 2x faster data transfer (i.e., pinned vs. non-pinned). One needs to inform CUDA that malloc'ed CPU buffers (e.g., at slab level) are pinned, which can be done with `cuMemHostRegister()`.
But
The time CUDA needs to update its page tables almost matches the data transfer time (registering one CPU slab costs ~300 ms with `cuMemHostRegister()` vs. <2 ms without). So overall we get the same time as in synchronous mode. That is, instead of waiting while CUDA uses intermediate page-locked buffers for transfers from CPU pageable buffers to the GPU (SYNC case), we wait until it finishes updating its page tables (ASYNC case).
Both the SYNC and ASYNC cases are linear in data size, and the ASYNC one only makes sense if we get to the point of evictions (to leverage subsequent accelerated data transfers from "pinned" slabs), but then we are likely to suffer more from the evictions themselves anyway.
Additionally, not all CPU slabs may need to be registered (more complex logic required), and calling `cuMemHostRegister()` at column chunk level is too expensive.
Apart from that, if we crash, the mapping may persist, and we will have problems unregistering those memory regions in order to register new ones in the next run.
Moreover, calls to `cuMemHostUnregister()` are also linear in data size and have in fact proven to be even slower than `cuMemHostRegister()`.