intel / hdk

A low-level execution library for analytic data processing.
Apache License 2.0

[L0] Asynchronous data fetching #711

Open akroviakov opened 1 year ago

akroviakov commented 1 year ago

This PR introduces asynchronous (batched) data fetching for L0 GPUs. Its purpose is to reduce the end-to-end execution time of a workload.


Why?

We have recursive materializations (from disk to CPU to GPU). Once the CPU has its buffer materialized, we begin a data transfer and wait for its completion, but why should we? We can proceed to materializing the next buffer and only care about all of the data being on the GPU right before kernel execution. This way we can overlap memcpy for CPU buffers with the GPU data transfer and lose nothing. Additionally, we also hide the latencies caused by the buffer manager, which are constant per fragment and whose impact grows linearly with the fragment count.


How?

We batch transfers into a command list until it reaches 128MB worth of data, then we actually execute the transfer. Right after sending all of the kernel parameters (that is, right before kernel execution) we wait (barrier) until the data transfers have finished. Once the transfers are done, we keep and recycle the command lists to avoid the overhead of creating new ones or deleting them, which is again a constant overhead that grows linearly with the fragment count. A rough sketch of the batching logic is shown below.
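In rough form, the batching could look like the following sketch (simplified and hand-written for illustration, not the actual PR code; the class name and the `flush()`/`wait()` helpers are assumptions):

```cpp
// Simplified sketch of batched, asynchronous host->device transfers with
// Level Zero. Context/device/queue handles are assumed to be created
// elsewhere; error-code checking is omitted for brevity.
#include <level_zero/ze_api.h>
#include <cstddef>
#include <cstdint>
#include <vector>

class BatchedTransfer {
 public:
  // Record a copy; nothing is executed until the batch reaches ~128MB.
  void append(void* dst, const void* src, size_t bytes) {
    zeCommandListAppendMemoryCopy(current_, dst, src, bytes,
                                  /*hSignalEvent=*/nullptr,
                                  /*numWaitEvents=*/0, nullptr);
    pending_bytes_ += bytes;
    if (pending_bytes_ >= kBatchBytes) {
      flush();
    }
  }

  // Submit the recorded copies; the host does not wait here and can keep
  // materializing the next CPU buffer.
  void flush() {
    if (pending_bytes_ == 0) return;
    zeCommandListClose(current_);
    zeCommandQueueExecuteCommandLists(queue_, 1, &current_, nullptr);
    in_flight_.push_back(current_);
    current_ = acquireList();
    pending_bytes_ = 0;
  }

  // Barrier: called right before kernel execution, after the kernel
  // parameters have been sent.
  void wait() {
    flush();
    zeCommandQueueSynchronize(queue_, UINT64_MAX);
    for (auto list : in_flight_) {
      zeCommandListReset(list);  // recycle instead of destroying
      free_lists_.push_back(list);
    }
    in_flight_.clear();
  }

 private:
  // Reuse a recycled command list if available, otherwise create a new one.
  ze_command_list_handle_t acquireList() {
    if (!free_lists_.empty()) {
      auto list = free_lists_.back();
      free_lists_.pop_back();
      return list;
    }
    ze_command_list_desc_t desc{ZE_STRUCTURE_TYPE_COMMAND_LIST_DESC};
    ze_command_list_handle_t list{};
    zeCommandListCreate(ctx_, dev_, &desc, &list);
    return list;
  }

  static constexpr size_t kBatchBytes = 128u << 20;  // 128MB granularity
  ze_context_handle_t ctx_{};
  ze_device_handle_t dev_{};
  ze_command_queue_handle_t queue_{};
  ze_command_list_handle_t current_{};  // initialized via acquireList()
  std::vector<ze_command_list_handle_t> in_flight_;
  std::vector<ze_command_list_handle_t> free_lists_;
  size_t pending_bytes_{0};
};
```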

Why this design?

L0 has something called an "immediate command list" that seemingly should do what we want; however, the documentation says that it "may be synchronous". Indeed, on PVC it displays synchronous behavior, but it is asynchronous on Arc GPUs. The proposed solution is asynchronous on both Arc and PVC. The 128MB granularity is arbitrary. This design showed good scalability with the fragment count and less overall overhead (measured with ze_tracer) compared to the current solution in an isolated L0 benchmark. For contrast, the immediate-command-list alternative is sketched below.
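For reference, this is roughly what the rejected alternative looks like (a sketch, not PR code): an immediate command list submits each appended copy right away, and whether the host call returns before the copy completes is left to the driver even when asynchronous mode is requested.

```cpp
#include <level_zero/ze_api.h>

// Sketch of the "immediate command list" alternative. Copies are submitted as
// they are appended; ASYNCHRONOUS mode is only a request to the driver, and on
// PVC the observed behavior was synchronous.
ze_command_list_handle_t makeImmediateList(ze_context_handle_t ctx,
                                           ze_device_handle_t dev) {
  ze_command_queue_desc_t queue_desc{ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC};
  queue_desc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;  // may be ignored
  ze_command_list_handle_t imm_list{};
  zeCommandListCreateImmediate(ctx, dev, &queue_desc, &imm_list);
  return imm_list;
}

// Usage: each append is submitted immediately; there is no explicit
// execute/flush step as with a regular command list:
//   zeCommandListAppendMemoryCopy(imm_list, dst, src, bytes, nullptr, 0, nullptr);
```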


Multithreaded fetching

Since we may have many fragments (e.g., many cores, or we are in heterogeneous mode), we have more chunks to fetch, so why not perform the CPU materializations in parallel and send chunks to the GPU asynchronously? Of course, we won't achieve perfect scaling due to synchronization points unrelated to data transfer (e.g., in the buffer manager), but the effect is still visible. This solution uses tbb::task_arena limitedArena(16); no noticeable benefit was observed beyond this number. A sketch of the parallel fetch is shown below.
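The parallel fetch is shaped roughly like the following sketch (the chunk type and the `fetchChunkToCpu()`/`enqueueCopyToGpu()` helpers are placeholders, not HDK's actual interfaces):

```cpp
// Sketch of parallel chunk fetching under a concurrency-limited arena.
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>
#include <cstddef>
#include <vector>

struct Chunk {};  // placeholder chunk type

void fetchChunkToCpu(Chunk&) { /* disk -> CPU materialization */ }
void enqueueCopyToGpu(const Chunk&) { /* record copy in a batched command list */ }

void fetchChunksAsync(std::vector<Chunk>& chunks) {
  // Cap the fetch parallelism at 16 workers; beyond that no benefit was seen.
  tbb::task_arena limitedArena(16);
  limitedArena.execute([&] {
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, chunks.size()),
        [&](const tbb::blocked_range<size_t>& r) {
          for (size_t i = r.begin(); i != r.end(); ++i) {
            fetchChunkToCpu(chunks[i]);
            // The copy is only recorded/batched here (the batching object is
            // assumed to be safe to call from multiple threads); the barrier
            // that waits for all transfers happens right before kernel launch.
            enqueueCopyToGpu(chunks[i]);
          }
        });
  });
}
```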


What about fetching data from GPU?

There is not much benefit in reorganizing data transfers from the GPU in an asynchronous fashion, since we do not expect to do as much on the CPU side between transfers as we do while loading data to the GPU. Maybe someone will correct me.


Measurements

Taxi benchmark, 100 million rows. PVC + 128-core CPU.


Fully on GPU, 256 fragments. Read values as speedup multipliers.

| Setup | Q1 fetching | Q2 fetching | Q3 fetching | Q4 fetching | End-to-End |
| --- | --- | --- | --- | --- | --- |
| 1 thread | 2 | 1 | 1.56 | 1.1 | 1.1 |
| limitedArena(8) | 2 | 3.3 | 4.68 | 3.8 | 1.32 |
| limitedArena(16) | 1.25 | 5 | 6.15 | 5.9 | 1.42 |
| limitedArena(24) | 1.25 | 4.23 | 4.38 | 3.65 | 1.4 |
50% on GPU, 50% on CPU, 256 fragments.

| Setup | End-to-End |
| --- | --- |
| 1 thread | no changes |
| limitedArena(8) | 1.31 |
| limitedArena(16) | 1.34 |
| limitedArena(24) | 1.38 |
Even for the default fragment size in GPU-only mode (30 mil.) we can see a speedup. Fully on GPU, 4 fragments:

| Setup | Q1 fetching | Q2 fetching | Q3 fetching | Q4 fetching | End-to-End |
| --- | --- | --- | --- | --- | --- |
| 1 thread | 1 | 1.35 | 1.25 | 1.1 | 1.11 |
| limitedArena(8) | 1.23 | 2.47 | 2.28 | 2.66 | 1.26 |

Of course, the benefit shrinks as the amount of CPU work between data transfers decreases. E.g., for zero-copy columns the best-case speedup was 1.2x. Btw, is there something we could move to after fetchChunks() but before prepareKernelParams()?


What about CUDA devices?

It is possible; the upper bound is a 2x faster data transfer (i.e., pinned vs. non-pinned). One needs to page-lock the malloc'ed CPU buffers (e.g., at the slab level) so that CUDA treats them as pinned; cuMemHostRegister() can be used for that. But
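As a rough illustration of that idea (a sketch with the CUDA driver API, not part of this PR; the buffer and stream variables are placeholders):

```cpp
// Sketch: register an existing malloc'ed slab as page-locked so the
// host->device copy can be asynchronous and run at pinned-memory speed.
#include <cuda.h>
#include <cstdlib>

void copySlabAsync(CUdeviceptr dev_buf, size_t bytes, CUstream stream) {
  void* slab = std::malloc(bytes);  // placeholder for a slab-level allocation
  // ... fill the slab on the CPU side ...

  // Page-lock the existing allocation; without this, cuMemcpyHtoDAsync falls
  // back to staged (non-pinned) copies.
  cuMemHostRegister(slab, bytes, CU_MEMHOSTREGISTER_PORTABLE);

  // Asynchronous copy; the host can keep materializing the next buffer.
  cuMemcpyHtoDAsync(dev_buf, slab, bytes, stream);

  // Later, before kernel launch: cuStreamSynchronize(stream);
  // and eventually cuMemHostUnregister(slab) when the slab is freed.
}
```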

kurapov-peter commented 11 months ago

There seems to be a build error on win btw: https://github.com/intel-ai/hdk/actions/runs/6852649791/job/18969702088?pr=711#step:10:1064