wjones127 opened 11 months ago
+1, although I don't think we can replace `buffered` with `try_join_all`, because we still need the streaming interface (e.g. `try_join_all` on a stream of batches would load all batches into memory). We can replace it with something that has the same semantics but adds all calls to the I/O scheduler, though.
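To make the memory argument concrete, here is a minimal synchronous sketch (an analogy using plain iterators, not the real `futures` API; batch sizes and counts are made up): collecting everything up front, as `try_join_all` would, keeps every batch resident at once, while a streaming consumer keeps only one batch alive at a time.

```rust
fn main() {
    // 1000 "batches" of 1024 values each (illustrative only).
    let batches = (0..1000).map(|i| vec![i; 1024]);

    // try_join_all-style: materialize every batch before the caller
    // sees any of them, so peak memory is all 1000 batches at once.
    let all: Vec<Vec<i32>> = batches.clone().collect();
    assert_eq!(all.len(), 1000);

    // buffered/streaming-style: each batch is consumed (and freed)
    // before the next one is produced, so peak memory stays bounded.
    let mut seen = 0;
    for batch in batches {
        seen += 1;
        drop(batch); // freed before the next batch exists
    }
    assert_eq!(seen, 1000);
}
```

The point is only about peak memory; both shapes do the same total work.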
What's the scope of the scheduler? 1 per operation? 1 per dataset? 1 per process?
I think I lean towards either 1 per dataset or 1 per process. For ease of implementation, I think it makes most sense to wrap an object store.
WDYT?
Both 1 per dataset and 1 per process sound good to me. If I had to pick, I'd go with 1 per process, since the I/O constraints (memory, parallel I/O) are probably going to be machine-wide anyways. I'm not quite sure what it would look like wrapped around an `object_store`, though. Would you just use `buffered` and give it a really large count (e.g. instead of `num_cpus`, something like `MAX_SCHEDULED_TASKS`)?
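One way to picture a process-wide scheduler with a `MAX_SCHEDULED_TASKS`-style cap is a bounded worker pool: callers submit as many IO requests as they like, but at most N run at once. A minimal sketch using OS threads and channels (all names are hypothetical; the real implementation would be async and wrap the object store rather than use blocking threads):

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send>;

/// Sketch of an IO scheduler: a fixed pool of workers drains a shared
/// queue, so total IO parallelism is bounded by `max_parallel_io`
/// no matter how many callers submit tasks.
struct IoScheduler {
    tx: mpsc::Sender<Job>,
    workers: Vec<thread::JoinHandle<()>>,
}

impl IoScheduler {
    fn new(max_parallel_io: usize) -> Self {
        let (tx, rx) = mpsc::channel::<Job>();
        let rx = Arc::new(Mutex::new(rx));
        let workers = (0..max_parallel_io)
            .map(|_| {
                let rx = Arc::clone(&rx);
                thread::spawn(move || loop {
                    // Lock only long enough to pull the next task.
                    let job = rx.lock().unwrap().recv();
                    match job {
                        Ok(task) => task(),
                        Err(_) => break, // queue closed: shut down
                    }
                })
            })
            .collect();
        IoScheduler { tx, workers }
    }

    fn submit(&self, task: Job) {
        self.tx.send(task).unwrap();
    }

    fn shutdown(self) {
        drop(self.tx); // close the queue so workers exit
        for w in self.workers {
            w.join().unwrap();
        }
    }
}

fn main() {
    let scheduler = IoScheduler::new(4); // cap: 4 concurrent "reads"
    let (done_tx, done_rx) = mpsc::channel();
    for i in 0..16 {
        let done_tx = done_tx.clone();
        // Stand-in for an object-store read request.
        scheduler.submit(Box::new(move || done_tx.send(i).unwrap()));
    }
    drop(done_tx);
    let results: Vec<i32> = done_rx.iter().collect();
    assert_eq!(results.len(), 16);
    assert_eq!(results.iter().sum::<i32>(), 120);
    scheduler.shutdown();
}
```

With this shape, `buffered(huge_count)` at the call sites would stop being the parallelism control; the pool size would be.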
Motivation
Right now, our IO parallelism is choppy and inconsistent. This is essentially due to three issues: ad-hoc parallelism settings, no queuing of IO tasks, and CPU tasks blocking IO tasks.
Right now, our IO parallelism is determined ad hoc as we call `.buffered()` and `.buffer_unordered()` throughout the codebase. This leads to a sort of tree where we read `N` fragments in parallel, `N * M` batches in parallel (each fragment reading `M` batches in parallel), and `N * M * K` columns in parallel. Each level of parallelism is usually set to `num_cpus::get()` or `num_cpus::get() * 4`, which is sometimes adequate for machines with many CPUs but often inadequate for smaller machines. This is a very indirect way to control the amount of allowed IO parallelism, and it is not consistent throughout all code paths.

The `.buffered()` calls also cause the amount of IO parallelism to vary over time. After the first round of tasks, the system must wait for the first fragment's tasks to finish before a new fragment's worth of tasks can start. This causes a sort of sawtooth pattern in IO utilization.

In addition, we often interleave IO tasks and CPU-bound tasks. For example, while reading batches of data, we will apply filters or masks as we read. This often blocks new IO tasks from starting.
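To make the multiplication concrete, here is a small illustrative calculation (the core counts and the choice of which level gets the `* 4` are hypothetical, not taken from the codebase):

```rust
fn main() {
    // Hypothetical 8-core machine; each .buffered() level is sized
    // independently as num_cpus (or num_cpus * 4).
    let num_cpus = 8;
    let n = num_cpus;     // fragments read in parallel
    let m = num_cpus;     // batches in parallel per fragment
    let k = num_cpus * 4; // columns in parallel per batch

    // Worst-case concurrent IO tasks at the leaves of the tree:
    assert_eq!(n * m * k, 2048);

    // The same settings on a 2-core machine permit far fewer
    // concurrent reads, even if the storage device is identical:
    assert_eq!(2 * 2 * (2 * 4), 32);
}
```

The cap thus scales with the cube of the CPU count rather than with what the storage layer can actually sustain.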
Solution
We can move IO into a separate scheduler. Then, instead of needing `.buffered()`, we can simply create an entire vector of futures and use `try_join_all()` to wait for all of them to finish.

The right user-facing knobs
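A hypothetical sketch of what separated settings might look like: a process-wide `max_parallel_io` cap plus per-scan readahead limits (the struct name and the default values are illustrative, not proposed values):

```rust
/// Illustrative settings split: one global cap on IO parallelism,
/// plus per-scan readahead limits that bound memory use and
/// scheduling depth rather than IO parallelism itself.
#[derive(Debug, Clone)]
struct IoConfig {
    /// Directly bounds the number of concurrent IO requests.
    max_parallel_io: usize,
    /// Bound how far ahead a scan schedules work (and thus memory).
    fragment_readahead: usize,
    batch_readahead: usize,
}

impl Default for IoConfig {
    fn default() -> Self {
        // Hypothetical defaults for illustration only.
        IoConfig {
            max_parallel_io: 64,
            fragment_readahead: 4,
            batch_readahead: 16,
        }
    }
}

fn main() {
    let cfg = IoConfig::default();
    // Readahead no longer multiplies into IO parallelism: even with
    // fragment_readahead * batch_readahead tasks scheduled ahead,
    // at most max_parallel_io requests are in flight.
    let scheduled = cfg.fragment_readahead * cfg.batch_readahead;
    let in_flight = scheduled.min(cfg.max_parallel_io);
    assert_eq!(in_flight, 64);
}
```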
There are three resources of concern here:
Right now, the `{batch,fragment}_readahead` settings control all of (1), (2), and (3). With the scheduler, we can separate them: a more global setting `max_parallel_io` directly controls (2), while `{batch,fragment}_readahead` will control (1) and (3).

Potential extensions
By bringing IO calls to a more central location, we can also do more intelligent management of IO. This includes:
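One illustrative example of such management (my own example, not taken from the issue): with all reads flowing through one place, the scheduler could coalesce adjacent or nearby byte ranges into fewer object-store requests. A minimal sketch:

```rust
/// Merge sorted byte ranges that overlap or sit within `gap` bytes
/// of each other, so several small reads become one larger request.
fn coalesce(mut ranges: Vec<(u64, u64)>, gap: u64) -> Vec<(u64, u64)> {
    ranges.sort_by_key(|r| r.0);
    let mut out: Vec<(u64, u64)> = Vec::new();
    for (start, end) in ranges {
        match out.last_mut() {
            // Extend the previous range when this one starts close enough.
            Some(last) if start <= last.1 + gap => last.1 = last.1.max(end),
            _ => out.push((start, end)),
        }
    }
    out
}

fn main() {
    // Three column reads near one another become a single request;
    // the distant read stays separate.
    let reads = vec![(0, 100), (110, 200), (190, 250), (10_000, 10_100)];
    let merged = coalesce(reads, 32);
    assert_eq!(merged, vec![(0, 250), (10_000, 10_100)]);
}
```

This kind of optimization is only possible once requests are visible centrally rather than issued independently at each call site.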