maikel / senders-io

An adaptation of Senders/Receivers for async networking and I/O
Apache License 2.0

Simple implementation of read_batched #53

Closed. maikel closed this 1 year ago

maikel commented 1 year ago

For @mfbalin

maikel commented 1 year ago

The default behavior of this implementation dynamically allocates as many operations as needed (in sio::fork). In performance tests we might need to adjust the queue length of exec::io_uring_context in its constructor, and maybe we want to limit the number of concurrent ops with something like a sio::memory_pool.
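For reference, a minimal sketch of that tuning knob, assuming (as discussed further down in this thread) that the constructor argument of exec::io_uring_context is the submission-queue length:

    #include <exec/linux/io_uring_context.hpp>

    int main() {
      // Assumption: the constructor argument is the submission-queue length;
      // its default of 1024 caps the number of in-flight operations.
      exec::io_uring_context context{32000};
      auto scheduler = context.get_scheduler();
      (void)scheduler; // schedule the sio::fork'ed read operations here, then
                       // drive the context as the examples in this repo do
    }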

mfbalin commented 1 year ago

Thanks a lot, @maikel. My knowledge of senders is currently limited. As I work with this library and stdexec, I hope to understand them better and potentially make meaningful contributions in the future.

I will check how the performance fares shortly. Maybe I can contribute the benchmark script if you are interested in tracking the performance of the code.

maikel commented 1 year ago

@mfbalin I have tested a bit, and what I said before still holds. For performance tests, you need to adjust the size of the submission queue of io_uring itself. The default size of exec::io_uring_context is 1024 (chosen arbitrarily).

This means that only 1024 operations run concurrently. I have adjusted the test to also pass the submission queue length, with the following results.

➜  build git:(read_batched) ✗ ./examples/batched_reads 1024 10000000 1000000
Read 1000000 blocks of sizes upto 2048 bytes in time 63.0135s for an average of 15869.6 IOPS and an average copy rate of 0.015164 GiB/s
➜  build git:(read_batched) ✗ ./examples/batched_reads 32000 10000000 1000000
Read 1000000 blocks of sizes upto 2048 bytes in time 4.41948s for an average of 226271 IOPS and an average copy rate of 0.216211 GiB/s

I think that's much better, and I have to check the specs of my system before I can conclude anything. We are looking at random-read performance stats, right?

maikel commented 1 year ago

Aligning the reads to a 4 kB boundary might also increase performance. I will investigate this later.
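For anyone following along, a minimal sketch (plain C++17, not this repo's API) of what a 4 kB-aligned allocation looks like:

    #include <cstdlib>

    int main() {
      constexpr std::size_t block = 4096;
      // std::aligned_alloc requires the size to be a multiple of the alignment,
      // so round the requested size up to the next block boundary first.
      std::size_t requested = 6000;
      std::size_t size = (requested + block - 1) / block * block; // 8192
      void* buffer = std::aligned_alloc(block, size);
      std::free(buffer);
    }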

maikel commented 1 year ago

New numbers

➜  build git:(read_batched) ✗ ./examples/batched_reads 32000 10000000 1000000 ./ 
Read 1000000 blocks of size 4096 bytes in time 2.25525s for an average of 443410 IOPS and an average copy rate of 1.69147 GiB/s

I'm still thinking about using IORING_SETUP_IOPOLL and O_DIRECT. Something isn't working out the way I want in that case.

maikel commented 1 year ago

Disabling page caching by using O_DIRECT gives more realistic numbers, I think. We end up with:

➜  build git:(read_batched) ✗ ./examples/batched_reads 32000 10000000 1000000 ./ 
Read 1000000 blocks of size 4096 bytes in time 15.4618s for an average of 64675.6 IOPS and an average copy rate of 0.246718 GiB/s
➜  build git:(read_batched) ✗ ./examples/batched_reads 32000 10000000 1000000    
Read 1000000 blocks of size 4096 bytes in time 2.73355s for an average of 365825 IOPS and an average copy rate of 1.39551 GiB/s

Now it makes a difference whether one uses the SSD or the memory backend.
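For context, bypassing the page cache is a plain open(2) flag; the catch is that buffers, offsets, and sizes then have to be block-aligned (a hedged sketch, not this repo's API):

    #include <fcntl.h>
    #include <unistd.h>

    int open_direct(const char* path) {
      // O_DIRECT bypasses the kernel page cache, so reads hit the device itself;
      // buffers, offsets, and sizes must be aligned to the logical block size.
      int fd = ::open(path, O_RDONLY | O_DIRECT);
      return fd; // the caller checks for -1 and eventually calls ::close(fd)
    }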

mfbalin commented 1 year ago

The performance seems to be on par with fio. It looks like passing --numjobs=8 to fio increases the performance further. Is this the limit for a single thread, and how can we use multiple threads to push IOPS up to 1M as fio does? It looks like even with the memory backend, a single thread can't get past 1M IOPS. What is the right pattern for introducing multithreading here?

The following outputs are, in order: fio with a single job using libaio, fio with a single job using io_uring, fio with 8 jobs using io_uring, this example with the SSD backend, and this example with the memory backend.

What I don't understand is why io_uring in fio is slower than libaio.

root@a100cse:/localscratch/senders-io/build# fio --name=read_iops --directory=$TEST_DIR --size=10G --time_based --runtime=60s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=4K --iodepth=256 --rw=randread --group_reporting=1 --iodepth_batch_submit=256  --iodepth_batch_complete_max=256 --numjobs=1
read_iops: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=1096MiB/s][r=281k IOPS][eta 00m:00s]
read_iops: (groupid=0, jobs=1): err= 0: pid=1345288: Mon Sep  4 14:54:21 2023
  read: IOPS=281k, BW=1099MiB/s (1153MB/s)(64.4GiB/60001msec)
    slat (usec): min=2, max=3752, avg=491.61, stdev=158.62
    clat (nsec): min=1820, max=3963.8k, avg=307751.31, stdev=298194.45
     lat (usec): min=281, max=5279, avg=799.37, stdev=273.79
    clat percentiles (usec):
     |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    5],
     | 30.00th=[    5], 40.00th=[    7], 50.00th=[  225], 60.00th=[  433],
     | 70.00th=[  668], 80.00th=[  676], 90.00th=[  693], 95.00th=[  701],
     | 99.00th=[  742], 99.50th=[  758], 99.90th=[  898], 99.95th=[  930],
     | 99.99th=[ 1237]
   bw (  MiB/s): min= 1084, max= 1109, per=100.00%, avg=1100.19, stdev= 5.53, samples=119
   iops        : min=277662, max=284158, avg=281648.35, stdev=1415.92, samples=119
  lat (usec)   : 2=0.01%, 4=19.42%, 10=21.51%, 20=0.05%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=16.35%, 500=9.10%, 750=32.96%, 1000=0.57%
  lat (msec)   : 2=0.03%, 4=0.01%
  cpu          : usr=11.65%, sys=88.24%, ctx=5878, majf=0, minf=66
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.2%, >=64=99.8%
     submit    : 0=0.0%, 4=7.9%, 8=0.3%, 16=0.1%, 32=0.1%, 64=16.3%, >=64=75.5%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=100.0%
     issued rwts: total=16885567,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=1099MiB/s (1153MB/s), 1099MiB/s-1099MiB/s (1153MB/s-1153MB/s), io=64.4GiB (69.2GB), run=60001-60001msec

Disk stats (read/write):
    md1: ios=17446880/169, merge=0/0, ticks=3788296/0, in_queue=3788296, util=100.00%, aggrios=4354856/24, aggrmerge=6871/17, aggrticks=937857/3, aggrin_queue=937860, aggrutil=99.94%
  nvme3n1: ios=4355803/2, merge=6669/0, ticks=921319/0, in_queue=921319, util=99.94%
  nvme0n1: ios=4355741/39, merge=6974/6, ticks=950489/6, in_queue=950496, util=99.94%
  nvme4n1: ios=4354362/47, merge=7018/64, ticks=943277/6, in_queue=943282, util=99.93%
  nvme1n1: ios=4353519/11, merge=6824/0, ticks=936343/1, in_queue=936344, util=99.91%
root@a100cse:/localscratch/senders-io/build# fio --name=read_iops --directory=$TEST_DIR --size=10G --time_based --runtime=60s --ramp_time=2s --ioengine=io_uring --direct=1 --verify=0 --bs=4K --iodepth=256 --rw=randread --group_reporting=1 --iodepth_batch_submit=256  --iodepth_batch_complete_max=256 --numjobs=1
read_iops: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=256
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=938MiB/s][r=240k IOPS][eta 00m:00s]
read_iops: (groupid=0, jobs=1): err= 0: pid=1341439: Mon Sep  4 14:51:49 2023
  read: IOPS=238k, BW=931MiB/s (976MB/s)(54.6GiB/60002msec)
    slat (nsec): min=610, max=814765, avg=3772.50, stdev=3001.48
    clat (usec): min=799, max=4243, avg=1069.22, stdev=86.47
     lat (usec): min=857, max=4246, avg=1072.99, stdev=86.19
    clat percentiles (usec):
     |  1.00th=[  971],  5.00th=[  996], 10.00th=[ 1012], 20.00th=[ 1020],
     | 30.00th=[ 1037], 40.00th=[ 1045], 50.00th=[ 1057], 60.00th=[ 1074],
     | 70.00th=[ 1074], 80.00th=[ 1090], 90.00th=[ 1123], 95.00th=[ 1156],
     | 99.00th=[ 1614], 99.50th=[ 1663], 99.90th=[ 1745], 99.95th=[ 1778],
     | 99.99th=[ 1942]
   bw (  KiB/s): min=909979, max=975695, per=100.00%, avg=954685.22, stdev=13361.46, samples=120
   iops        : min=227494, max=243923, avg=238671.10, stdev=3340.36, samples=120
  lat (usec)   : 1000=6.12%
  lat (msec)   : 2=93.87%, 4=0.01%, 10=0.01%
  cpu          : usr=14.50%, sys=56.43%, ctx=3398558, majf=0, minf=1832
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=96.4%, 8=3.4%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=0.0%
     complete  : 0=0.0%, 4=96.4%, 8=3.5%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=0.1%
     issued rwts: total=14300707,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=931MiB/s (976MB/s), 931MiB/s-931MiB/s (976MB/s-976MB/s), io=54.6GiB (58.6GB), run=60002-60002msec

Disk stats (read/write):
    md1: ios=14756473/200, merge=0/0, ticks=1412352/0, in_queue=1412352, util=99.98%, aggrios=3689196/28, aggrmerge=0/21, aggrticks=355701/4, aggrin_queue=355705, aggrutil=99.95%
  nvme3n1: ios=3689429/15, merge=0/25, ticks=357825/1, in_queue=357827, util=99.95%
  nvme0n1: ios=3688549/48, merge=0/5, ticks=356513/8, in_queue=356521, util=99.95%
  nvme4n1: ios=3689672/25, merge=0/3, ticks=351278/5, in_queue=351283, util=99.95%
  nvme1n1: ios=3689136/27, merge=0/52, ticks=357189/3, in_queue=357191, util=99.95%
root@a100cse:/localscratch/senders-io/build# fio --name=read_iops --directory=$TEST_DIR --size=10G --time_based --runtime=60s --ramp_time=2s --ioengine=io_uring --direct=1 --verify=0 --bs=4K --iodepth=256 --rw=randread --group_reporting=1 --iodepth_batch_submit=256  --iodepth_batch_complete_max=256 --numjobs=8
read_iops: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=256
...
fio-3.28
Starting 8 processes
Jobs: 8 (f=8): [r(8)][100.0%][r=4660MiB/s][r=1193k IOPS][eta 00m:00s]  
read_iops: (groupid=0, jobs=8): err= 0: pid=1314430: Mon Sep  4 14:49:57 2023
  read: IOPS=1190k, BW=4649MiB/s (4875MB/s)(272GiB/60010msec)
    slat (nsec): min=680, max=8364.5k, avg=27639.95, stdev=67220.44
    clat (nsec): min=120, max=55072k, avg=1688812.10, stdev=4017191.72
     lat (usec): min=39, max=55075, avg=1716.45, stdev=4013.85
    clat percentiles (usec):
     |  1.00th=[    3],  5.00th=[   85], 10.00th=[   96], 20.00th=[  116],
     | 30.00th=[  143], 40.00th=[  174], 50.00th=[  281], 60.00th=[  502],
     | 70.00th=[  783], 80.00th=[ 1483], 90.00th=[ 4113], 95.00th=[ 9634],
     | 99.00th=[21890], 99.50th=[25297], 99.90th=[30802], 99.95th=[33162],
     | 99.99th=[39584]
   bw (  MiB/s): min= 3508, max= 6170, per=100.00%, avg=4652.94, stdev=69.02, samples=952
   iops        : min=898058, max=1579697, avg=1191150.60, stdev=17668.35, samples=952
  lat (nsec)   : 250=0.01%, 500=0.01%, 750=0.08%, 1000=0.17%
  lat (usec)   : 2=0.61%, 4=0.23%, 10=0.19%, 20=0.05%, 50=0.09%
  lat (usec)   : 100=10.73%, 250=36.09%, 500=11.62%, 750=9.34%, 1000=5.50%
  lat (msec)   : 2=8.67%, 4=6.41%, 10=5.40%, 20=3.48%, 50=1.34%
  lat (msec)   : 100=0.01%
  cpu          : usr=10.81%, sys=59.12%, ctx=15818430, majf=0, minf=12897
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=92.6%, 8=5.6%, 16=1.1%, 32=0.4%, 64=0.2%, >=64=0.1%
     complete  : 0=0.0%, 4=92.6%, 8=5.6%, 16=1.1%, 32=0.4%, 64=0.2%, >=64=0.1%
     issued rwts: total=71415368,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=4649MiB/s (4875MB/s), 4649MiB/s-4649MiB/s (4875MB/s-4875MB/s), io=272GiB (293GB), run=60010-60010msec

Disk stats (read/write):
    md1: ios=73608089/158, merge=0/0, ticks=123645812/16, in_queue=123645828, util=100.00%, aggrios=18422078/22, aggrmerge=0/17, aggrticks=30930721/8, aggrin_queue=30930730, aggrutil=99.99%
  nvme3n1: ios=18423636/2, merge=0/0, ticks=12924207/0, in_queue=12924208, util=99.98%
  nvme0n1: ios=18419315/58, merge=0/65, ticks=2354708/7, in_queue=2354715, util=99.98%
  nvme4n1: ios=18423123/21, merge=0/3, ticks=106064699/26, in_queue=106064726, util=99.99%
  nvme1n1: ios=18422240/9, merge=0/0, ticks=2379271/2, in_queue=2379273, util=99.98%
root@a100cse:/localscratch/senders-io/build# examples/batched_reads 32768 100000000 1000000 .
Read 1000000 blocks of size 4096 bytes in time 3.92982s for an average of 254465 IOPS and an average copy rate of 994.002 MiB/s
root@a100cse:/localscratch/senders-io/build# examples/batched_reads 32768 100000000 1000000
Read 1000000 blocks of size 4096 bytes in time 1.7697s for an average of 565066 IOPS and an average copy rate of 2207.29 MiB/s
mfbalin commented 1 year ago

I'm still thinking about using IORING_SETUP_IOPOLL and O_DIRECT. Something isn't working out the way I want in that case.

Why do we expect IORING_SETUP_IOPOLL to potentially improve the performance? As one can see, there are many reads in flight, which should already saturate the interface. If we care only about throughput and not latency, will it still improve the IOPS rate?

EDIT: Does using IORING_SETUP_IOPOLL correspond to iou-c in Figure 6 of this paper: https://atlarge-research.com/pdfs/2023-cheops-iostack.pdf? If so, it looks like IORING_SETUP_IOPOLL provides the highest throughput per thread overall.
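For reference, at the liburing level the flag is requested when the ring is set up; whether exec::io_uring_context exposes it is a separate question. A hedged sketch:

    #include <liburing.h>

    bool supports_iopoll(unsigned entries) {
      io_uring ring;
      io_uring_params params{};
      // IORING_SETUP_IOPOLL busy-polls the device for completions instead of
      // waiting for interrupts; it generally requires O_DIRECT descriptors.
      params.flags = IORING_SETUP_IOPOLL;
      if (io_uring_queue_init_params(entries, &ring, &params) < 0)
        return false;
      io_uring_queue_exit(&ring);
      return true;
    }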

mfbalin commented 1 year ago

I think that's much better, and I have to check the specs of my system before I can conclude anything. We are looking at random-read performance stats, right?

Yes, in my case I care about random-read performance, not necessarily aligned to 4 KiB boundaries, but for this test it is fine to align the reads so that we have a fair comparison against fio. I basically want a solution that can saturate the SSD with as few threads as possible. If, for example, there were 16 SSDs in the system and more than one thread were needed to saturate each, there wouldn't be many threads left to do other work :).

mfbalin commented 1 year ago

It looks like passing --registerfiles=1 --fixedbufs=1 to fio increases the performance by up to 15% (235k vs 272k IOPS). Is there a way to expose this functionality (registering files and buffers with io_uring) through the library API?
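For reference, this is roughly what --registerfiles/--fixedbufs map to at the liburing level; exposing it through senders-io would need equivalent hooks (a hedged sketch, not the library API):

    #include <liburing.h>
    #include <sys/uio.h>

    void register_resources(io_uring& ring, const int* fds, unsigned nfds,
                            const iovec* buffers, unsigned nbuffers) {
      io_uring_register_files(&ring, fds, nfds);            // pre-register descriptors
      io_uring_register_buffers(&ring, buffers, nbuffers);  // pin fixed buffers
      // Reads then use IORING_OP_READ_FIXED and index into the registered tables,
      // saving the per-request fd lookup and page pinning.
    }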

maikel commented 1 year ago

It looks like passing --registerfiles=1 --fixedbufs=1 to fio increases the performance by up to 15% (235k vs 272k IOPS). Is there a way to expose this functionality (registering files and buffers with io_uring) through the library API?

That's ongoing work. I'm quite sure it's possible, but I haven't started working on it yet. See issue #23.

mfbalin commented 1 year ago

I am working on a multithreaded version, but I can't seem to issue multiple read_batched requests to a context; the code seems to hang. None of the examples or tests demonstrate how to do that either. Do you have any hints?

maikel commented 1 year ago

If it hangs, then there is a bug. Do you have a reproducer?

It should be possible to submit from multiple threads. Driving a context is currently only possible from one thread, and driving it from more than one should throw an exception.

I will implement multithreading the way fio does it next.

mfbalin commented 1 year ago

What kind of sender pattern could make the read_batched function seamlessly multithreaded? Do we need to implement a multithreaded context that has multiple io_uring_contexts underneath?

maikel commented 1 year ago

What kind of sender pattern could make the read_batched function seamlessly multithreaded? Do we need to implement a multithreaded context that has multiple io_uring_contexts underneath?

Yes, you would need something like that.
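A rough sketch of what such a wrapper could look like, assuming one exec::io_uring_context per thread and round-robin dispatch; the name multi_io_context and the stop mechanism are made up for illustration:

    #include <exec/linux/io_uring_context.hpp>
    #include <cstddef>
    #include <deque>
    #include <thread>
    #include <vector>

    // Hypothetical wrapper, not part of senders-io: one io_uring_context per
    // thread, each driven on its own std::thread, handing out schedulers
    // round-robin so a batch can be split across rings.
    class multi_io_context {
     public:
      multi_io_context(std::size_t n_threads, unsigned queue_length) {
        for (std::size_t i = 0; i < n_threads; ++i) {
          // Assumes the constructor takes the submission-queue length.
          auto& ctx = contexts_.emplace_back(queue_length);
          threads_.emplace_back([&ctx] { ctx.run_until_stopped(); });
        }
      }

      // A read_batched call would fan its buffer/offset pairs out across the
      // schedulers returned here.
      auto get_scheduler() {
        return contexts_[next_++ % contexts_.size()].get_scheduler();
      }

      ~multi_io_context() {
        for (auto& ctx : contexts_) {
          ctx.request_stop(); // assumed stop mechanism; the real API may differ
        }
        for (auto& t : threads_) {
          t.join();
        }
      }

     private:
      std::deque<exec::io_uring_context> contexts_; // deque: contexts don't move
      std::vector<std::thread> threads_;
      std::size_t next_ = 0;
    };

The file would be opened once per context (or the descriptor duplicated) and each sub-batch submitted on its own scheduler.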

maikel commented 1 year ago

I've started on a multi-threaded example. I will push something soonish.

maikel commented 1 year ago

@mfbalin This simple implementation and interface have some drawbacks. The assumption that every buffer is preallocated is too strong. Instead, the total numbers of offsets and buffers should not be required to be equal. I imagine having a pool of buffers that can be reused. Otherwise, it doesn't scale to a large number of offsets.

I will continue implementing a good benchmarking application, similar to fio, to measure things like multi-threading and future features such as this pooling ability. Once we have a measurement tool, we will merge this and work on a scalable version of the algorithm.

OK?

mfbalin commented 1 year ago

Sounds good @maikel, I am thankful for what you have implemented so far. However, the outward-facing read_batched API needs to keep the same interface: buffer-offset pairs. Opening a file in direct mode imposes a big requirement: the buffers and the offsets need to be aligned to the block size. I will be doing my own caching, finer-grained than 4 KB blocks, so I need direct IO. What I have in mind for a scalable version that lifts this requirement, while staying very easy to use, is the following:

For any read_batched call, however many buffer-offset pairs there are provided by the user, the preallocated pool of blocks will be used in the read system calls. If a read in the batch doesn't fit into a single block in our pool (could be a simple ring of blocks private to each io_context and a read could be unaligned with size 9999 bytes while each block is 4KB), we can split it over multiple blocks using the async::read that takes multiple iovecs, or use consecutive blocks from the ring and have a single larger iovec. As reads are completed, we copy the bytes from the block pool to the buffers provided by the user and discard unnecessary bytes due to the alignment requirements (reads requested by the user will probably not be aligned). If we allocate all blocks in our pool as a single large buffer, then we can register this single buffer into io_uring and use parts of this same buffer for all read system calls using read calls with fixed buffers.
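To make the alignment bookkeeping concrete, a small sketch of the arithmetic for one unaligned request against 4 KiB pool blocks (the numbers are only an example):

    #include <cstddef>
    #include <cstdio>

    int main() {
      constexpr std::size_t block = 4096;
      // An unaligned user request, e.g. 9999 bytes starting at offset 6000.
      std::size_t offset = 6000, length = 9999;
      // The actual read has to start and end on block boundaries:
      std::size_t aligned_begin = offset / block * block;                       // 4096
      std::size_t aligned_end = (offset + length + block - 1) / block * block;  // 16384
      std::size_t blocks_needed = (aligned_end - aligned_begin) / block;        // 3
      // After completion, copy `length` bytes starting at offset - aligned_begin
      // inside the pool blocks into the user's buffer and discard the rest.
      std::printf("read %zu blocks starting at offset %zu, skip the first %zu bytes\n",
                  blocks_needed, aligned_begin, offset - aligned_begin);
    }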

If we could also accelerate a single read_batched call via multiple io_contexts seamlessly by having the file opened by all io_contexts simultaneously, that would be good enough that I can start using it as my read engine in the GNN library I am building. There, I will have a single file that contains a numpy array (possibly hundreds of Gigabytes in size) and I will need multiple slices of the array [a0:b0, a1:b1, a2:b2, ...] and these slices need to be fetched as fast as possible. Multiple threads can keep requesting work from the work queue using some grain size expressed in terms of total bytes copied in a single sub-batch of the batch.

Even though I know how to implement the idea above myself, I don't quite know how to do it efficiently using senders. Are the sequence sender building blocks in this repository meant to be merged into the stdexec repository in the future? What can I do to get familiar with the senders here as quickly as possible? Also, I wouldn't want to pass all my requirements onto you, so how can I start implementing this idea? I am bound to have a lot of questions about the sender facilities here while doing so though. If the above description of API seems reasonable to you and if it is general enough that others could make use of it too, I can work on it and contribute it to this repository.

If you would like, I can tell you more about what I am building and why I am building it. To make it efficient, I need to use asynchronicity and parallelism well, and I think what I am building can showcase almost all use cases of senders quite well (parallel algorithms, asynchronous IO, etc.). Senders seem to be the future of asynchronicity and parallelism for C++, so I want to use them and possibly contribute by showing other people that they can build their own efficient systems with senders.

maikel commented 1 year ago

Sounds good @maikel, I am thankful for what you have implemented so far. However, the outward-facing read_batched API needs to keep the same interface: buffer-offset pairs. Opening a file in direct mode imposes a big requirement: the buffers and the offsets need to be aligned to the block size. I will be doing my own caching, finer-grained than 4 KB blocks, so I need direct IO. What I have in mind for a scalable version that lifts this requirement, while staying very easy to use, is the following:

Having aligned buffers and using O_DIRECT is not a requirement for using this io context. I've used it in the benchmark to measure the real IO without the kernel's buffering obscuring the results.

I personally think that any user-facing IO should not need to disable buffering.

For any read_batched call, however many buffer-offset pairs there are provided by the user, the preallocated pool of blocks will be used in the read system calls. If a read in the batch doesn't fit into a single block in our pool (could be a simple ring of blocks private to each io_context and a read could be unaligned with size 9999 bytes while each block is 4KB), we can split it over multiple blocks using the async::read that takes multiple iovecs, or use consecutive blocks from the ring and have a single larger iovec. As reads are completed, we copy the bytes from the block pool to the buffers provided by the user and discard unnecessary bytes due to the alignment requirements (reads requested by the user will probably not be aligned). If we allocate all blocks in our pool as a single large buffer, then we can register this single buffer into io_uring and use parts of this same buffer for all read system calls using read calls with fixed buffers.

What I'm trying to say is that an API which requires N preallocated buffers doesn't scale when N gets very large, due to memory consumption. But if you have those buffers anyway, then it is no longer a concern. It is better to use them directly instead of an internal buffer ring, to prevent any unnecessary copies.

If we could also accelerate a single read_batched call via multiple io_contexts seamlessly by having the file opened by all io_contexts simultaneously, that would be good enough that I can start using it as my read engine in the GNN library I am building. There, I will have a single file that contains a numpy array (possibly hundreds of Gigabytes in size) and I will need multiple slices of the array [a0:b0, a1:b1, a2:b2, ...] and these slices need to be fetched as fast as possible. Multiple threads can keep requesting work from the work queue using some grain size expressed in terms of total bytes copied in a single sub-batch of the batch.

Yes, that's possible.

Even though I know how to implement the idea above myself, I don't quite know how to do it efficiently using senders. Are the sequence sender building blocks in this repository meant to be merged into the stdexec repository in the future? What can I do to get familiar with the senders here as quickly as possible? Also, I wouldn't want to pass all my requirements onto you, so how can I start implementing this idea? I am bound to have a lot of questions about the sender facilities here while doing so though. If the above description of API seems reasonable to you and if it is general enough that others could make use of it too, I can work on it and contribute it to this repository.

I will continue pushing those sequence senders to stdexec as PRs. I've been using them as an elegant way to do async destruction and compose async ranges. They are not completely my design: I took what I got from Kirk Shoop and Lewis Baker and implemented it on my own. There are still open issues, but I believe they make it easier to implement async algorithms.

If you would like, I can tell you more about what I am building and why I am building it. To make it efficient, I need to use asynchronicity and parallelism well, and I think what I am building can showcase almost all use cases of senders quite well (parallel algorithms, asynchronous IO, etc.). Senders seem to be the future of asynchronicity and parallelism for C++, so I want to use them and possibly contribute by showing other people that they can build their own efficient systems with senders.

Be warned, you would be the first user of this library. It began as an experiment to showcase what IO could look like. There is a proposal by Dietmar Kühl that proposes a different API without sequence senders.

mfbalin commented 1 year ago

Having aligned buffers and using O_DIRECT is not a requirement for using this io context. I've used it in the benchmark to measure the real IO without the kernel's buffering obscuring the results.

Yes, that is true. But OS buffering is a waste of resources for applications where the caching is done manually. I am planning on having a custom caching mechanism for these reads which will be even finer-grained than blocks of 4KB, hence I want to use O_DIRECT. My goal is to maximize performance.

What I'm trying to say is that an API which requires N preallocated buffers doesn't scale when N gets very large, due to memory consumption. But if you have those buffers anyway, then it is no longer a concern. It is better to use them directly instead of an internal buffer ring, to prevent any unnecessary copies.

That is true, and in my case I will already have them. But if you use O_DIRECT, then the buffers need to be aligned. The user might not always want to pass in aligned buffers and reads, hence the need for buffer pools. In my experience, OS buffering can be much slower than the O_DIRECT approach.

Be warned, you would be the first user of this library. It began as an experiment to showcase what IO could look like. There is a proposal by Dietmar Kühl that proposes a different API without sequence senders.

So long as you are welcoming to users and I can reach out with questions and issues, I am willing to step into unknown territory. The work I am doing is part of my PhD thesis on scalable Graph Neural Network training, and currently there are no users as it is a work in progress. However, if I can show that my library is much more efficient than existing work (https://www.dgl.ai/) by using modern C++ facilities, that could draw a lot of attention from multiple fronts: the C++ community, the machine learning community, etc. I will be making use of GPUs as well, so there will be a lot of different hardware resources (GPU, CPU, SSD, PCIe between GPU and CPU), and my goal is to keep them all busy at the same time, without waiting, by making use of senders and asynchronicity.