jiangliu opened this issue 2 years ago
I don't think the memory copy is a big overhead, because an SQE is only 64 bytes; see the nop bench, where the io_uring crate is not at a performance disadvantage because of this.
But I don't reject this idea; it would require some refactoring of the opcode module.
It would be great if we could also introduce a `pub trait Opcode`; then we could pass opcodes as generic types, such as `pub fn prepare`.
And I will give it a try and prepare some patches :)
This is not enough; I'd like to be able to specify some initialization parameters at the same time.
I would consider adding a `PrepareEntry` type that each opcode can initialize.
```rust
impl SubmissionQueue<'_> {
    fn prepare(&mut self) -> PrepareEntry<'_>;
}

struct PrepareEntry<'a> {
    ptr: *mut io_uring_sqe,
    tail: &'a mut u32,
}

struct InitializedEntry<'a, OPCODE> {
    ptr: *mut OPCODE,
    tail: &'a mut u32,
}

// auto commit: advance the tail when the entry is dropped
impl<'a, OPCODE> Drop for InitializedEntry<'a, OPCODE> {}

// note: `Target` belongs to `Deref`; `DerefMut` builds on it
impl<'a, OPCODE> Deref for InitializedEntry<'a, OPCODE> {
    type Target = OPCODE;
}
impl<'a, OPCODE> DerefMut for InitializedEntry<'a, OPCODE> {}

#[repr(transparent)]
struct Timeout(io_uring_sqe);

impl Timeout {
    fn init<'a>(entry: PrepareEntry<'a>, timespec: *const types::Timespec)
        -> InitializedEntry<'a, Timeout>;
}
```
The current opcodes always construct a separate struct and then build the sqe type from it; we need to make them modify the sqe in place directly.
I started to make opcodes support in-place init in the branch. This seems to cause some performance degradation for ZSTs like `Nop`, but it should have little effect in production.
I have given it a second try during the new year holiday and implemented a mechanism to prepare the SQE in place without generating the intermediate opcode structures. But the nop benchmark result is not so optimistic. On further investigation, it turns out that the nop benchmarks have been heavily optimized by the compiler.
```rust
/// Logically, this version works as:
/// - generate a Nop object
/// - convert the Nop object into an SQE object
/// - copy the generated SQE object into the submission queue
pub fn normal(mut sq: SubmissionQueue<'_>) {
    unsafe { let _ = sq.push(&crate::opcode::Nop::new().build()); }
}

/// Logically, this version works as:
/// - generate a Nop object
/// - convert the Nop object to an SQE in the submission queue in place
pub fn prepare(mut sq: SubmissionQueue<'_>) {
    unsafe { let _ = sq.push_command(&crate::opcode::Nop::new(), None); }
}

/// Logically, this version works as:
/// - generate an SQE in the submission queue in place
pub fn prepare_sqe(mut sq: SubmissionQueue<'_>) {
    unsafe {
        match sq.get_available_sqe(0) {
            Ok(sqe) => {
                let nop_sqe: &mut crate::opcode::NopSqe = sqe.into();
                nop_sqe.prepare();
                sq.move_forward(1);
            }
            Err(_) => return,
        }
    }
}
```
At first glance, `prepare_sqe()` should perform best and `normal()` worst. But it turns out that `normal()` performs best, and `prepare()` and `prepare_sqe()` are both worse than `normal()`. Strange enough. With more investigation, it turns out that the compiler has heavily optimized all three functions, and the final asm code is almost the same:
```
root@30a9199c95ff:/io-uring# cargo asm --build-type release --features=unstable io_uring::squeue::normal
io_uring::squeue::normal:
    mov rax, rsi
    shr rax, 32
    mov ecx, eax
    sub ecx, esi
    cmp dword, ptr, [rdi, +, 44], ecx
    je .LBB53_2
    mov rcx, qword, ptr, [rdi, +, 32]
    mov edx, dword, ptr, [rdi, +, 40]
    and edx, eax
    shl rdx, 6
    movabs rsi, -4294967296
    mov qword, ptr, [rcx, +, rdx], rsi
    xorps xmm0, xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 8], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 24], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 40], xmm0
    mov qword, ptr, [rcx, +, rdx, +, 56], 0
    add eax, 1
.LBB53_2:
    mov rcx, qword, ptr, [rdi, +, 8]
    mov dword, ptr, [rcx], eax
    ret
```

```
root@30a9199c95ff:/io-uring# cargo asm --build-type release --features=unstable io_uring::squeue::prepare
io_uring::squeue::prepare:
    mov rax, rsi
    shr rax, 32
    mov ecx, eax
    sub ecx, esi
    cmp dword, ptr, [rdi, +, 44], ecx
    je .LBB54_2
    mov rcx, qword, ptr, [rdi, +, 32]
    mov edx, dword, ptr, [rdi, +, 40]
    and edx, eax
    shl rdx, 6
    xorps xmm0, xmm0
    movups xmmword, ptr, [rcx, +, rdx], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 48], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 32], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 16], xmm0
    mov dword, ptr, [rcx, +, rdx, +, 4], -1
    add eax, 1
.LBB54_2:
    mov rcx, qword, ptr, [rdi, +, 8]
    mov dword, ptr, [rcx], eax
    ret
```

```
root@30a9199c95ff:/io-uring# cargo asm --build-type release --features=unstable io_uring::squeue::prepare_sqe
io_uring::squeue::prepare_sqe:
    mov rax, rsi
    shr rax, 32
    mov ecx, eax
    sub ecx, esi
    cmp dword, ptr, [rdi, +, 44], ecx
    je .LBB55_2
    mov rcx, qword, ptr, [rdi, +, 32]
    mov edx, dword, ptr, [rdi, +, 40]
    and edx, eax
    shl rdx, 6
    xorps xmm0, xmm0
    movups xmmword, ptr, [rcx, +, rdx], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 48], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 32], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 16], xmm0
    mov dword, ptr, [rcx, +, rdx, +, 4], -1
    add eax, 1
.LBB55_2:
    mov rcx, qword, ptr, [rdi, +, 8]
    mov dword, ptr, [rcx], eax
    ret
```
They are almost the same, but there's still a minor difference in the core part. Let's get rid of the prologue and epilogue, which are identical; then we get:
```
normal:
    movabs rsi, -4294967296
    mov qword, ptr, [rcx, +, rdx], rsi
    xorps xmm0, xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 8], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 24], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 40], xmm0
    mov qword, ptr, [rcx, +, rdx, +, 56], 0

prepare & prepare_sqe:
    xorps xmm0, xmm0
    movups xmmword, ptr, [rcx, +, rdx], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 48], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 32], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 16], xmm0
    mov dword, ptr, [rcx, +, rdx, +, 4], -1
```
Logically the two asm versions above have the same effect, but the `normal()` version performs slightly better.
So when dealing with more complex opcodes, where the compiler can't optimize `normal()` as heavily, `prepare_sqe()` should show a performance benefit.
Hello,
I observed that ublk-null[1], built over tokio-rs/io-uring, performs ~10% worse than the C version miniublk[2] (1.7M vs. 1.9M IOPS) when running:
`fio/t/io_uring /dev/ublkb0`
`/dev/ublkb0` is created by ublk-null[1] or miniublk[2].
ublk-null basically just uses io_uring cmd(16) for communicating with the driver; the logic is translated from the C version and the handling is pretty simple. All of the fast-path implementation is in `UblkQueue::process_io()` and `UblkQueue::complete_io()` <- `queue_io()` of `libublk::UblkQueueImpl` for `NullQueue`.
And I guess the gap might come from the SQE copy.
[1] https://github.com/ming1/libublk-rs [2] https://github.com/osandov/blktests/blob/master/src/miniublk.c
@ming1 I'm a bit skeptical about this; your code has a lot of `borrow` and `borrow_mut` calls and extra heap allocations in `reap_events_uring`. It would be more useful if you could give a detailed performance analysis.
Hello @quininer
Thanks for the response!
All `borrow()` and `borrow_mut()` calls are done in the same single-threaded context on the fast IO path. As a Rust beginner, my understanding is that these just add runtime borrow checking, which shouldn't be expensive; they are only there for interior mutability.
Otherwise, `q.ops` (the `UblkQueueImpl` trait object) would have to be passed as a function parameter to most methods of `UblkQueue`.
The heap allocation in `reap_events_uring()` can be removed easily; it is also there to satisfy borrow checking. I just tried removing it and did not see an obvious performance difference.
Thanks,
It's hard to determine the problem by just guessing; I'm only pointing out some differences between your code and the C version.
It would be useful to have an analysis based on tools such as `perf`.
@quininer
Yeah, I agree.
Today I wrote a small test program against the io-uring crate to test IO in a way similar to fio/t/io_uring, and the result shows that IOPS is basically the same under the same settings.
So I started to investigate on the ublk side; it turns out the gap is mainly from an unaligned buffer, and another effect could be from one command buffer move.
After addressing these two things, libublk now performs as well as miniublk from the blktests project. And I have committed the fixes to libublk-rs already.
Thanks
Currently the way to submit an SQE is:

1) prepare an opcode object `op`
2) convert the opcode into an SQE by `op.build()`
3) submit the SQE to the SQ by `sq.push()`/`sq.push_multiple()`

It would be great to provide interfaces to prepare the SQE in place, reducing one memory copy:

1) prepare an opcode object `op`
2) check space left on the SQ and get an available SQE by `let sqe: &mut Entry = sq.prepare()`
3) prepare the SQE in place by `op.prepare(sqe)`
4) commit the sqe by `sq.commit()`