jiangliu opened this issue 2 years ago
I don't think the memory copy is a big overhead, because an SQE is only 64 bytes; see the nop bench, where the io_uring crate is not at a performance disadvantage because of this.
But I don't reject this idea; it would require some refactoring of the opcode module.
It would be great if we could also introduce a `pub trait Opcode`; then we could pass opcodes as generic types, such as `pub fn prepare`.
And I will give it a try and prepare some patches :)
This is not enough; I'd like to be able to specify some initialization parameters at the same time.
I would consider adding a `PrepareEntry` type that each opcode can initialize.
```rust
impl SubmissionQueue<'_> {
    fn prepare(&mut self) -> PrepareEntry<'_>;
}

struct PrepareEntry<'a> {
    ptr: *mut io_uring_sqe,
    tail: &'a mut u32,
}

struct InitializedEntry<'a, OPCODE> {
    ptr: *mut OPCODE,
    tail: &'a mut u32,
}

// auto commit: advance the tail when the entry is dropped
impl<'a, OPCODE> Drop for InitializedEntry<'a, OPCODE> {}

// note: `Target` belongs to `Deref`; `DerefMut` builds on it
impl<'a, OPCODE> Deref for InitializedEntry<'a, OPCODE> {
    type Target = OPCODE;
}
impl<'a, OPCODE> DerefMut for InitializedEntry<'a, OPCODE> {}

#[repr(transparent)]
struct Timeout(io_uring_sqe);

impl Timeout {
    fn init<'a>(entry: PrepareEntry<'a>, timespec: *const types::Timespec)
        -> InitializedEntry<'a, Timeout>;
}
```
The current opcodes always construct a separate struct and then build the sqe type from it; we need to make them modify the sqe in place directly.
I started to make opcodes support in-place init in the branch. This seems to cause some performance degradation for ZSTs like `Nop`, but it should have little effect in production.
I have given it a second try during the new year holiday and implemented a mechanism to prepare the SQE in place without generating the intermediate opcode structures. But the nop benchmark result is not so optimistic. On further investigation, it turns out that the nop benchmarks have been heavily optimized by the compiler.
```rust
/// Logically, this version works as:
/// - generate a Nop object
/// - convert the Nop object into an SQE object
/// - copy the generated SQE object into the submission queue
pub fn normal(mut sq: SubmissionQueue<'_>) {
    unsafe { let _ = sq.push(&crate::opcode::Nop::new().build()); }
}

/// Logically, this version works as:
/// - generate a Nop object
/// - convert the Nop object to an SQE in the submission queue in place
pub fn prepare(mut sq: SubmissionQueue<'_>) {
    unsafe { let _ = sq.push_command(&crate::opcode::Nop::new(), None); }
}

/// Logically, this version works as:
/// - generate an SQE in the submission queue in place
pub fn prepare_sqe(mut sq: SubmissionQueue<'_>) {
    unsafe {
        match sq.get_available_sqe(0) {
            Ok(sqe) => {
                let nop_sqe: &mut crate::opcode::NopSqe = sqe.into();
                nop_sqe.prepare();
                sq.move_forward(1);
            }
            Err(_) => return,
        }
    }
}
```
At first glance, `prepare_sqe()` should perform best and `normal()` worst. But it turns out that `normal()` performs best, and `prepare()` and `prepare_sqe()` are both worse than `normal()`. Strange enough. With more investigation, it turns out that the compiler has heavily optimized all three functions, and the final asm code is almost the same:
```
root@30a9199c95ff:/io-uring# cargo asm --build-type release --features=unstable io_uring::squeue::normal
io_uring::squeue::normal:
    mov rax, rsi
    shr rax, 32
    mov ecx, eax
    sub ecx, esi
    cmp dword, ptr, [rdi, +, 44], ecx
    je .LBB53_2
    mov rcx, qword, ptr, [rdi, +, 32]
    mov edx, dword, ptr, [rdi, +, 40]
    and edx, eax
    shl rdx, 6
    movabs rsi, -4294967296
    mov qword, ptr, [rcx, +, rdx], rsi
    xorps xmm0, xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 8], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 24], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 40], xmm0
    mov qword, ptr, [rcx, +, rdx, +, 56], 0
    add eax, 1
.LBB53_2:
    mov rcx, qword, ptr, [rdi, +, 8]
    mov dword, ptr, [rcx], eax
    ret
```

```
root@30a9199c95ff:/io-uring# cargo asm --build-type release --features=unstable io_uring::squeue::prepare
io_uring::squeue::prepare:
    mov rax, rsi
    shr rax, 32
    mov ecx, eax
    sub ecx, esi
    cmp dword, ptr, [rdi, +, 44], ecx
    je .LBB54_2
    mov rcx, qword, ptr, [rdi, +, 32]
    mov edx, dword, ptr, [rdi, +, 40]
    and edx, eax
    shl rdx, 6
    xorps xmm0, xmm0
    movups xmmword, ptr, [rcx, +, rdx], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 48], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 32], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 16], xmm0
    mov dword, ptr, [rcx, +, rdx, +, 4], -1
    add eax, 1
.LBB54_2:
    mov rcx, qword, ptr, [rdi, +, 8]
    mov dword, ptr, [rcx], eax
    ret
```

```
root@30a9199c95ff:/io-uring# cargo asm --build-type release --features=unstable io_uring::squeue::prepare_sqe
io_uring::squeue::prepare_sqe:
    mov rax, rsi
    shr rax, 32
    mov ecx, eax
    sub ecx, esi
    cmp dword, ptr, [rdi, +, 44], ecx
    je .LBB55_2
    mov rcx, qword, ptr, [rdi, +, 32]
    mov edx, dword, ptr, [rdi, +, 40]
    and edx, eax
    shl rdx, 6
    xorps xmm0, xmm0
    movups xmmword, ptr, [rcx, +, rdx], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 48], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 32], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 16], xmm0
    mov dword, ptr, [rcx, +, rdx, +, 4], -1
    add eax, 1
.LBB55_2:
    mov rcx, qword, ptr, [rdi, +, 8]
    mov dword, ptr, [rcx], eax
    ret
```
They are almost the same, but there's still a minor difference in the core part. Let's get rid of the prologue and epilogue, which are identical; then we get:
```
normal:
    movabs rsi, -4294967296
    mov qword, ptr, [rcx, +, rdx], rsi
    xorps xmm0, xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 8], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 24], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 40], xmm0
    mov qword, ptr, [rcx, +, rdx, +, 56], 0

prepare & prepare_sqe:
    xorps xmm0, xmm0
    movups xmmword, ptr, [rcx, +, rdx], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 48], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 32], xmm0
    movups xmmword, ptr, [rcx, +, rdx, +, 16], xmm0
    mov dword, ptr, [rcx, +, rdx, +, 4], -1
```
Logically the two asm versions above have the same effect, but the `normal()` version performs slightly better.
So when dealing with more complex opcodes, where the compiler can't optimize `normal()` as heavily, `prepare_sqe()` should show a performance benefit.
Hello,
I observed that ublk-null[1], built over tokio-rs/io-uring, performs ~10% worse than the C version miniublk[2] (1.7M vs. 1.9M IOPS) when running:
`fio/t/io_uring /dev/ublkb0`
`/dev/ublkb0` is created by ublk-null[1] or miniublk[2].
ublk-null basically just uses io_uring cmd(16) for communicating with the driver; the logic is translated from the C version and the handling is pretty simple. All of the fast-path implementation is in `UblkQueue::process_io()` and `UblkQueue::complete_io()` <- `queue_io()` of `libublk::UblkQueueImpl` for `NullQueue`.
And I guess the gap might come from the SQE copy.
[1] https://github.com/ming1/libublk-rs [2] https://github.com/osandov/blktests/blob/master/src/miniublk.c
@ming1 I'm a bit skeptical about this; your code has a lot of `borrow` and `borrow_mut` calls and extra heap allocations in `reap_events_uring`. It would be more useful if you could give a detailed performance analysis.
Hello @quininer
Thanks for the response!
All `borrow()` and `borrow_mut()` calls are done in the same single-threaded context on the fast IO path. As a Rust beginner, my understanding is that these just add runtime borrow checking, which shouldn't be expensive; they are only there for interior mutability.
Otherwise, `q.ops` (the `UblkQueueImpl` trait object) would have to be passed as a function parameter to most methods of `UblkQueue`.
The heap allocation in `reap_events_uring()` can be removed easily; it is also there to satisfy borrow checking. I just tried removing it and did not see an obvious performance difference.
Thanks,
It's hard to determine the problem by just guessing; I'm only pointing out some differences between your code and the C version.
It would be useful to have an analysis based on tools such as `perf`.
@quininer
Yeah, I agree.
Today I wrote a small test program against the io-uring crate to test IO in a way similar to fio/t/io_uring, and the result shows that IOPS is basically the same under the same settings.
So I started to investigate on the ublk side; it turns out the gap is mainly from an unaligned buffer, and another effect could be from one command buffer move.
After addressing these two things, libublk now performs as well as miniublk from the blktests project. And I have committed the fixes to libublk-rs already.
Thanks
Currently the way to submit an SQE is:

1) prepare an opcode object `op`
2) convert the opcode into an SQE by `op.build()`
3) submit the SQE to the SQ by `sq.push()`/`sq.push_multiple()`

It would be great to provide interfaces to prepare the SQE in place, reducing one memory copy:

1) prepare an opcode object `op`
2) check space left on the SQ and get an available SQE by `let sqe: &mut Entry = sq.prepare()`
3) prepare the SQE in place by `op.prepare(sqe)`
4) commit the sqe by `sq.commit()`