ublk-org / ublksrv

ublk: userspace block device driver

ublkdrv supporting zero copy #64

Open · kyleshu opened this issue 6 months ago

kyleshu commented 6 months ago

This might not be the right place to ask this question, but I saw you are working on a solution. @ming1 I am comparing two approaches to using a TCP NVMe-oF target from a host:

1) attach it directly to the kernel

2) attach it to an SPDK application and expose it to the kernel through ublkdrv

When I run a 4KB QD1 random write workload on them, the second approach shows an additional ~20us average latency and ~50x worse tail latency (110us vs 5900us at p99.99). I suspect most of the overhead comes from the memory copy and could be avoided with a zero-copy implementation. Do you have a working prototype of the zero-copy ublk driver I can try?
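For reference, a run along these lines can be reproduced with a fio invocation like the sketch below. The device path /dev/ublkb0 and the runtime are assumptions, not taken from the report above; point --filename at the kernel-attached NVMe-oF device for the first approach and at the ublk device for the second.

# 4KB QD1 random write, direct I/O (sketch; adjust the device path)
fio --name=randwrite-qd1 \
    --filename=/dev/ublkb0 \
    --rw=randwrite --bs=4k --iodepth=1 \
    --ioengine=libaio --direct=1 \
    --runtime=60 --time_based \
    --percentile_list=99.99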

ming1 commented 6 months ago

Last year I posted zero-copy patches [1], but they were not accepted.

The biggest concern was that the interface is too specific.

I plan to restart the work this year after further thinking and investigation.

[1] https://lore.kernel.org/linux-block/20230330113630.1388860-1-ming.lei@redhat.com/

ming1 commented 6 months ago

> When I run a 4KB QD1 random write workload on them, the second approach shows an additional ~20us average latency and ~50x worse tail latency (110us vs 5900us at p99.99).

Zero copy usually makes a noticeably bigger difference for large IOs; as I recall, the difference starts to show up from around 64K IO sizes.

For 4K IO, a single copy shouldn't have a big effect.

I guess it is because of QD1.

The communication cost at QD1 can't be neglected, and ublk is expected to perform well in the high-QD case.
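One way to check this hypothesis is to sweep the queue depth and see where the ublk path starts to catch up. A sketch along these lines (the device path and runtime are assumptions):

# rerun the same 4K random write at increasing queue depths
for qd in 1 4 16 64; do
    fio --name=randwrite-qd$qd --filename=/dev/ublkb0 \
        --rw=randwrite --bs=4k --iodepth=$qd \
        --ioengine=libaio --direct=1 \
        --runtime=30 --time_based
done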

tiagolobocastro commented 5 months ago

Hi, I've also been experimenting with ublk recently.

If I use SPDK to expose an NVMe device over ublk, I find I get:

ublk over SPDK: IOPS=54.3k, BW=212MiB/s, lat (usec): min=16, max=125, avg=17.79, stdev=1.65
raw device: IOPS=89.7k, BW=350MiB/s, lat (usec): min=9, max=124, avg=10.75, stdev=1.99

With 16QD I get:

ublk over SPDK: IOPS=209k, BW=817MiB/s, lat (usec): min=46, max=11941, avg=76.22, stdev=332.83
raw device: IOPS=510k, BW=1993MiB/s, lat (usec): min=11, max=171, avg=31.20, stdev=2.56

Are these the results which you'd expect to see?
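For context, the SPDK side of such a setup is typically wired up with SPDK's ublk RPCs, roughly as in the sketch below. This is a sketch only: the bdev name Nvme0n1, the PCI address, and the resulting /dev/ublkb1 node are examples, and the exact RPC arguments may differ between SPDK versions.

# start the ublk target inside the running SPDK app, attach a local NVMe
# controller as a bdev, then export that bdev as a ublk block device
scripts/rpc.py ublk_create_target
scripts/rpc.py bdev_nvme_attach_controller -b Nvme0 -t pcie -a 0000:01:00.0
scripts/rpc.py ublk_start_disk Nvme0n1 1    # ublk id 1, expected at /dev/ublkb1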

ming1 commented 5 months ago

> Hi, I've also been experimenting with ublk recently.
>
> If I use SPDK to expose an NVMe device over ublk, I find I get:
>
> ublk over SPDK: IOPS=54.3k, BW=212MiB/s, lat (usec): min=16, max=125, avg=17.79, stdev=1.65
> raw device: IOPS=89.7k, BW=350MiB/s, lat (usec): min=9, max=124, avg=10.75, stdev=1.99
>
> With 16QD I get:
>
> ublk over SPDK: IOPS=209k, BW=817MiB/s, lat (usec): min=46, max=11941, avg=76.22, stdev=332.83
> raw device: IOPS=510k, BW=1993MiB/s, lat (usec): min=11, max=171, avg=31.20, stdev=2.56
>
> Are these the results which you'd expect to see?

No, definitely not; the gap isn't supposed to be this big, at least for 16QD.

What is the result when you run the test on ublk-loop?

BTW, performance improvements are in progress:

1) zero copy support

2) bpf support

The final goal is to bring ublk performance in line with the kernel driver, or at least make the gap small enough.

Thanks,

tiagolobocastro commented 5 months ago

> What is the result when you run the test on ublk-loop?

sudo $ublk add -t loop -f /dev/nvme0n1

With 16QD I get: IOPS=223k, BW=871MiB/s, lat (usec): min=25, max=447, avg=71.56

So a little better than my SPDK device, but not much.

> The final goal is to bring ublk performance in line with the kernel driver, or at least make the gap small enough.

That would be great! Thanks for all your efforts on this btw, it's awesome! Let me know if you ever need some testing.

Thanks