ikorennoy / jasyncfio

Java asynchronous file I/O based on the Linux io_uring interface
Apache License 2.0

Fixed-buffer performance and use cases? #64

Closed: GavinRay97 closed this issue 1 year ago

GavinRay97 commented 1 year ago

I'm curious about when you might want to use .readFixedBuffer().

From what I've read, it seems like it could be ideal if you know you're dealing with fixed buffer sizes.

Are there any drawbacks to using it?

I'm experimenting with using jasyncfio for I/O in a database, where the usage looks something like the code below. Does it make sense to use readFixedBuffer() here? (Also, out of curiosity, why is there no writeFixedBuffer()?)

Thank you =)

import java.lang.foreign.MemorySegment
import java.nio.file.Path
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.future.await
import kotlinx.coroutines.withContext
import one.jasyncfio.AsyncFile
import one.jasyncfio.EventExecutor

const val PAGE_SIZE = 4096

class AsyncDiskManager(dbFile: Path) : IDiskManager {
    // Ring of 1024 fixed buffers, one page each
    private val eventExecutor = EventExecutor.builder().withBufRing(1024, PAGE_SIZE).build()
    private val asyncFile = AsyncFile.open(dbFile, eventExecutor).get()

    suspend fun readPage(pageId: Long, buffer: MemorySegment) = withContext(Dispatchers.IO) {
        asyncFile.read(buffer.asByteBuffer(), pageId * PAGE_SIZE).await()
    }

    suspend fun readPageFixedBuffer(pageId: Long, buffer: MemorySegment) = withContext(Dispatchers.IO) {
        asyncFile.readFixedBuffer(pageId * PAGE_SIZE).await().use {
            buffer.asByteBuffer().put(it.buffer)
        }
    }

    suspend fun writePage(pageId: Long, buffer: MemorySegment): Int = withContext(Dispatchers.IO) {
        asyncFile.write(buffer.asByteBuffer(), pageId * PAGE_SIZE).await()
    }
}
ikorennoy commented 1 year ago

Hi! Fixed buffers are a natural way of using buffers with the io_uring completion model and can improve memory usage efficiency. You can find more details in Jens Axboe's latest talk on io_uring: Kernel Recipes 2022 - What’s new with io_uring

Please note that my library implements only the second version of the fixed-buffer API, also called BufRing. It works starting with Linux kernel 5.19.

About writeFixedBuffer: the idea is that when the buffer ring is created, all of the buffers are owned by the kernel, and it may choose any of them for a subsequent request, corrupting any data you had put into a buffer ahead of a write. But when a readFixedBuffer request completes, the kernel transfers ownership of the buffer to the application, so you can use that buffer for a subsequent write request and then return ownership to the kernel. That's a brief explanation; if you have questions, feel free to ask :)
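A rough sketch of that ownership flow, using only the calls from your snippet (the copyPage helper and copyOffset are made up for illustration, and the imports assume the library's Maven coordinates):

import kotlinx.coroutines.future.await
import one.jasyncfio.AsyncFile

// 1. readFixedBuffer completes: the kernel picks a buffer from the ring
//    and hands ownership of it to the application.
// 2. While we own it, its contents are stable, so the same buffer can be
//    passed to a write request.
// 3. Closing the result (here via `use`) returns the buffer to the ring.
suspend fun copyPage(file: AsyncFile, srcOffset: Long, copyOffset: Long) {
    file.readFixedBuffer(srcOffset).await().use { res ->
        file.write(res.buffer, copyOffset).await()
    }
}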

Also, I've noticed you use Kotlin. Jasyncfio targets Java 8+ and uses CompletableFuture. Mixing CompletableFuture with coroutines introduces unnecessary overhead into your application. I have a library targeting Kotlin coroutines. It uses the same API and the same internal implementation, but without CompletableFuture. Currently it lives in a private repository, and I don't have public releases (I'm trying to identify hot spots and optimize them). Still, it is as mature as jasyncfio (if 'mature' can be used at all :)), and you can definitely use it for your experiments. The only thing is that it has no Maven publication, not even to Maven local, but if you're interested I'll add one. Also, the good news is that it performs better than the CompletableFuture API and has a small benchmark harness, so you can try different io_uring configurations and get some numbers :)
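To make the bridging overhead concrete, here is a sketch (the suspending signature at the end is hypothetical and stands in for the Kotlin lib's API):

import java.nio.ByteBuffer
import kotlinx.coroutines.future.await
import one.jasyncfio.AsyncFile

// jasyncfio returns a CompletableFuture, which the kotlinx-coroutines-jdk8
// await() adapter bridges into a suspension: an extra object and an extra
// completion hop on every read.
suspend fun readViaFuture(file: AsyncFile, buf: ByteBuffer, offset: Long): Int =
    file.read(buf, offset).await()

// A coroutine-native API suspends directly, with no intermediate future
// (hypothetical signature):
// suspend fun read(buf: ByteBuffer, offset: Long): Int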

So if you want, let me know, and I'll add you to the private repo or even open it.

GavinRay97 commented 1 year ago

But when a readFixedBuffer request completes, the kernel transfers ownership of the buffer to the application, so you can use that buffer for a subsequent write request and then return ownership to the kernel.

Ohh, I see, thank you.

This is interesting behavior. In a database buffer pool, you typically allocate a memory arena and then pass slices of it to your disk I/O functions to be read from or written into.
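Roughly like this (a hypothetical sketch: BufferPoolManager and pageSlice are made-up names, and I'm using the java.lang.foreign API from my snippet above):

import java.lang.foreign.Arena
import java.lang.foreign.MemorySegment

const val PAGE_SIZE = 4096

// One big backing allocation; disk I/O targets page-sized slices of it.
class BufferPoolManager(poolSizePages: Int) {
    private val arena = Arena.ofShared()
    private val pool: MemorySegment = arena.allocate(poolSizePages.toLong() * PAGE_SIZE)

    // Hand out a page-sized view of the shared pool memory
    fun pageSlice(frameIndex: Int): MemorySegment =
        pool.asSlice(frameIndex.toLong() * PAGE_SIZE, PAGE_SIZE.toLong())
}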

I wonder if you could skip passing a buffer to the io_uring call, and instead share the buffers from the BufferPoolManager directly with io_uring using fixed buffers πŸ€”

Also, I've noticed you use Kotlin. Jasyncfio targets Java 8+ and uses CompletableFuture. Mixing CompletableFuture with coroutines introduces unnecessary overhead into your application.

Ah, I am not very familiar with concurrent programming, and I noticed in my benchmarks that there seemed to be little benefit from the Kotlin wrappers with jasyncfio compared to blocking, sequential, single-threaded I/O.

Maybe I am writing bad code, though. I wanted to have a thread per core doing I/O, with each thread running potentially many coroutines.

// Read 200 random pages synchronously
val syncTime = measureTimeMillis {
    for (i in 0 until 200) {
        val pageId = Random.nextLong(NUM_PAGES.toLong())
        val pageBuffer = bufferPoolManager.getOrLoadPageSync(pageId)
    }
}
println("Sync time: $syncTime ms")

// Read 200 random pages asynchronously, one reader thread per core.
// coroutineScope makes measureTimeMillis wait until every launched
// reader has finished before the clock stops.
val asyncReadTime = measureTimeMillis {
    coroutineScope {
        val cpus = Runtime.getRuntime().availableProcessors()
        for (cpu in 0 until cpus) {
            launch(newSingleThreadContext("async-reader-$cpu")) {
                for (i in 0 until 200 / cpus) {
                    val pageId = Random.nextLong(NUM_PAGES.toLong())
                    val pageBuffer = bufferPoolManager.getOrLoadPageAsync(pageId)
                }
            }
        }
    }
}
println("Asynchronous time: $asyncReadTime ms")
Sync time: 13 ms
Asynchronous time: 19 ms

The only thing is that it has no Maven publication, not even to Maven local, but if you're interested I'll add one. So if you want, let me know, and I'll add you to the private repo or even open it.

I would definitely be interested! That sounds great. I can just add it into my own project source code for now, it's no big deal πŸ‘

Also, the good news is that it performs better than the CompletableFuture API and has a small benchmark harness, so you can try different io_uring configurations and get some numbers :)

πŸ™Œ πŸ₯³

ikorennoy commented 1 year ago

I wonder if you could skip passing a buffer to the io_uring call, and instead share the buffers from the BufferPoolManager directly with io_uring using fixed buffers

As far as I know, RocksDB and Scylla have adopted io_uring. I think it's possible to get some ideas from their implementations :)

Ah, I am not very familiar with concurrent programming, and I noticed in my benchmarks that there seemed to be little benefit from the Kotlin wrappers with jasyncfio compared to blocking, sequential, single-threaded I/O.

The benchmark part is very interesting. I tried to write a benchmark, and it was very similar to yours. It also showed that a simple synchronous read performs better. The gap was not significant, and it decreased as the buffer size grew.

Then I tried the fio benchmark, and it showed io_uring performing better. So I decided to port a subset of it to Kotlin and run it with my io_uring Kotlin lib; that showed the Kotlin lib is faster, and it becomes even faster as you increase the number of threads, though it is still slower than the pure C io_uring implementation. I don't know; maybe this benchmark is biased against synchronous reads and optimized for io_uring :)

I would definitely be interested! That sounds great. I can just add it into my own project source code for now, it's no big deal πŸ‘

I've opened the Kotlin lib: kuring

GavinRay97 commented 1 year ago

The benchmark part is very interesting. I tried to write a benchmark, and it was very similar to yours. It also showed that a simple synchronous read performs better. The gap was not significant, and it decreased as the buffer size grew.

Then I tried the fio benchmark, and it showed io_uring performing better. So I decided to port a subset of it to Kotlin and run it with my io_uring Kotlin lib; that showed the Kotlin lib is faster, and it becomes even faster as you increase the number of threads, though it is still slower than the pure C io_uring implementation.

This is so strange to me; maybe I will write to some people on the JVM performance team and ask them.

Do you have any ideas why this could be?

ikorennoy commented 1 year ago

This is so strange to me; maybe I will write to some people on the JVM performance team and ask them.

It is strange to me as well. If you get an answer, could you share it with me?

Do you have any ideas why this could be?

No, not really. The benchmark I ported from fio is quite simple. In the case of 1 submission I just do:

do {
    calls++
    // Submit one read at a time into a single buffer and wait for it
    val r = file.read(localBuffers[0], getOffset(maxBlocks), bufferSize)
    reaps++
    if (r != bufferSize) {
        println("Unexpected ret: $r")
    }
    localBuffers[0].clear()
    done++
} while (isRunning)

and in the case of several submissions:

withContext(this.coroutineContext) {
    val pending = arrayOfNulls<Deferred<Int>>(submitBatchSize)
    do {
        calls++
        // Submit a batch of reads without awaiting them individually
        for (i in 0 until submitBatchSize) {
            pending[i] = async {
                val r = file.read(localBuffers[i], getOffset(maxBlocks), bufferSize)
                localBuffers[i].clear()
                r
            }
        }
        // Reap the whole batch
        pending.forEach {
            val r = it?.await()
            reaps++
            if (r != bufferSize) {
                println("Unexpected ret: $r")
            }
            done++
        }
    } while (isRunning)
}

For synchronous reads I always do:

do {
    calls++
    val r = file.read(buffer, getOffset(maxBlocks))
    if (r != bufferSize) {
        println("Unexpected ret: $r")
    }
    buffer.clear()
    done++
} while (isRunning)

Then simple calculations, once per reporting tick (thisDone is the current value of the done counter, done its value from the previous tick):

// Completions since the last report
iops = thisDone - done
// Bandwidth in MiB/s, scaling IOPS by the buffer size relative to 1 MiB
bw = if (bufferSize > 1048576) {
    iops * (bufferSize / 1048576)
} else {
    iops / (1048576 / bufferSize)
}
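As a quick sanity check of that formula: with bufferSize = 4096 the divisor is 1048576 / 4096 = 256, so iops = 76544 gives bw = 76544 / 256 = 299 MiB/s, which matches the first io_uring line below.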

With a buffer size of 4096 bytes, 32 submissions, and 1 thread, io_uring shows:

IOPS=76544, BW=299MiB/s, IOS/call=32/32
IOPS=79811, BW=311MiB/s, IOS/call=32/32
IOPS=79645, BW=311MiB/s, IOS/call=31/31

Synchronous read, 4096 bytes, 1 thread:

IOPS=16794, BW=65MiB/s, IOS/call=1/0
IOPS=16739, BW=65MiB/s, IOS/call=1/0
IOPS=16872, BW=65MiB/s, IOS/call=1/0

With a buffer size of 8192 bytes, 32 submissions, and 4 threads:

IOPS=183975, BW=1437MiB/s, IOS/call=32/32
IOPS=184222, BW=1439MiB/s, IOS/call=32/32
IOPS=185061, BW=1445MiB/s, IOS/call=31/31

Synchronous read, 8192 bytes, 4 threads:

IOPS=54958, BW=429MiB/s, IOS/call=1/0
IOPS=55163, BW=430MiB/s, IOS/call=1/0
IOPS=55294, BW=431MiB/s, IOS/call=1/0

All of these tests were done without O_DIRECT. With O_DIRECT I got a performance degradation, and I'm trying to figure out how to fix it.
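One thing I still need to verify (just a guess on my part, not a confirmed cause): O_DIRECT generally requires the buffer address, file offset, and length to be aligned to the device's logical block size, so unaligned buffers can degrade or fail. With java.lang.foreign you can request aligned memory explicitly:

import java.lang.foreign.Arena

// Allocate a 4 KiB buffer aligned to a 4 KiB boundary, suitable for
// O_DIRECT on devices with a 4096-byte logical block size.
val alignedBuf = Arena.ofShared().allocate(4096L, 4096L).asByteBuffer()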

GavinRay97 commented 1 year ago

Wow, those are some seriously impressive throughput gains!

ikorennoy commented 1 year ago

If you are going to write to the JVM performance team, could you CC me? korennoy.ilya at gmail.com :)

GavinRay97 commented 1 year ago

Yeah absolutely πŸ‘