scylladb / seastar

High performance server-side application framework
http://seastar.io
Apache License 2.0

Question about io_uring usage. #1673

Open MBkkt opened 1 year ago

MBkkt commented 1 year ago

Could you explain why io_uring_peek_batch_cqe is used here instead of io_uring_for_each_cqe?

https://github.com/scylladb/seastar/blob/master/src/core/reactor_backend.cc#L1508

I understand that they do different things, but it looks like the submit part of io_uring_peek_batch_cqe is unnecessary in your usage. Maybe I'm wrong; please correct me if so. If I'm right, it looks like io_uring_for_each_cqe could be used instead and the unnecessary copy avoided. WDYT?

Code:

unsigned io_uring_peek_batch_cqe(struct io_uring *ring,
                 struct io_uring_cqe **cqes, unsigned count)
{
    unsigned ready;
    bool overflow_checked = false;
    int shift = 0;

    if (ring->flags & IORING_SETUP_CQE32)
        shift = 1;

again:
// for-each part: copy pointers to the ready CQEs into the caller's array
    ready = io_uring_cq_ready(ring);
    if (ready) {
        unsigned head = *ring->cq.khead;
        unsigned mask = *ring->cq.kring_mask;
        unsigned last;
        int i = 0;

        count = count > ready ? ready : count;
        last = head + count;
        for (;head != last; head++, i++)
            cqes[i] = &ring->cq.cqes[(head & mask) << shift];

        return count;
    }

    if (overflow_checked)
        goto done;

// submit part: if the CQ ring overflowed, enter the kernel to flush it
    if (cq_ring_needs_flush(ring)) {
        int flags = IORING_ENTER_GETEVENTS;

        if (ring->int_flags & INT_FLAG_REG_RING)
            flags |= IORING_ENTER_REGISTERED_RING;
        ____sys_io_uring_enter(ring->enter_ring_fd, 0, 0, flags, NULL);
        overflow_checked = true;
        goto again;
    }

done:
    return 0;
}
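
For comparison, a minimal sketch of the io_uring_for_each_cqe-based loop being suggested (hypothetical code, not taken from Seastar; completion dispatch and error handling elided):

#include <liburing.h>

static unsigned drain_completions(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;
    unsigned head;
    unsigned seen = 0;

    io_uring_for_each_cqe(ring, head, cqe) {
        // process cqe->user_data / cqe->res in place, no pointer copy
        seen++;
    }
    // mark all iterated CQEs as consumed with a single release-store
    io_uring_cq_advance(ring, seen);
    return seen;
}
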
MBkkt commented 1 year ago

https://github.com/scylladb/seastar/blob/57acd00402ae6a2fb4eec93e81c9ef1647cc5efe/src/core/reactor_backend.cc#L1517

Probably I'm not right; it looks like the submit part is needed here after all. But avoiding the copy and the on-stack buffer still looks possible.
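
A hedged sketch of how the overflow flush could be kept without the batch peek, assuming a liburing version that provides the public helpers io_uring_cq_has_overflow() and io_uring_get_events():

// Hypothetical: preserve the overflow-flush behaviour of
// io_uring_peek_batch_cqe while still consuming CQEs in place.
// Assumes a newer liburing that exposes these two helpers.
if (io_uring_cq_ready(ring) == 0 && io_uring_cq_has_overflow(ring)) {
    io_uring_get_events(ring);  // enter the kernel to flush overflown CQEs
}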

tchaikov commented 1 year ago

Hi @MBkkt, io_uring_for_each_cqe() does a load-acquire and a store-release. We can probably ignore this performance overhead on x86 architectures, though. Do you have any benchmark numbers, so we can get a better understanding of how this change would help performance?
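
For context, io_uring_for_each_cqe expands to roughly the following (paraphrased; the exact definition varies across liburing versions):

// Paraphrased from liburing.h; note the load-acquire of the CQ tail
// on every loop iteration.
#define io_uring_for_each_cqe(ring, head, cqe)                             \
    for (head = *(ring)->cq.khead;                                         \
         (cqe = (head != io_uring_smp_load_acquire((ring)->cq.ktail) ?     \
                 &(ring)->cq.cqes[head & *(ring)->cq.kring_mask] : NULL)); \
         head++)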

avikivity commented 1 year ago

The reason is likely that I wasn't aware of for_each_cqe.

avikivity commented 1 year ago

In fact it was only documented a few days ago: axboe/liburing@16d74b1c76043e628726bf43cd91332e262879c6.

MBkkt commented 1 year ago

I don't think using this macro directly is a good idea on non-x86: https://github.com/axboe/liburing/issues/867

I just created this issue because I was reading your code to better understand how to work with liburing, and I thought the buffer and maybe the submit looked unnecessary.

So I don't have any measurements; feel free to just close it :)

tchaikov commented 1 year ago

> I don't think using this macro directly is a good idea on non-x86: axboe/liburing#867

I disagree. We should use this macro no matter whether we are on x86 or non-x86, as the producer and the consumer could be running on different cores. The chance of a race is rare, but it exists.

> I just created this issue because I was reading your code to better understand how to work with liburing, and I thought the buffer and maybe the submit looked unnecessary.

> So I don't have any measurements; feel free to just close it :)

I see. I'd prefer to keep it open; I will try to benchmark the two versions. I believe the compiler should compile the atomic load to a plain load instruction without adding any barriers on x86, so it would be a pure win there. I will give it a try.
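
To illustrate that claim, a small standalone example (not Seastar code): on x86-64, GCC and Clang compile an acquire load to an ordinary MOV with no extra fence, because the architecture's memory model already gives loads acquire semantics.

#include <stdatomic.h>

// Compiles to a plain `mov eax, [rdi]; ret` on x86-64.
unsigned load_tail(const _Atomic unsigned *ktail)
{
    return atomic_load_explicit(ktail, memory_order_acquire);
}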

MBkkt commented 1 year ago

> I disagree. We should use this macro no matter whether we are on x86 or non-x86, as the producer and the consumer could be running on different cores. The chance of a race is rare, but it exists.

I wasn't suggesting a non-atomic load instead of an atomic one; I was just pointing out that in the macro the load-acquire is done on every iteration.
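
A hypothetical variant of that idea: load the tail once with acquire semantics, then iterate with plain reads. It touches the same liburing internals that io_uring_peek_batch_cqe itself uses, and ignores IORING_SETUP_CQE32 for brevity:

#include <liburing.h>

static unsigned drain_once(struct io_uring *ring)
{
    unsigned head = *ring->cq.khead;          // consumer-owned, plain load
    unsigned ready = io_uring_cq_ready(ring); // one acquire load of ktail
    unsigned mask = *ring->cq.kring_mask;

    for (unsigned i = 0; i < ready; i++, head++) {
        struct io_uring_cqe *cqe = &ring->cq.cqes[head & mask];
        (void)cqe;  // process the CQE in place
    }
    io_uring_cq_advance(ring, ready);         // single release-store
    return ready;
}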