tokio-rs / tokio-uring

An io_uring backed runtime for Rust

rt: Hang on too small completion queue #147

Closed: ollie-etl closed this issue 1 year ago

ollie-etl commented 2 years ago

Similar to #145, and again exercised by #144. If the size of the completion queue is smaller than the number of concurrent writers to the submission queue, the runtime just hangs.

Probably not committing?

FrankReh commented 2 years ago

Yes, the driver may be making overly strict assumptions about when to submit and when to check the cqueue.

The driver isn't flushing the squeue and handling the cqueue as often as it could, and it could use the overflow bit to take an extra loop before getting back to the caller.

And also, like another issue and commit found and handled, it is not enough to submit only when the thread is going idle; there will be times we want submit and tick to run without going idle first. Whether we wait for the squeue to be full or not is debatable. The squeue drains completely with any call to submit, so it's probably okay to wait for the squeue to become full, but the app decides the size of the squeue, and if it is made too large, it can starve the kernel of work it could already be processing. Some day I want to ask the uring author what they think the right sizing is, but maybe this crate's benchmark abilities will already have given us some good ideas.

I believe my earlier thinking, and my pseudo fix, was this: any time the squeue is found full, we should run a tick first, but we don't want to recurse and call a tick again if a tick completion handler tried to make its own submission entry and was thwarted by finding the squeue full too. We also don't want to force the kernel into overflow mode if we can help it, because that creates a slower path and burns cycles that can sometimes be avoided by checking the cqueue periodically.

And there is also the no-syscall mode, which the driver doesn't have an option for yet, and which would let the two sides run in parallel even more efficiently.

Great ways are being found to stress the system reliably. That will make it easier to push performance and to detect regressions when they happen.
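
To make that concrete, here is a minimal sketch of the tick-before-submit idea with a recursion guard. The Driver type and its fields are invented for illustration and are not tokio-uring's real internals; only the io-uring crate calls (submission(), completion(), submit(), push()) are real API.

// Illustrative only: a stand-in driver showing the control flow, not tokio-uring's
// real types.
struct Driver {
    ring: io_uring::IoUring,
    // Guards against re-entering tick() if a completion handler (in a real
    // driver) tries to push a new entry while the squeue is still full.
    in_tick: bool,
}

impl Driver {
    // Drain whatever is currently visible in the completion queue.
    fn tick(&mut self) {
        for cqe in self.ring.completion() {
            // ... in the real driver, wake the op identified by cqe.user_data() ...
            let _ = cqe.user_data();
        }
    }

    // Push an entry, ticking and flushing first if the squeue is found full.
    fn push(&mut self, entry: io_uring::squeue::Entry) -> std::io::Result<()> {
        if self.ring.submission().is_full() {
            if !self.in_tick {
                self.in_tick = true;
                self.tick(); // dispatch completions before forcing a flush
                self.in_tick = false;
            }
            // Hand the full squeue to the kernel so it can start working on it.
            self.ring.submit()?;
        }
        unsafe { self.ring.submission().push(&entry) }
            .map_err(|_| std::io::Error::new(std::io::ErrorKind::Other, "squeue still full"))
    }
}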

Noah-Kennedy commented 2 years ago

I think that we should probably take an approach here similar to what tokio does when it calls epoll_wait. This would mean getting a hook into tokio that lets us run a callback to flush the queue and dispatch completions during the maintenance loop. Tokio's maintenance loop runs every set number of scheduler ticks (a scheduler tick is basically the polling of a task). This would give us a good way to periodically flush the queue.

FrankReh commented 2 years ago

@Noah-Kennedy Are you saying there is already a hook in tokio we can use, or that you're going to try and get one added?

If that is new, perhaps before that lands, we can implement our own maintenance countdown in the op creation to trigger the same uring maintenance.
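
For illustration only, one way such a countdown could look; every name and the threshold here are invented rather than taken from the crate:

// Purely illustrative: a countdown the driver could keep so that every Nth op
// creation forces a squeue flush and a cqueue drain, instead of waiting for the
// thread to go idle.
const MAINTENANCE_INTERVAL: u32 = 16;

struct MaintenanceCounter {
    ops_since_tick: u32,
}

impl MaintenanceCounter {
    // Called from op creation; returns true when the driver should submit and tick.
    fn on_op_created(&mut self) -> bool {
        self.ops_since_tick += 1;
        if self.ops_since_tick >= MAINTENANCE_INTERVAL {
            self.ops_since_tick = 0;
            true
        } else {
            false
        }
    }
}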

Noah-Kennedy commented 2 years ago

I'm going to try to get one added. In the meantime, I'm wondering if it is best to address this in the op creation or to do a "duct tape and zip ties" fix by having an always-ready task that no-ops and yields back most of the time, like this (pardon my pseudocode):

async fn maintenance_hack() {
    loop {
        for _ in 0..16 {
            // yield back to the runtime 16 times
            tokio::task::yield_now().await;
        }

        // flush the queue and dispatch completions.
        // I'm acting like the thread locals are statics here for simplicity's sake.
        DRIVER.tick();
    }
}

This will basically run once every 16 full passes through the scheduler.

Noah-Kennedy commented 2 years ago

Crap, that won't work. It will cause us to busy loop instead of blocking on epoll_wait.

Noah-Kennedy commented 2 years ago

Yeah, we should probably try and do this like @FrankReh was suggesting for the time being.

Noah-Kennedy commented 2 years ago

And long-term move to a hook in tokio.

Noah-Kennedy commented 2 years ago

The more I think about it, this probably can't be solved in op submission without removing the ability to batch SQE submissions. I'm wondering if it might make sense to do something like my suggestion, but with added logic so the task only runs when there are unflushed submission queue entries.

Noah-Kennedy commented 2 years ago

Ah, I was thinking of a different issue here. Disregard my earlier comments; they are important to do, but they won't fix this issue. We should definitely dispatch CQEs first any time we submit SQEs.
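
A hedged sketch of that ordering, written against the io-uring crate directly rather than the driver's real internals:

// Illustrative only: drain and dispatch whatever is in the cq both before and
// after handing sqes to the kernel, so completions are never left sitting there
// across a submit.
fn submit_then_dispatch(ring: &mut io_uring::IoUring) -> std::io::Result<usize> {
    for cqe in ring.completion() {
        let _ = cqe.user_data(); // ... dispatch to the waiting op ...
    }
    let submitted = ring.submit()?;
    for cqe in ring.completion() {
        let _ = cqe.user_data(); // ... some ops may have completed inline ...
    }
    Ok(submitted)
}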

Noah-Kennedy commented 2 years ago

We should also be using NODROP, but I don't think that is exposed in tokio-rs/io-uring. We would need to add that flag first and get a release out there.

Noah-Kennedy commented 2 years ago

I forgot that NODROP is a feature, not a flag...

Noah-Kennedy commented 2 years ago

Does this occur with the NODROP feature enabled?

FrankReh commented 2 years ago

I've got buffer ring groups working in a fork and streaming multi-accept working in a fork. I think I'd like to get those in before we try to address how to handle a flood of submissions or a flood of responses, because with those in it is easier to create the floods, and then we can see all the reasons for handling them more easily (I think).

But my streaming multi-accept has a cancel ability in it that may not fit with your plans for cancel, but I'd like it to get a little more light because it has a nice aspect or two.

FrankReh commented 2 years ago

Does this occur with the NODROP feature enabled?

NODROP just means the kernel will queue things up using extra kernel memory, waiting for the cq to be drained by us. So it doesn't help if our problem is that the kernel isn't seeing our sq entries, or that we aren't pulling from the cq because we thought we had drained it when we actually needed to ask the kernel for more. That's what checking the OVERFLOW bit is for.
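
To illustrate the difference, a sketch against the io-uring crate's public API; it assumes the SubmissionQueue::cq_overflow() accessor is available and that a submit_and_wait(1) is an acceptable way to force a GETEVENTS enter. This is not the driver's actual code.

// Sketch only: before concluding the cq is empty, check the overflow flag the
// kernel sets in the sq flags. If it is set, enter the kernel with GETEVENTS
// (here via submit_and_wait) so the overflowed cqes are copied back into the cq
// where we can drain them.
fn drain(ring: &mut io_uring::IoUring) -> std::io::Result<()> {
    if ring.submission().cq_overflow() {
        // The overflow flag means completions exist, so waiting for one should
        // return promptly once the kernel has flushed the backlog into the cq.
        ring.submitter().submit_and_wait(1)?;
    }
    for cqe in ring.completion() {
        // ... dispatch cqe to the op identified by cqe.user_data() ...
        let _ = cqe.user_data();
    }
    Ok(())
}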

Noah-Kennedy commented 2 years ago

Ah, I see the issue. I didn't realize that the problem was that we had extra completions we didn't notice due to overflow. Sorry for the confusion!

FrankReh commented 2 years ago

Once I started handling the OVERFLOW bit, my test hangs went away.

My streaming fork is waiting for the slab linked list work by @ollie-etl to be merged. I think that PR was left in his court but I'll double check...

Yes, I did a review for him yesterday and he'll also need to resolve conflicts with the new master.

FrankReh commented 2 years ago

It's not a blocker, but I want to get that writev PR committed, as the author put good work into it and we should be able to clear it off our table with no other dependencies.

Noah-Kennedy commented 2 years ago

I understand your earlier comments now. Once we get through the current PR queue, we can get this fix in and do a release.

FrankReh commented 2 years ago

How about a release 4.0 tomorrow or Friday, regardless of what else is done? None of our planned work is a breaking change, so maybe a 4.1 about a week later? I'll work on the changelog too. What's the best way to share the changelog with you? Just a regular PR?

FrankReh commented 2 years ago

I meant 0.4.0 and 0.4.1.

Noah-Kennedy commented 2 years ago

@FrankReh we should have done this a while ago, TBH, but we need a CHANGELOG.md file where we can document changes. We will format it the same way tokio does, and we can use it to add release notes for the release tags.
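
For example, an entry in roughly tokio's changelog style might look like this (illustrative wording only, not the actual release notes):

# 0.4.0 (unreleased)

### Fixed

- rt: handle the cq-overflow condition so the runtime no longer hangs when the
  completion queue is smaller than the number of concurrent submitters (#147)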

Noah-Kennedy commented 2 years ago

We should probably list this as a "known issue"

FrankReh commented 2 years ago

We should probably list this as a "known issue"

But it's good to see if it can be addressed while the OP has cycles and the ability to easily reproduce from their end. With your PR in, let's see if @ollie-etl can reproduce the problem.

I can make it my priority to get something in that handles OVERFLOW, so @ollie-etl can then check it out. I see no reason this can't be solved relatively easily, so let's see about getting this closed quickly.

FrankReh commented 2 years ago

I could create a self-contained minimal reproducer that hangs at exactly the size of the completion queue, just as the title predicts.
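
For anyone following along, a sketch of what such a reproducer could look like; the builder() and entries() calls are assumptions about the crate's public surface rather than the exact test, and the numbers are arbitrary:

// Sketch of such a reproducer, not the actual test. The point is simply to keep
// more writes in flight than the completion queue has slots.
use std::rc::Rc;

fn main() {
    tokio_uring::builder()
        .entries(8) // deliberately tiny ring (assumed knob)
        .start(async {
            let file = Rc::new(
                tokio_uring::fs::File::create("/tmp/cq_overflow_repro")
                    .await
                    .unwrap(),
            );

            // Far more concurrent writers than the cq can hold at once.
            let handles: Vec<_> = (0..64u64)
                .map(|i| {
                    let file = Rc::clone(&file);
                    tokio_uring::spawn(async move {
                        let buf = vec![0u8; 4096];
                        let (res, _buf) = file.write_at(buf, i * 4096).await;
                        res.unwrap();
                    })
                })
                .collect();

            for h in handles {
                h.await.unwrap();
            }

            // On an unpatched runtime this line is never reached once the cq overflows.
            println!("all writes completed");
        });
}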

oliverbunting commented 1 year ago

My, you guys have been busy! I'll try and take a look at these this morning

FrankReh commented 1 year ago

I have a solution that solves it for my test case. It just involves a change to the io-uring crate. The tokio-uring crate can be left at master.

@Noah-Kennedy , @ollie-etl , would you like to change your Cargo.toml to use my fork/branch and report whether it fixes the problem you are seeing too?

io-uring = { git = "https://github.com/frankreh/io-uring", branch = "frankreh/cqueue-overflow", features = ["unstable"] }

Noah-Kennedy commented 1 year ago

@FrankReh what did you change?

FrankReh commented 1 year ago

I changed the overflow bit handling in submit. I think I sent the idea out last night in one of these other issues.

Noah-Kennedy commented 1 year ago

How did you change it though, and do you think that this would be appropriate for a PR in io-uring?

FrankReh commented 1 year ago

Yes, I think it's appropriate. Just easier if someone else can vouch for it making a difference too.

FrankReh commented 1 year ago

How did you change it though, and do you think that this would be appropriate for a PR in io-uring?

I don't understand where this question is coming from. Isn't it easier for you to look at the diff of the commit I put into that branch?

Noah-Kennedy commented 1 year ago

Sorry, I was on mobile earlier and not able to easily see it; I should have mentioned that. Looking at your change right now, it looks fine.

FrankReh commented 1 year ago

This should be solved once https://github.com/tokio-rs/io-uring/pull/152 or its equivalent is merged and this crate's version of io-uring is appropriately bumped.

Noah-Kennedy commented 1 year ago

Does that change alone fix this, or is #152 also needed?

Noah-Kennedy commented 1 year ago

Nevermind, I just saw the PR in here.