joshkan / nvme-uring-pt


blk_rq_unmap_user improperly called from irq context #4

Open clay-mayers opened 2 years ago

clay-mayers commented 2 years ago

I believe there is an issue in the v6 0010 patch with nvme_end_async_pt(). That's the request's completion handler and is called in interrupt context, but it calls blk_rq_unmap_user(), which requires the process context that was current when blk_rq_map_user() was called.

At least, that's according to the description for blk_rq_map_user_iov(), which does the heavy lifting for blk_rq_map_user().
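
For context, here's a minimal sketch of the pattern being described, with placeholder names rather than the actual patch code (signatures follow roughly the 5.x block layer and vary by kernel version):

```c
#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/* Completion handler: typically invoked from the NVMe IRQ/completion path. */
static void pt_end_io(struct request *rq, blk_status_t err)
{
	/*
	 * Problem: blk_rq_unmap_user() expects the process context that
	 * called blk_rq_map_user(); here we may be in interrupt context.
	 */
	blk_rq_unmap_user(rq->bio);
	blk_mq_free_request(rq);
}

/* Submission path: runs in the submitting process's context. */
static int submit_pt_cmd(struct request_queue *q, struct request *rq,
			 void __user *ubuf, unsigned long len)
{
	int ret;

	/* Pins/copies the user buffer relative to the current task. */
	ret = blk_rq_map_user(q, rq, NULL, ubuf, len, GFP_KERNEL);
	if (ret)
		return ret;

	rq->end_io = pt_end_io;
	blk_execute_rq_nowait(rq, false);	/* async execution; arguments vary by version */
	return 0;
}
```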

joshkan commented 2 years ago

Yes, that's an issue. In the internal version (which is not out yet), I moved the entire completion processing to a task-work callback, so nothing happens in interrupt context anymore. Thanks for highlighting this issue.
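
For what it's worth, a rough sketch of that direction, with an illustrative (hypothetical) pdu layout; the io_uring_cmd helper signatures have shifted between kernel versions. The IRQ-side handler only schedules task work, and the unmap, request free, and completion happen in the submitter's task context:

```c
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/io_uring.h>

/* Hypothetical per-command data stashed in the uring_cmd at submission. */
struct pt_cmd_pdu {
	struct request *req;
	int status;
};

/* Runs later in the submitter's task context via io_uring task work. */
static void nvme_pt_task_cb(struct io_uring_cmd *ioucmd)
{
	struct pt_cmd_pdu *pdu = (struct pt_cmd_pdu *)ioucmd->pdu;

	blk_rq_unmap_user(pdu->req->bio);	/* safe: back in process context */
	blk_mq_free_request(pdu->req);
	io_uring_cmd_done(ioucmd, pdu->status, 0);	/* arguments vary by version */
}

/* Runs in IRQ context when the NVMe command completes. */
static void nvme_end_async_pt(struct request *req, blk_status_t err)
{
	struct io_uring_cmd *ioucmd = req->end_io_data;
	struct pt_cmd_pdu *pdu = (struct pt_cmd_pdu *)ioucmd->pdu;

	pdu->status = err ? -EIO : 0;
	/* Do nothing heavy here; defer unmap/free/completion to task work. */
	io_uring_cmd_complete_in_task(ioucmd, nvme_pt_task_cb);
}
```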

clay-mayers commented 2 years ago

We weren't using a task; we mapped everything so it could be completed without one, but that one function was a problem. We're still on 5.10, but I patched in this part of your io_uring to switch to this scheme. For what it's worth, it flipped the completion order of some of our pass-thru I/O: completions come out of the device CQ as A/B but out of the io_uring CQ as B/A. It likely doesn't matter, but it's disturbing nonetheless.

clay-mayers commented 2 years ago

We've bumped into a small issue with using the submitter's task for completion: it means the submitter's thread has to live as long as the I/O does. Our use case is unusual, so this likely won't be an issue for most users, but it does mean we can't use this patch as-is with our user code. I added code to capture the current task's thread-group leader (the tgid task) when the req is constructed and use that as a fallback. I don't think that's a good solution in general. We're still working out the best path forward.
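
Roughly what I mean, as a simplified sketch of our 5.10 hack (names and structure here are ours, not the patch's, and as I said I don't think it's a good general solution): pin the thread-group leader alongside the submitter when the request is built, and fall back to it if queuing task work to the submitter fails.

```c
#include <linux/sched.h>
#include <linux/sched/task.h>
#include <linux/task_work.h>

struct pt_completion {
	struct callback_head work;	/* task-work callback head */
	struct task_struct *submitter;	/* thread that issued the I/O */
	struct task_struct *leader;	/* thread-group leader, used as a fallback */
};

/* At request construction time, in the submitter's context. */
static void pt_completion_init(struct pt_completion *pc, task_work_func_t fn)
{
	init_task_work(&pc->work, fn);
	pc->submitter = get_task_struct(current);
	pc->leader = get_task_struct(current->group_leader);
}

/* At completion time: try the submitter first, then the group leader. */
static int pt_completion_queue(struct pt_completion *pc)
{
	if (!task_work_add(pc->submitter, &pc->work, TWA_SIGNAL))
		return 0;
	/* Submitter can no longer run task work; try the thread-group leader. */
	return task_work_add(pc->leader, &pc->work, TWA_SIGNAL);
}

/* The callback itself must drop both references with put_task_struct(). */
```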

joshkan commented 2 years ago

I need a bit more clarity on the environment. Could you share the snippet of code (the part that captures the current tgid task)? The task is pinned when we schedule work in the context of the submitter thread (inside the io_uring helper), which sounds similar to what you did. I assume your software stack is a Linux kernel that is very close to upstream.

clay-mayers commented 2 years ago

We are still on 5.10. I can see now that 5.17's io_req_task_work_add() has a fallback when task_work_add() fails, so I'll assume it'll work - sorry, and thanks for asking for clarity. We need to update so we can validate your patch instead of my hacking it into 5.10.

clay-mayers commented 2 years ago

We continue to have issues with using task work, and I think you will as well. Once PF_EXITING is set in task->flags, task work won't be run. Because blk_mq_free_request() is called by nvme_pt_task_cb(), when a process crashes with an outstanding I/O request, the nvme request is leaked along with its reference to the controller queue. The next enumeration of namespaces then locks up.

In general, I don't think it's a good idea to create/use io_uring_cmd_complete_in_task() as a wrapper around io_req_task_work_add(). I think it would be better to get access to the io_uring's task and call task_work_add() directly. The semantics of io_req_task_work_add() have little to do with the needs of an arbitrary driver's io_uring-cmd handler. More concretely, a driver can then get an error return and implement its own fallback.
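
Something along these lines is what I have in mind. This is only a sketch with made-up names ('pt_cmd' and friends are illustrative, not real driver code), and how the driver would obtain the ring's task is exactly the open part:

```c
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/sched/task.h>
#include <linux/task_work.h>
#include <linux/workqueue.h>

struct pt_cmd {
	struct request *req;
	struct task_struct *task;	/* submitter (or ring task), pinned at submission */
	struct callback_head twork;	/* task-work path */
	struct work_struct fallback;	/* workqueue path when the task is gone */
};

static void pt_complete_in_task(struct callback_head *cb)
{
	struct pt_cmd *cmd = container_of(cb, struct pt_cmd, twork);

	blk_rq_unmap_user(cmd->req->bio);	/* OK: original process context */
	blk_mq_free_request(cmd->req);
	put_task_struct(cmd->task);
}

static void pt_complete_fallback(struct work_struct *work)
{
	struct pt_cmd *cmd = container_of(work, struct pt_cmd, fallback);

	/*
	 * The task is gone. At minimum free the request so the queue
	 * reference isn't leaked; what to do about the user mapping here
	 * is exactly the blk_rq_xxx_user() problem described below.
	 */
	blk_mq_free_request(cmd->req);
	put_task_struct(cmd->task);
}

/* Called from the IRQ-side completion; decides where the real work runs. */
static void pt_dispatch_completion(struct pt_cmd *cmd)
{
	init_task_work(&cmd->twork, pt_complete_in_task);
	/* task_work_add() errors out once the task can no longer run task work. */
	if (task_work_add(cmd->task, &cmd->twork, TWA_SIGNAL)) {
		INIT_WORK(&cmd->fallback, pt_complete_fallback);
		queue_work(system_wq, &cmd->fallback);
	}
}
```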

The problem remains, however, that task_work_add() can fail and blk_rq_unmap_user() requires the original process context. The underlying problem, I think, is with the blk_rq_xxx_user() functions. I'm still researching, but as best I can tell there is a need for async versions that work without the original process context (at least for the cleanup functions).

joshkan commented 2 years ago

Thanks for sharing this. I do need to check and test that part; I have not added process crashes to the testing workflow so far. I've added this to my to-be-checked list and will share the findings.

clay-mayers commented 2 years ago

Looking at the latest on git://git.kernel.dk/linux-block, the callback will still be made, but from a scheduled workqueue. And it seems all it takes is for the thread that issued the I/O to exit, not the whole process. That means blk_rq_unmap_user() won't be called in the process's context, but I believe it can recognize that it's running on a workqueue thread. That works for a dead process, but I think it's unreasonable for a thread exit, or at a minimum needs to be noted. Thanks for offering to share your results.

joshkan commented 2 years ago

May I know how you reproduce the issue?

clay-mayers commented 2 years ago

We are still back on 5.10, where io_uring task work behaves differently than in 5.18-nvme - so how to create the issue is the same, but the symptoms will be different. My apologies for not yet testing on something close to what you have.

We have a thread pool for completion. With our command set there are conditions where I/O results are processed only in a completion thread, and the original thread that issued the I/O doesn't have to be around. When a multi-threaded fio job completes, the cleanup is done by the last thread, which hangs waiting for an I/O to complete that was issued by an already-exited thread.

The other condition is simply any test tool that crashes while lots of I/O is in flight. Part of our testing deletes old namespaces and recreates them; this hangs because the nvme completion routine is never called, so the request is never freed and a reference to the device queues is leaked.

Again, this is with a patched-up version of 5.10 where the in-task callback isn't called. In 5.18-nvme, the in-task callback will be called, but it won't be in-task: it will run on a system workqueue, not in the task or the process. I think blk_rq_unmap_user() can detect this and not crash, but read I/O with metadata can't copy the results back in the first case, and in either case it shouldn't try.