Possible bug with cancelling file operations

attila0x2A commented 6 years ago

While handling cancel for file_read_operation and file_write_operation we should make sure that Overlapped will not get released before the I/O operation completes. Here is a good explanation of why it is important.

So in addition to calling CancelIoEx it is probably required to either call GetOverlappedResult with bWait=TRUE or add an event to the Overlapped and wait on it. I'm not completely sure if it's best to use GetOverlappedResult in this case because of the warning from the documentation

If the hEvent member of the OVERLAPPED structure is NULL, the system uses the state of the hFile handle to signal when the operation has been completed. Use of file, named pipe, or communications-device handles for this purpose is discouraged. It is safer to use an event object because of the confusion that can occur when multiple simultaneous overlapped operations are performed on the same file, named pipe, or communications device. In this situation, there is no way to know which operation caused the object's state to be signaled.

Sorry if I missed the mechanism of how you're making sure that Overlapped structure does not get released before I/O completes.

lewissbaker commented 6 years ago

The main mechanism that is used to ensure the OVERLAPPED structure does not get released is by storing the OVERLAPPED structure as a member of the file_read_operation and file_write_operation awaitable objects.

The I/O operation is not started until the operation object is awaited. It is the responsibility of the awaiting coroutine to ensure the operation object being awaited lives at least until the operation completes. Usually the operation object is stored as a local variable or temporary within the coroutine frame and so you get the correct lifetime by default.

If the operation does not complete synchronously then the awaiting coroutine is suspended until the operation completes. All local variables in the coroutine frame are kept alive until the coroutine is resumed. The awaiting coroutine is only resumed when the I/O completion event for that operation is posted to the I/O completion port and is subsequently processed by an I/O thread.

If an operation is executed with an associated cancellation_token then the operation object will subscribe to cancellation notification by constructing a cancellation_registration object as a member of the operation object. This cancellation callback is unsubscribed when the operation completes. When a thread calls request_cancellation() on a corresponding cancellation_source then this thread will execute the cancellation callback for the operation which will then call CancelIoEx. If the OS responds to this request for cancellation before the operation completes then an I/O completion event will be posted to the I/O completion port with an error code that indicates the operation was cancelled. Otherwise, if the operation completes before the cancellation request can be processed by the OS then the operation will complete normally.

There is a potential race here between the operation completing on one thread and cancellation being requested concurrently on another thread. We don't want the cancellation callback to pass a destructed OVERLAPPED structure in a call to CancelIoEx(). To guarantee this, we ensure that the cancellation callback is unsubscribed before returning control to the resumed coroutine which might subsequently destroy the operation object (and thus the OVERLAPPED structure). In the case that the cancellation callback is already executing on another thread, the unsubscribe process will block until after the callback has finished executing. This should ensure that the OVERLAPPED structure should remain valid until after the CancelIoEx() call returns.

I hope that helps.

@Uran198 Is there any behaviour you are seeing that leads you to believe that there are some potential lifetime issues in the code?

I have seen some tests fail while performing file I/O, but only on MSVC x86 optimised builds - so I've been putting those failures down to bad optimised codegen by MSVC for now.

attila0x2A commented 6 years ago

I have not seen any strange behavior, I was just looking through the code and this struck me as strange.

If the OS responds to this request for cancellation before the operation completes then an I/O completion event will be posted to the I/O completion port with an error code that indicates the operation was cancelled. Otherwise, if the operation completes before the cancellation request can be processed by the OS then the operation will complete normally.

If I understand correctly how asynchronous operations work, then after returning from CancelIoEx you do not have a guarantee that operations got canceled and overlapped structure is no longer in use. So you can get into a scenario where you call CancelIoEx, then you destruct file_read_operation object, and only now OS completes the cancellation of the I/O marking it as canceled. The issue here is that OS has a reference to the overlapped, so it will try writing to the memory which has already been released.

Based on the article I sent earlier can't you run into the following situation?

file_read_operation op()
co_await op              | -> | I/O begins
--                       | -- | --
op.cancel() {
    CancelIoEx           | -> | I/O cancellation submitted to device driver
    return
}
<op is destroyed>
                         |    | Device driver was busy reading from the hard drive
                                Device driver receives the cancellation
                                Device driver abandons the rest of the read operation
                                Device driver reports that I/O has been canceled
                                I/O subsystem writes STATUS_CANCELED to OVERLAPPED structure
                                I/O subsystem queues the completion function (if applicable)
                                I/O subsystem signals the completion event (if applicable)
                                I/O operation is now complete

Also the documentation mentions that the overlapped structure is no longer in use only after the operation completes, not after the call to CancelIo or CancelIoEx

When canceling asynchronous I/O, do not reuse overlapped structures with targeted cancellation. Once the I/O operation completes (either successfully or with a canceled status) then the overlapped structure is no longer in use by the system and can be reused.

lewissbaker commented 6 years ago

The cancel() method on the operation class is private and is only called by the internals of the win32_overlapped_operation_cancellable<OPERATION> base-class in response to a request for cancellation via the cancellation_token. API users cannot call .cancel() on the operation object directly - this prevents .cancel() from being called on a potentially destructed object.

The op object in your example will not be destructed until the after the co_await op statement completes. This will only happen once the I/O completion event for that operation is posted to the I/O completion port and has been processed by some thread. i.e. after the "I/O subsystem signals the completion event" line in your timeline.

The combination of the mechanics of win32_overlapped_operation_cancellable ensuring that cancel() is only called while the operation object is still alive and the co_await op mechanics ensuring that the awaiting coroutine is not resumed until an I/O completion event is received and this in-turn ensuring that the operation object being awaited is not destroyed until after the statement containing the co_await op expression completes should mean that the OVERLAPPED structure lives until after the I/O operation completes.

lewissbaker commented 6 years ago

@Uran198 Let me know if you still have concerns.

attila0x2A commented 6 years ago

@lewissbaker Thank you for the explanation! I'm sorry I got confused and assumed that when you request cancellation with cancel_source it cancels right away.

But now, I cannot figure out what is the correct way to cancel and release all the resources?

For example, if I create several tasks that do I/O, pass cancellation_token to them and join them using when_all with a task that calls io_service::process_events. Then, from another thread, I want to stop all the processing and release all the resources taken by coroutines. To do this, I should call cancellation_source::request_cancellation, wait while all of them are actually canceled and released by io_service, and then stop io_service, right? So is there a way that I can tell by using the library when all the cancel events were processed and it's safe to stop io_service?

Another example is if I want to process a lot of I/O requests and then terminate. So again I create tasks and join then with when_all with a task that calls io_service::process_events. Is there a way I can tell that all the I/O operations are finished and I can destroy io_service?

lewissbaker commented 6 years ago

You can use a combination of sync_wait, when_all and on_scope_exit to hook up running the event loop to run until a task completes. For example, see https://github.com/lewissbaker/cppcoro/blob/862e99d8173ecb62a39abfd2b543de66bad98adc/test/io_service_tests.cpp#L129-L139

This top-level task can in turn run as many I/O operations as you like. The task will complete only once all I/O operations have completed. As operations are not actually started until you await them, you can guarantee that a task will only complete once all operations it is composed of have completed.

See also #30 for wrapping this boiler-plate up in a helper function. (Edit: Fixed issue number)

attila0x2A commented 6 years ago

Thanks a lot @lewissbaker for helping me better understand a library and how to work with it!

You probably wanted to mention this issue https://github.com/lewissbaker/cppcoro/issues/30.

lewissbaker / cppcoro

Possible bug with cancelling file operations #68