[Loom] Interrupt behaviour differs from the system implementation

kohlschutter / junixsocket

Unix Domain Sockets in Java 7 and newer (AF_UNIX), AF_TIPC, AF_VSOCK, and more

Apache License 2.0

433 stars 114 forks source link

[Loom] Interrupt behaviour differs from the system implementation #158

Closed cenodis closed 2 months ago

cenodis commented 5 months ago

Split off from #157

The current 2.10.0 snapshot has added support for virtual thread I/O. As part of this change the Sockets and SocketChannels have gained the ability to be interrupted to support task cancellation.

While junixsocket does properly respond to an interrupt by throwing an exception, the type of the exception thrown as well as the state of the socket and thread after an interruption do not match the default implementation in the JDK.

Short overview

This table shows the state of the program immediately after the virtual I/O thread has received an interrupt and returned to the calling code:

	Java		Junixsocket
	Socket	SocketChannel	Socket	SocketChannel
Exception thrown	`SocketException`	`ClosedByInterruptException`	`InterruptedIOException`	`InterruptedIOException`
Thread interrupt flag	set	set	unset	unset
Socket closed?	closed	closed	open	open

Details

Socket

While the current Socket implementation is valid according to the specification I think there is value in matching the behaviour of the default implementation as close as possible to minimize potential incompatibilities. This is especially relevant when using junixsocket to "fake" a normal IP socket and passing it into a library that normally does not support domain sockets.

SocketChannel

java.nio.channels.SocketChannel on the other hand does document that it must throw a ClosedByInterrupt exception, interrupt the current thread and close the channel if it is interrupted during an I/O operation. So the current implementation is not strictly compliant with the specification.

Open Socket and InterruptedIOException

The exception in combination with the still open socket may also present a rather unique issue.

The problem with leaving the socket open after an interrupt is that it's not obvious if the I/O operation has completed fully, failed or only partially completed. This leaves the socket in an inconsistent state in which the caller does not necessarily know which bytes should be processed next to ensure a well formed message is created. If it is a read operation then certain bytes may also have been read from the underlying socket but not arrived at the caller due to the exception, requiring the socket to perform otherwise unnecessary buffering to ensure no information is lost.

It is theoretically possible to pass the necessary information to the caller via InterruptedIOException and its bytesTransferred field which according to its documentation:

Reports how many bytes had been transferred as part of the I/ O operation before it was interrupted.

I would like to advocate against using this exception however.

Currently junixsocket does not set this field at all, meaning it is always 0. I am not sure this is correct as to my understanding it's possible for the async runtime to submit the operation to the OS before the interrupt exception is thrown.

Throwing InterruptedIOException along with leaving the socket open could result in code attempting to handle this by rewinding its buffer and playing the "missing" bytes back. But if the number of missing bytes is incorrect this will instead corrupt the socket stream.

This entire "recovery" process feels very brittle and easy to misuse even if the byte count is correct. It may be even further complicated if there are multiple layers of buffering which would each need to rewind. Closing the socket implicitly on an interrupt avoids this scenario altogether since no replay can be attempted.

Considering how easy it is to implement custom interrupt behaviour with virtual threads and the fact that the JDK has also chosen this approach, I do not see much benefit to support these kind of half-writes.

kohlschuetter commented 5 months ago

Thanks for reporting, @cenodis!

Please verify against the current snapshot (changes referenced above).

If there are still discrepancies between junixsocket and vanilla Java, please provide some code/unit test to demonstrate, this will be a tremendous help.

cenodis commented 5 months ago

The current implementation still throws unusual exceptions and keeps the socket open under some conditions.

I have taken a stab at writing some unit tests for interrupt behaviour. The tests and the results on my linux machine can be found here: InterruptTest.java. They are not very pretty but should cover the interrupt behaviour for all blocking methods (accept, connect, read, write) on the standard Unix socket.

Assuming the test code is free of bugs, here are some observations I have made based on the results:

Once a connection is established AFSocketChannel always reports itself as open. Even if a ClosedByInterruptException was thrown. Looking a bit into the code, AFSocketChannel (and its subclasses) do not appear to make use of the begin() and end(boolean) methods defined in AbstractInterruptibleChannel, breaking its API contract:

A concrete channel class must invoke the begin and end methods before and after, respectively, invoking an I/ O operation that might block indefinitely.

These methods appear to be responsible for setting the private closed field and translating the interrupt or close to the appropriate exception.
The results of some test seem to depend on when the interrupt happens relative to the start of the blocking operation. Some fail when the interrupt happens immediately but succeed if the interrupt happens some time after the operation has already started. I suspect this might be caused by some methods performing an early check of the interrupt flag to avoid performing unnecessary, expensive operations. Those checks then fail to properly update the state of the socket resulting in these discrepancies. To try and cover these variations every test is excuted twice: Once with an immediate interrupt and once with an interrupt after a delay of 500ms.

The tests currently only cover the basic Unix domain socket. I unfortunately lack the means to test more OS specific socket types. But it should be fairly straightforward to extend the test with those if desired.

cenodis commented 5 months ago

Edit: Disregard this. The interrupt flag does need to be set. I accidentally cleared it during my tests.

~~I have to make a small correction: When I originally created the overview of the Threads interrupt status I interpreted this portion of the javadoc:~~

~~ClosedByInterruptException – If another thread interrupts the current thread while the connect operation is in progress, thereby closing the channel and setting the current thread's interrupt status~~

as meaning that the Threads interrupt flag is set when a ClosedByInterruptException is thrown. In practice this does not appear to be the case. Instead the javadoc describes the state of the Thread before an exception is thrown, with the exception clearing the flag.

~~I have updated both the table above and the InterruptTest to reflect this.~~

kohlschuetter commented 2 months ago

@cenodis Thanks for providing these additional details! I've added some more changes to make your tests pass, apart from the aforementioned interrupt flag state — and ClosedChannelException vs. ClosedByInterruptException: ClosedByInterruptException actually is a subclass of ClosedChannelException.

Can you please try the latest code (either from main or SNAPSHOT builds, 2.10.0-20240628.170903-8 or newer) and report back? Cheers!

PS: I'm happy to add your test class you referenced above, if you're willing to contribute the code under the Apache 2.0 license to this project.

cenodis commented 2 months ago

I have run the exact same tests with the newest snapshot version (c0853fe according to git.properties) and still get some failures: https://gist.github.com/cenodis/a7bb9144aee2ddc4e660649557c57993

Test [3] seems to fail or succeed somewhat randomly while [5] and [6] fail consistently on my machine. For completeness I have included a report from one of the runs where [3] has also failed.

The tests are run on an Ubuntu machine with Linux 6.3.6 if that helps.

ClosedChannelException vs. ClosedByInterruptException: ClosedByInterruptException actually is a subclass of ClosedChannelException.

Not completely sure what you are trying to tell me with that. Why is the inheritance chain of ClosedByInterruptException important? The SocketChannel should throw either a ClosedByInterruptException or an AsynchronousCloseException for interrupts and closes respectively.

I'm happy to add your test class you referenced above, if you're willing to contribute the code under the Apache 2.0 license to this project.

Sure. The code is so basic I have my doubts whether its even licensable. But if it makes you more comfortable I give my blessing to use this code with whatever license this project uses.

I have also taken a look at the changes you have commited so far and there is something that stands out to me. You have a few blocks that have this shape:

begin();
try {
  // blocking op
} catch (/*exceptions*/) {
  if (Thread.currentThread().isInterrupted()) {
    throw new ClosedByInterruptException();
  }
  // more exception handling
} finally {
  end(complete);
}

This interrupt check strikes me as redundant. If an interrupt occurs after (or during) begin() then AbstractInterruptibleChannel::end(boolean) is responsible for detecting it and then throwing the appropriate ClosedByInterruptException. Checking the threads interrupt status manually is therefore unnecessary.

And another thing:

try {
  end(complete);
} catch (ClosedByInterruptException e) {
  throw closeAndThrow(e);
}

end(boolean) handles both interrupts and async close events. It can throw both ClosedByInterruptException and AsynchronousCloseException. Based on my understanding of your code it should catch both to ensure that closeAndThrow is always called. Though I do wonder why this block is even necessary to begin with. Both interrupts and async closes always call AbstractInterruptibleChannel::implCloseChannel (which in turn calls implCloseSelectableChannel()) so the channel should already be fully closed at this point?

kohlschuetter commented 2 months ago

Please try again with the latest snapshot (2.10.0-20240630.191437-9 corresponding to commit 7025a0495d96770d4d1ba76b986fc824518e786e).

I've reworked both the exception handling code as well as your unit test, which is now also included in the selftest. Thanks again, for allowing me to include the test, @cenodis!

Not completely sure what you are trying to tell me with that. Why is the inheritance chain of ClosedByInterruptException important? The SocketChannel should throw either a ClosedByInterruptException or an AsynchronousCloseException for interrupts and closes respectively.

Previously, the test code expected ClosedByInterruptException, whereas now any subclass of ClosedChannelException is valid. However, if ClosedByInterruptException is thrown, the test will check if the Thread#interrupted flag is set.

The Java API specs around SocketChannel/ServerSocketChannel/DatagramChannel permit any kind of IOException to be thrown, not just ClosedByInterruptException and AsynchronousCloseException, particularly ClosedChannelException, which is the common parent class. That one should be thrown if the channel is already closed upon calling, for example.

I think the main change, compared to older versions is that now we properly throw ClosedChannelException (an IOException subclass) instead of a junixsocket-specific SocketClosedException (a SocketException subclass), and that now in all cases the socket should be properly closed.

If there are still discrepancies in what exceptions are thrown compared to the JVM implementations, I would argue that these are an implementation detail that should be ignored. Better follow the specs as to what may happen that what currently does...

Though I do wonder why this block is even necessary to begin with. Both interrupts and async closes always call AbstractInterruptibleChannel::implCloseChannel (which in turn calls implCloseSelectableChannel()) so the channel should already be fully closed at this point?

Unfortunately, "closed" may mean multiple things , so we need to make sure that all resources are properly closed by calling our own close method when required. This currently may or may not be required, but it doesn't hurt since this code is only called when an exception is thrown. I hope that helps.

cenodis commented 2 months ago

Previously, the test code expected ClosedByInterruptException

Yes, and in my opinion this is the behaviour required by the specification. InterruptIssue158Test always interrupts the blocking thread so a ClosedByInterruptException is always expected. Replacing it with a generic ClosedChannelException means code written like this breaks:

try {
  channel.blockingOp();
} catch (ClosedByInterruptException e) {
  // ...
} catch (ClosedChannelException e) {
  // ...
}

To quote InterruptibleChannel:

A channel that implements this interface is asynchronously closeable: If a thread is blocked in an I/O operation on an interruptible channel then another thread may invoke the channel's close method. This will cause the blocked thread to receive an AsynchronousCloseException.

A channel that implements this interface is also interruptible: If a thread is blocked in an I/O operation on an interruptible channel then another thread may invoke the blocked thread's interrupt method. This will cause the channel to be closed, the blocked thread to receive a ClosedByInterruptException, and the blocked thread's interrupt status to be set.

If a thread's interrupt status is already set and it invokes a blocking I/O operation upon a channel then the channel will be closed and the thread will immediately receive a ClosedByInterruptException; its interrupt status will remain set.

To me this reads that if the the blocking operation is interrupted then the method is required to throw a ClosedByInterruptException or subclass thereof, not just a generic ClosedChannelException.

Additionally the blocking methods on SocketChannel all explicitly document the interrupt and async close exceptions. If it was permitted to throw a generic superclass then these cases would be documented as "optional specific" exceptions (see java.nio.file.Files::delete). The fact that this isn't the case means it is required to throw the most specific exception.

Example from SocketChannel::connect(SocketAddress):

ClosedChannelException – If this channel is closed

AsynchronousCloseException – If another thread closes this channel while the connect operation is in progress

ClosedByInterruptException – If another thread interrupts the current thread while the connect operation is in progress, thereby closing the channel and setting the current thread's interrupt status

2 and 3 should already be handled by AbstractInterruptibleChannel assuming both begin and end are properly used for all blocking operations and implCloseChannel wakes up the thread currently blocking on the channel. If complete does not throw a ClosedByInterruptException for the scenarios in InterruptIssue158Test then there is a bug in the test or the channel (or both).

The Java API specs around SocketChannel/ServerSocketChannel/DatagramChannel permit any kind of IOException to be thrown

Yes, but only if you look exclusively at the function signature. And by that logic it could also throw HttpRetryException. As stated above the javadoc specs do list the individual exceptions that can be thrown by a method and it should always throw the most specific exception unless the exception is documented as an "optional specific exception".

Better follow the specs as to what may happen that what currently does

Agreed. I try to avoid including any implementation specific behaviour. See above for my understanding of the specification. And feel free to call out anything you see as implementation specific.

Unfortunately, "closed" may mean multiple things, [...]. This currently may or may not be required, but it doesn't hurt

Thats fine. I just saw a piece of code that looked strange to me and wanted to point it out. Not to demand any changes.

kohlschuetter commented 2 months ago

OK, so I think I have it now. There are a few things I was able to change, and now we get the expected ClosedByInterruptException in all cases where we expect them.

I had to modify your test case again in one particular regard though: Because of a race condition, the interrupt may have occurred after closing the socket from the server-side (there was no delay). In that case, the latest iteration of my changes correctly returned AsynchronousCloseException.

Moreover, I reworked the tests so we can now verify the exact behavior of JEP380 (and Java Inet) sockets on the same code (see the three subclasses in commit 0bed969)

Surprisingly, without a delay of closing the socket (run with -Dselftest.issue.158.delay-close=false), the JEP380 implementation will not throw any exception (which is permitted by the test): the client somehow "successfully" sends the single byte to the server, even though it clearly does not read the byte on the server-side (!)

On the other end, adding a significant delay will cause the tests to work as expected in all cases.

Regarding your concern about specific exceptions, I can only again stress that checking for specific exceptions is not recommended. Using exception handling for flow control is extremely brittle (I've run into it myself while fixing this bug; see commit 66fb640ea41422ba65251e1cdf5ffefb9e69aa05).

By running the test code against the Java SDK JEP380 code, I was able to occasionally trigger a case where testClientInterruptionWithDelay variant 6 throws an unexpected exception that is just a plain IOException, not even an AsynchronousCloseException:

java.io.IOException: Broken pipe
    at java.base/sun.nio.ch.SocketDispatcher.write0(Native Method)
    at java.base/sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:62)
    at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:137)
    at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:102)
    at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:58)
    at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:542)
    at org.newsclub.net.unix/org.newsclub.net.unix.InterruptIssue158Test.lambda$16(InterruptIssue158Test.java:115)
    at org.newsclub.net.unix/org.newsclub.net.unix.InterruptIssue158Test.runOperation(InterruptIssue158Test.java:257)
    at org.newsclub.net.unix/org.newsclub.net.unix.InterruptIssue158Test.lambda$26(InterruptIssue158Test.java:168)
    at java.base/java.lang.VirtualThread.run(VirtualThread.java:309)

So it's fair to say that your assumptions around certain exception types don't hold true even for the Java SDK JEP380 implementation. This actually goes against their own javadocs, see SocketChannel.java, lines 642-647:

    /**
     * @throws  NotYetConnectedException
     *          If this channel is not yet connected
     * @throws  ClosedChannelException      {@inheritDoc}
     * @throws  AsynchronousCloseException  {@inheritDoc}
     * @throws  ClosedByInterruptException  {@inheritDoc}
     */
    public abstract long write(ByteBuffer[] srcs, int offset, int length)
        throws IOException;

I will file a bug to Oracle shortly afterwards. (I do not see this behavior when using standard Java AF_INET sockets, for what it's worth).

Notably, I agree that in this case it should have been an AsynchronousCloseException (or, if the interrupt was received fast enough, ClosedByInterruptException). Therefore, I'm not going to change my code back to mimic the Java SDK :-)

kohlschuetter commented 2 months ago

Please verify with code from the latest commit 9f5377f88a3b3138e22690d04758bfd34e88fb59.

You can also run the latest selftest jar as follows:

java \
    -Dselftest.issue.158.debug=true \
    -Dselftest.enable-module.junixsocket-common.JEP380=true \
    -Dselftest.enable-module.junixsocket-common.JavaInet=true \
    -jar junixsocket-selftest-2.10.0-20240701.224420-13-jar-with-dependencies.jar

Try inserting

    -Dselftest.issue.158.delay-close=false \

after the first line to disable the delay, which should then occassionally trigger the bug in JEP380 code (verified on a MacBook Pro M1 Max). It will also show that junixsocket will throw an AsynchronousCloseException (which is expected, as described above; the test will fail regardless)

kohlschuetter commented 2 months ago

Java bug reported to Oracle, JDK-8335600

cenodis commented 2 months ago

Tested on 262827d and now all relevant methods throw a proper ClosedByInterruptException on interrupts, report the socket as closed and have the thread set as interrupted. Seems good to me.

Thank you for putting up with this nitpicky issue and updating everything to be in line with the spec. Even the optional parts in Socket. I imagine chasing specific error classes is annoying to deal with. Don't think me hammering on specific wording in the docs made it any better.

Surprisingly, without a delay of closing the socket [...] the client somehow "successfully" sends the single byte to the server, even though it clearly does not read the byte on the server-side

That does seem very strange. I could imagine a world where the OS reads the byte into cache even if the consumer application is not ready yet. But that doesn't fit with the even greater delay resulting in a "not-send" state. Can't say I have an idea on what could cause this.

I had to modify your test case

I see where I made the mistake now. Sorry about that. I reviewed it twice and have no idea how that slipped past me.

checking for specific exceptions is not recommended. Using exception handling for flow control is extremely brittle

I actually agree with this sentiment. On the other hand sometimes exception types are the only way to meaningfully distinguish between different kinds of errors. I don't think exceptions vs explicit checks is a productive discussion, at least not in this context. Java has decided to use exceptions to model error scenarios, so its for the best that the ecosystem remains as compatible as possible when interacting with parts of the spec.

This actually goes against their own javadocs

Agreed. This does seem like a bug in the SDK.

kohlschuetter commented 2 months ago

@cenodis I'm glad we got this resolved. Investigating this bug was quite fruitful, as we found a couple more issues in junixsocket, plus a JDK bug :)

kohlschuetter commented 2 months ago

junixsocket 2.10.0 has been released. Please verify and re-open if necessary. Thanks again for reporting , @cenodis !