Open carlbrown opened 7 years ago
I updated BlueSockets to the latest 4.0 toolchain (see https://github.com/IBM-Swift/BlueSocket/tree/UnsupportedWIP/) to see if that was the problem, and the BlueSocket tests run clean on Linux under the latest snapshots, so that's not it.
The investigation continues.
It appears that the issue is happening because, in the test that passes, the attempt to close the socket fails (although the test passes).
Here's the working output (with some extra prints) on the Mac:
Test testHelloEndToEnd() on port 53808
Connection Accepted
Listener called
process setting blocking mode
process resuming reader source <OS_dispatch_source: 0x7f8e8cf0b170> on socket 13
reader source event
reader source event about to read
reader source read 223 bytes
reader source sending to parser
reader source returned from parser
reader source event about to read
close called
socket valid calling close on it
checking reader queue <OS_dispatch_queue_serial: Read Queue 0[0x7f8e8ce2ef40]> before calling cancel
ReaderSource Event Error: Error code: -9982(0x-26FE), Bad file descriptor
close called
checking reader queue <OS_dispatch_queue_serial: Read Queue 0[0x7f8e8ce2ef40]> before calling cancel
reader source cancel
Calling close() on socket -1
close called
And here's the same (failing) call on Linux:
Test testOkEndToEnd() on port 43869
Connection Accepted
Listener called
process setting blocking mode
process resuming reader source Dispatch.DispatchSource on socket 9
reader source event
reader source event about to read
reader source read 228 bytes
reader source sending to parser
reader source returned from parser
close called
socket valid calling close on it
checking reader queue Dispatch.DispatchQueue before calling cancel
checking reader source Dispatch.DispatchSource before calling cancel
Test Case 'ServerTests.testOkEndToEnd' passed (0.021 seconds)
The two diverge at: ReaderSource Event Error: Error code: -9982(0x-26FE), Bad file descriptor
On the Mac when this happens, it indicates that the socket was closed as it was being read. This error doesn't happen on Linux (although the same close()
call does), leaving the socket open so that the next (and subsequent) tests hang.
I can rework things so that this doesn't happen in these tests (and probably will), but that leaves open the possibility that it can happen in the wild, causing a server hang, so it needs to get fixed.
Next step to see if I'm correct, is to replicate this error in a much simpler sample project.
Here is the output of two test runs on the same system, one from Swift3.1.1 and the other from last week's 4.0 Snapshot: swift.test.311.txt swift.test.40.txt
In both cases, the code is doing a blocking call to Glibc.recv()
in a DispatchSource's Read event handler when Glibc.shutdown()
is called. In Swift 3.1.1, we get Got Count 0 from Glibc.recv(-1)
, but in Swift 4.0, the Glibc.recv()
call doesn't return.
I'm also attaching a tar.gz of both source trees, to make it easier for other people to reproduce: issue_22_complete_source.tar.gz
Possibilities:
recv()
to hang.Hi Carl. Am I reading it correctly that this test used to work on Swift 4 development snapshots until about a week ago? Asking because Apple merged several of their own PRs into dispatch during the July 13-July 17 time period. I'm trying to figure out if it is worth selectively backing those out to see if one of them was buggy. They touched bits of code dealing with Linux signal handling, ioctls, and processing of epoll.
As far as i know, it's never worked on Swift 4. It certainly hasn't worked on any of the 4.0-DEVELOPMENT snapshots that I've tried. It has worked on all the 3.x ones. (Sorry about the "on the last few 4.0 snapshots" language. I was trying to indicate that I hadn't tested it on earlier ones, not that it had worked on earlier ones).
One of the things I will do if I run out of other things to try is to go back in time snapshot by snapshot until I find the one where it stopped working. That will be a lot of downloading snapshots, though, so I was hoping to figure it out some other way.
ok, thanks. Then I will look from first principles instead of focusing on what Apple changed in their PRs last week.
Ok, so good news and bad news.
The good news is, I've discovered via brute force that the regression happens between the 2017-05-30 and 2017-05-31 snapshots (e.g. the tests pass with the May 30th snapshot and hang/fail with the May 31st one).
The bad news is there are no libDispatch changes between the swift-DEVELOPMENT-SNAPSHOT-05-30-a
and swift-DEVELOPMENT-SNAPSHOT-05-31-a
tags. So if the snapshots are representative of the tags, it's not a libDispatch issue.
The worse news is that of the changes between those tags, the biggest change is the addition of URLProtocol
. I'm not sure how that could be causal, but the tests that are hanging do use URLSession
, so it's possible that's the issue.
sounds like its over to the Foundation team then.
Quite possibly. I'm just really not sure how a change to URLSession
(happening on one queue) could possibly cause Glibc.close(socket)
not to interrupt Glibc.recv(Socket)
on a different queue.
@pushkarnk
I've discovered can fix this problem by applying a change to BlueSocket
where we call Glibc.shutdown()
when close()
ing a socket that isConnected
in addition to isListening
. (c.f. https://github.com/carlbrown/BlueSocket/commit/f5dc916bf707b12b11634a89655f92c3836f5051#diff-cec92f2c4685d05c3d246f2bb8a42733L3324). This means that BlueSocket will call shutdown()
on both the TCP connection being used to transfer data and the TCP socket that is listening for new connections (instead of just closing the listening socket and hoping that cleans up the accepted ones as well).
I think this is probably the right thing to do in any case (well, really I think BlueSocket should have exposed a shutdown()
method, but that ship has sailed). However, that raises the problem that we have no way of knowing what other repositories out in the wild might break because they are relying on the existing behavior.
The other problem is: that "fix" in no way answers the question about why the previous version works fine on the Mac and on pre-May31st/pre-_HTTPURLProtocol versions of Swift, but fails on post-May30th/post-_HTTPURLProtocol versions.
So I'm not sure if I should apply the BlueSocket fix and just move on, or keep trying to answer the question about what changed, expecting that this won't be the only place that the post-_HTTPURLProtocol behavior might be causing people issues once they move to Swift4.
Opinions? @ianpartridge ? @pushkarnk ?
I tried to get the tests running, but the swift test
command fails for me with this:
Fetching https://github.com/IBM-Swift/CHTTPParser.git
Fetching https://github.com/IBM-Swift/BlueSocket.git
Fetching https://github.com/IBM-Swift/BlueSignals.git
Cloning https://github.com/IBM-Swift/BlueSocket.git
Resolving https://github.com/IBM-Swift/BlueSocket.git at 0.12.56
Cloning https://github.com/IBM-Swift/BlueSignals.git
Resolving https://github.com/IBM-Swift/BlueSignals.git at 0.9.48
Cloning https://github.com/IBM-Swift/CHTTPParser.git
Resolving https://github.com/IBM-Swift/CHTTPParser.git at 0.4.0
Compile CHTTPParser utils.c
Compile CHTTPParser http_parser.c
Compile Swift Module 'Socket' (3 sources)
Compile Swift Module 'Signals' (1 sources)
Compile Swift Module 'HTTP' (7 sources)
Compile Swift Module 'HTTPTests' (3 sources)
Compile Swift Module 'BlueSocketHTTP' (2 sources)
Compile Swift Module 'BlueSocketHTTPTests' (7 sources)
Compile Swift Module 'SwiftServerHTTPPackageTests' (1 sources)
Linking ./.build/x86_64-unknown-linux/debug/SwiftServerHTTPPackageTests.xctest
/usr/bin/ld.gold: error: cannot find -lCHTTPParser
/usr/bin/ld.gold: error: cannot find -lCHTTPParser
/usr/bin/ld.gold: error: cannot find -lCHTTPParser
clang: error: linker command failed with exit code 1 (use -v to see invocation)
<unknown>:0: error: link command failed with exit code 1 (use -v to see invocation)
error: terminated(1): /root/pushkar/swift/usr/bin/swift-build-tool -f /root/pushkar/http/.build/debug.yaml test
Any idea what I am missing here?
You have to remove the link
line from the module.map for CHTTPParser.
See the 0.5.0
tag: https://github.com/IBM-Swift/CHTTPParser/commit/2f5dc431b51544b4c1bd322854f3fbd76abe35b4#diff-f73852fef28faf8dabf66d2f28db6998L20
There's also a tar file in a comment above (https://github.com/swift-server/http/issues/22#issuecomment-316462312) that has the whole directory structure with everything checked out, if that's easier.
Thanks @carlbrown I can see the failures now. I'll start the investigation.
Any status on this @pushkarnk @ianpartridge ?
@carlbrown I see that before the failing tests this message is printed:
Error accepting client connection: Error code: -9994(0x-270A), Invalid argument
Does it mean the test isn't able to connect to the server? Do you see this message in your env?
@pushkarnk No, that means the socket was shutdown while the accept()
system call was listening for new connections (which invalidates the socket
passed as an argument).
You'll see that message on both the May 30th and May 31st snapshots.
The ReaderSource Event Error: Error code: -9982(0x-26FE), Bad file descriptor
message comes on Linux, when running the tests in LLDB and the tests seem to always pass.
What snapshot are you seeing them passing with? What changed since 29 days ago when you said you saw the failures?
@carlbrown I don't see them pass when simply run as swift test
. They pass when the binary is run in LLDB. Again, with GDB I see the hang. This is with the latest toolchain. I am trying to understand how a debugger can make a difference.
https://github.com/apple/swift-corelibs-foundation/pull/1195 has fixed this problem.
URLSession holds references to URLSessionTask. URLSessionTask holds a reference to an EasyHandle, which is a wrapper around a curl easy handle. Per this doc, to close a connection pertaining to an easy handle, curl_easy_cleanup needs to be called. Now, we call curl_easy_cleanup in the deinit for EasyHandle. In the early days of URLSession, to keep URLSessionTasks alive beyond the function in which they were locally held (because, the responses are async), we had introduced a retain cycle between URLSession and URLSessionTask, and we broke it after the task completed. Unfortunately, this cycle breaking was missed in the refactoring exercise. As a result, URLSessionTask objects accumulated in memory, holding EasyHandles along with them. EasyHandle.deinit() and hence curl_easy_cleanup were never invoked, and hence connections were kept alive. Hence this problem in swift-server/http. The client end did not initiate a connection shutdown and the server was stuck in read() blocking the queue on which reads() for the subsequent connections waited indefinitely.
swift test
now goes through with 0 failures.
With the last few swift-4.0 development snapshots, all but one of the
XCTests
with names that end withEndToEnd
fail on Linux with timeouts. These are the tests that start a server on a random port and create aURLSession
to send requests to it Those tests pass fine on Darwin (and on Linux under Swift 3.1.1).If you comment all but one of the EndToEnd tests out, then the one you leave in (which ever one you pick) will pass. With two EndToEnd tests in, regardless of the order you choose, the first one will complete, and the second one will time out.
I'm guessing something isn't being cleaned up correctly, but I'm not sure what. Will update as I learn more.