Fix race condition crash by removing unnecessary retain/release

miketsprague commented 5 years ago

What did you change and why?

Remove unnecessary retain/release logic in NTPConnection that was crashing after being released too many times in some race conditions.
Retaining here is not necessary. It was originally implemented here as part of fixing an issue where HostResolver didn’t retain itself for resolving connections when the initial attempt timed out. From what I can tell, it was just added here to be safe. @msanders if you’re still active and somehow remember, feel free to correct me if I’m wrong!

Potential risks introduced?

If I'm missing something, or if NTPConnection is used elsewhere in your app, since the socket isn't retaining it it's possible that a socket callback could cause a crash.
If query(addresses: [SocketAddress], host: String) is called when connections is not cleared, that could cause the same crash as above. I cannot find a case where this happens, and it's in a private extension.

What tests were performed (include steps)?

See steps to reproduce below. I've performed those, and additionally I will deploy my own fork in my codebase and will report back the stability results.

Checklist

[ ] Unit/UI tests have been written (if necessary)
[x] Manually tested

Additional Information

Why don’t we need to increment the retain count of `NTPConnection` for the socket’s callback?

NTPConnections are used by NTPClient, and NTPClient is used by the singleton TrueTimeClient. TrueTimeClient maintains a strong reference to NTPClient. NTPClient maintains a strong reference to [NTPConnection]. When removing the connections, each connection is synchronously closed (which removes itself from the socket callback), then the array is cleared. At no point is it necessary to increment the retain count from the sockets as long we’re guaranteed that close will be called before release our strong reference, which it is.

History of this bug

Increasing the retain count to fix bug with HostResolver
- This caused a memory leak because NTPConnection was not getting released on a timeout, only on success.
Fixing memory leak with timeouts
- This fixed the above issue, but caused the crash in 5.0.0/1
Fix specific crash from above
- Fixed the issue that @00FA9A was able to reproduce.
- This causes an increase in crashes, because it just increased the race condition surface area around when timeouts happen.

How do you reproduce the new crash?

Create a sample project where you start the TrueTimeClient instance and call fetchIfNeeded
Edit the timeout of the NTPConnection objects to be really small (ex: 0.0001) NTPConnection.swift line 28. If you felt more ambitious, you could fire up the Network Link Conditioner and set it to be 8 seconds or more.
Lock the thread for the length of time it’ll take the network response to come back in NTPConnection complete after the release() is called. I just added Thread.sleep(forTimeInterval: 0.5) on line 207.
Run the code. It will crash, assuming the timeoutError is hit.
Sample NTPConnection.swift if you want to try for yourself.

Okay but why does this happen?

There are actually two stack traces: one in NTPClient, and one in NTPConnection. The latter is the more prevalent one by about 10x. We’ll start with that one. Note: this is probably pretty hard to follow, and I'm not 100% sure about the correctness. All of the logic below is moot by this PR though, and these race conditions no longer exist.

Assume that we are running on a slow internet where the roundtrip time of the requests is about 8 seconds, conveniently the same as the default timeout :P

NTPConnections are created.
When they are started, the reference count of each is incremented by 1 with passRetained. The socket completion callback (dataCallback) is then enqueued to be executed on the main thread.
The timeout of 8 seconds hits (timeoutError), and complete(.failure(error)) is called. This timer callback happens on the lock queue (instacart npt connection queue).
complete(), calls close(), but notice in that implementation of close all of the code is executed asynchronously on the lockQueue. Calling lockQueue.async if you’re already on lockQueue will append that code to the end of the current runloop. This is important because it means that the socket callback isn't actually removed yet.
Back in complete(), we’re specifically a timeout error, so onComplete is called on the callbackQueue (instacart ntp client queue, this is a different thread).
Finally, release() is called if callbackPending is true (which it is here).
While the above is happening, the NTPConnection completion handler is still active, because it’s been appended to the end of the current run loop.
The socket success completion comes in at some point during this time and is run on the main thread (dataCallback). release() is then called again. Not good. We’ve now accidentally released this twice.
The next connection in the queue will eventually then call onComplete.
Our ref count for the original connection is one less than what it’s supposed to be. Go look at the definition for onComplete on line 35. Since there’s both the reference to connections in NTPClient and the connection here, the ref count for the NTPConnection is still one. We can call progress(connection, result) without any issue.
However, once throttleConnections is called, and the above extra reference in the closure is released, $0.canRetry now causes a crash because the reference count is 0.

I’m less sure about the NTPClient crash. I think it might be the same as above, except in that race condition the NTPConnection has managed to be released twice (maybe if it was also cancelled?) and the reference count is 0 for the progress callback. We’d expect it to crash then on line 186 of NTPClient.

tl;dr: we have several different threads with several connections happening at the same time that enqueue differing things, with logic that retains, then releases at different times depending on state (and potentially in different threads). This is both hard to follow and leads to potential race conditions. I do not see a reason for an extra retain and release here at all, since connections are retained by NTPClient and only released after synchronously closing each connection and removing the callback.

fddecc commented 5 years ago

@miketsprague thanks for looking into this - we've had limited time to devote to this issue. I'd like @msanders to also take a look if possible, since he will have most context.

Our apologies for taking time to resolve this issue.

miketsprague commented 5 years ago

Just reporting back here to share results of what I've seen in our app.

With 5.0.2, we had ~3,000 sessions crash from TrueTime over a two week period. In the latest version of our app with the changes from this PR included, we've seen 0 TrueTime crashes, and haven't seen an increase in OOM sessions. Assuming there's no weird side effects that I haven't caught, this has fixed the issue and is working as intended.

Ariandr commented 5 years ago

Hi @miketsprague That's great! The report is very promising, thank you :)

Hi @00FA9A @msanders Have you reviewed the PR? I believe it should be merged if everything is okay.

msanders commented 5 years ago

Sorry for missing this, looks like the mention notification got routed to the wrong email. From what I can tell this was just precautionary, although it's been a while since I looked at the code. The only thing I would confirm is that this isn't causing any memory leaks in instruments, but it sounds like you already did that with your OOM monitoring. Thank you for the great contribution @miketsprague!

instacart / TrueTime.swift