instacart / TrueTime.swift

NTP library for Swift and Objective-C. Get the true time impervious to device clock changes.
https://tech.instacart.com/truetime/
Apache License 2.0
589 stars 131 forks source link

Fix race condition crash by removing unnecessary retain/release #75

Closed miketsprague closed 5 years ago

miketsprague commented 5 years ago

What did you change and why?

Potential risks introduced?

What tests were performed (include steps)?

Checklist

Additional Information

Why don’t we need to increment the retain count of NTPConnection for the socket’s callback?

NTPConnections are used by NTPClient, and NTPClient is used by the singleton TrueTimeClient. TrueTimeClient maintains a strong reference to NTPClient. NTPClient maintains a strong reference to [NTPConnection]. When removing the connections, each connection is synchronously closed (which removes itself from the socket callback), then the array is cleared. At no point is it necessary to increment the retain count from the sockets as long we’re guaranteed that close will be called before release our strong reference, which it is.

History of this bug

How do you reproduce the new crash?

Okay but why does this happen?

There are actually two stack traces: one in NTPClient, and one in NTPConnection. The latter is the more prevalent one by about 10x. We’ll start with that one. Note: this is probably pretty hard to follow, and I'm not 100% sure about the correctness. All of the logic below is moot by this PR though, and these race conditions no longer exist.

Assume that we are running on a slow internet where the roundtrip time of the requests is about 8 seconds, conveniently the same as the default timeout :P

I’m less sure about the NTPClient crash. I think it might be the same as above, except in that race condition the NTPConnection has managed to be released twice (maybe if it was also cancelled?) and the reference count is 0 for the progress callback. We’d expect it to crash then on line 186 of NTPClient.

tl;dr: we have several different threads with several connections happening at the same time that enqueue differing things, with logic that retains, then releases at different times depending on state (and potentially in different threads). This is both hard to follow and leads to potential race conditions. I do not see a reason for an extra retain and release here at all, since connections are retained by NTPClient and only released after synchronously closing each connection and removing the callback.

fddecc commented 5 years ago

@miketsprague thanks for looking into this - we've had limited time to devote to this issue. I'd like @msanders to also take a look if possible, since he will have most context.

Our apologies for taking time to resolve this issue.

miketsprague commented 5 years ago

Just reporting back here to share results of what I've seen in our app.

With 5.0.2, we had ~3,000 sessions crash from TrueTime over a two week period. In the latest version of our app with the changes from this PR included, we've seen 0 TrueTime crashes, and haven't seen an increase in OOM sessions. Assuming there's no weird side effects that I haven't caught, this has fixed the issue and is working as intended.

Ariandr commented 5 years ago

Hi @miketsprague That's great! The report is very promising, thank you :)

Hi @00FA9A @msanders Have you reviewed the PR? I believe it should be merged if everything is okay.

msanders commented 5 years ago

Sorry for missing this, looks like the mention notification got routed to the wrong email. From what I can tell this was just precautionary, although it's been a while since I looked at the code. The only thing I would confirm is that this isn't causing any memory leaks in instruments, but it sounds like you already did that with your OOM monitoring. Thank you for the great contribution @miketsprague!