swiftlang / swift

The Swift Programming Language
https://swift.org
Apache License 2.0

Compiler hangs occasionally on many-core CPUs on Windows #73532

Closed hjyamauchi closed 5 months ago

hjyamauchi commented 5 months ago

Description

The Swift compiler occasionally hangs during a build. This is seen more frequently on many-core (> 16 cores) machines, in particular AMD Threadripper CPUs with 32 cores / 64 threads. When it happens, several swiftc processes are left running and making no progress, with no (child) swift-frontend processes.

Reproduction

This happens in a large internal app build on Windows. On some particular AMD Threadripper machines, it happens 100% of the time. We have seen hangs on other machines much less frequently, which may be the same issue.

Expected behavior

The build finishes instead of hanging forever.

Environment

Windows

Additional information

No response

hjyamauchi commented 5 months ago

This has been seen to happen only in rare cases (a few random processes) in our large internal Swift application build.

The symptom is that the compiler driver thread is stuck waiting forever in the while loop in Process.waitUntilExit after the child process that it is waiting for has already finished.

Based on inspection with the debugger, the self.isRunning flag is still true, which indicates that the CFSocketCreateWithNative callback never fired, even though the child process has already finished.

Fortunately, adding a small amount of logging doesn't change the reproducibility, but unfortunately adding too much logging makes it go away. So far it has only been reproducible in a large Swift build that involves many, many invocations of swiftc/swift-frontend processes, so it is hard to know in which process it occurs and to attach a debugger in real time. Most of the debugging so far has relied on a limited amount of logging.

Further investigation suggests that the callback never fires because some arbitrary socket file descriptors occasionally get dropped (some bits in the bit vector are cleared) for unknown reasons after they are put into the __CFReadSocketsFds bit vector. As a result, those file descriptors are never tested in the select call and the above callback never fires. It seems to happen around the time the bit vector gets resized via the CFDataIncreaseLength call in __CFSocketFdSet. If I effectively turn off the resizing by increasing the initial size of the bit vector, the hang reliably goes away.

So I suspected a bug in the resize code, but I couldn't spot one. I verified with extra debugging code that the bit vector contents are identical immediately before and after the resize, while still within the same critical section. However, when a different thread accesses the same bit vector in a subsequent critical section, it occasionally finds that some bits have been dropped.
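
For readers unfamiliar with the pattern, the following is a minimal C sketch of a growable descriptor bit vector: setting a bit past the current length grows the buffer, zero-fills the new tail, and must preserve every bit that was already set. FdBitVector and the fdvec_* functions are hypothetical names for illustration only; this is not the CoreFoundation code and it omits the locking discussed here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable bit vector indexed by descriptor number. */
typedef struct {
    uint8_t *bytes;
    size_t   length; /* current size in bytes */
} FdBitVector;

/* Allocates a zeroed vector. Returns 1 on success, 0 on failure. */
static int fdvec_init(FdBitVector *v, size_t initial_bytes) {
    if (initial_bytes == 0) initial_bytes = 1;
    v->bytes = calloc(initial_bytes, 1);
    v->length = v->bytes ? initial_bytes : 0;
    return v->bytes != NULL;
}

/* Sets the bit for fd, growing the buffer if needed.
 * Returns 1 if a resize happened, 0 if not, -1 on allocation failure. */
static int fdvec_set(FdBitVector *v, unsigned fd) {
    size_t needed = (size_t)fd / 8 + 1;
    int grew = 0;
    if (needed > v->length) {
        uint8_t *p = realloc(v->bytes, needed);
        if (p == NULL) return -1;
        /* realloc keeps the old contents; only the new tail is cleared. */
        memset(p + v->length, 0, needed - v->length);
        v->bytes = p;
        v->length = needed;
        grew = 1;
    }
    v->bytes[fd / 8] |= (uint8_t)(1u << (fd % 8));
    return grew;
}

/* Returns 1 if the bit for fd is set, 0 otherwise. */
static int fdvec_is_set(const FdBitVector *v, unsigned fd) {
    assert(v != NULL);
    size_t idx = (size_t)fd / 8;
    return idx < v->length && ((v->bytes[idx] >> (fd % 8)) & 1u);
}

static void fdvec_free(FdBitVector *v) {
    free(v->bytes);
    v->bytes = NULL;
    v->length = 0;
}
```

In this model, a larger initial_bytes means fdvec_set takes the resize path less often, which is the effect of increasing the initial size of the bit vector.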

I also checked that access to the bit vector is properly synchronized and found no issues. I also instrumented the other points in the code where bits in the bit vector could potentially be cleared, but didn't find anything suspicious. This looks like data corruption of some kind. My current theory is some sort of racy data/heap corruption in lower-level code, such as a race-condition bug in the underlying memory allocators (CFDataAllocator, etc.) or the lock implementation (CFLock, etc.), unless it's a broken CPU/hardware or something like that.

hjyamauchi commented 5 months ago

https://github.com/apple/swift-corelibs-foundation/pull/4951 is a suggested workaround that reliably avoids this hang by reducing the chance of bit vector resizing by allocating a larger initial size. Ideally we'd fix the root cause, but given the cost/benefit tradeoff and that this code is deprecated and is going to be replaced by swift-foundation, I hope this will at least unblock us and allow us to further experiment around this issue.

lxbndr commented 5 months ago

I am afraid that on Windows it is even more complicated. I did some research on this a while ago, because we noticed that creating too many CFSockets makes a test app hang. The reason for this weird behavior is that fd_set is not a bit set on Windows, while the CoreFoundation code is written with a bit set in mind. I guess we have no choice other than to rewrite some parts to use platform-specific fd_set handling to make everything work correctly. And this is quite a challenging task, as a bit set gives some advantages and simplifies a lot of things (like the capacity growth you mentioned).
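
To illustrate the difference, here is a rough C sketch of an array-style set modeled on the Winsock layout (a count followed by an array of socket handles). ArrayFdSet, SketchSocket, SKETCH_SETSIZE and the array_* helpers are hypothetical names, not the real Winsock or CoreFoundation definitions; the capacity of 64 is an assumption that mirrors the default FD_SETSIZE on Windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SETSIZE 64 /* assumption: mirrors the default Winsock FD_SETSIZE */

typedef uintptr_t SketchSocket; /* hypothetical stand-in for SOCKET */

/* Array-style set: membership is "the handle appears in the array". */
typedef struct {
    unsigned     count;
    SketchSocket array[SKETCH_SETSIZE];
} ArrayFdSet;

static int array_contains(const ArrayFdSet *s, SketchSocket h) {
    assert(s != NULL);
    for (unsigned i = 0; i < s->count; i++) {
        if (s->array[i] == h) return 1;
    }
    return 0;
}

/* Adds h if absent. Returns 0 when the set is full, 1 otherwise. */
static int array_add(ArrayFdSet *s, SketchSocket h) {
    if (array_contains(s, h)) return 1;
    if (s->count >= SKETCH_SETSIZE) return 0;
    s->array[s->count++] = h;
    return 1;
}

/* Removes h by moving the last entry into its slot (order not preserved). */
static void array_remove(ArrayFdSet *s, SketchSocket h) {
    for (unsigned i = 0; i < s->count; i++) {
        if (s->array[i] == h) {
            s->array[i] = s->array[--s->count];
            return;
        }
    }
}

/* For contrast: bytes a bitmap would need to hold a bit for handle value h. */
static size_t bitmap_bytes_needed(SketchSocket h) {
    return (size_t)(h / 8) + 1;
}
```

The capacity of the array-style set depends on how many sockets are stored, not on how large a handle value is, so "grow the buffer until the highest descriptor number fits" does not carry over from a bitmap.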

I stopped working on this because the only issue I noticed was one synthetic test. It is unfortunate that this issue affects the compiler in such a drastic way 😞

Here is my WIP commit with the initial fix I made, just for reference. tbh I don't even remember all the "how and why"s, but I hope it describes the idea at least. And it fixes the fd_set growth problem in vitro.

hjyamauchi commented 5 months ago

@lxbndr Oh my... thanks for posting and the patch :) I'm intrigued by the fact that it works to this extent despite this issue 🤯

I confirmed that your WIP commit reliably fixes the hang in our internal build, as is.

Would you be willing to put up a PR out of it? That would definitely unblock us. It'd be great if we can merge it.

lxbndr commented 5 months ago

@hjyamauchi I guess we can do that, even if it is not perfect. If it makes sense and fixes real issues, it is worth a try.

hjyamauchi commented 5 months ago

@lxbndr thanks for the fix!