Windows crash on unregistering watched directory

lptr commented 4 years ago

This is a followup on issue https://github.com/gradle/gradle/issues/13135.

Looks like Gradle is crashing on Windows sometimes after https://github.com/gradle/gradle/commit/09185e8fc34ff9ff3b6792dacd60963fd9c78c64. After this commit we started to potentially unregister paths at the start of a build on Windows.

Crash log

lptr commented 4 years ago

FTR, the PDB generated by @wolfs points to line 128 as the crash location, which probably means the line below:

https://github.com/gradle/native-platform/blob/16b2181e9fdf949385b3aa05ec983e4c263864d2/src/file-events/cpp/win_fsnotifier.cpp#L128-L132

lptr commented 4 years ago

We currently cancel/close watch roots on Windows as part of the WatchPoint destructor. This allows the code to be simple and clean. Doing so relies on the assumption that we can fully close watch points before the destructor finishes.

In theory this is guaranteed by the SleepEx(0, true) call in the destructor that is supposed to process all outstanding APCs. The assumption seems to be well supported by all the tests we run as part of native-platform.

Also important here is that (unlike the Linux and macOS implementations) in the Windows implementation everything (including OS callbacks) should happen on the processing thread. This means we can rule out race conditions in general.

The crash here can be fully explained if we assume that we still get a callback from the OS about the WatchPoint after it has been destructed. In this case the path logged on line 129 would be destructed, and dereferencing it for logging could lead to a crash like the ones we see here.

However, this latter assumption conflicts with the previous assumption.

I see the following options:

1) calling SleepEx(0, true) does not always process every outstanding APC, 2) we don't call SleepEx() properly in some case.

I would like to keep believing that SleepEx() works as we expect it to, as otherwise we'd need to add a lot of defensive code around closing WatchPoints.

One explanation for why SleepEx() might not be called is if cancel() returns false in the destructor:

https://github.com/gradle/native-platform/blob/87dc78528645a3b64ac42f441e6816ffac7a8344/src/file-events/cpp/win_fsnotifier.cpp#L62-L71

I'm not sure why cancel() would return false (maybe there was an error?) but we could remove the if and always do SleepEx() regardless of cancel()'s result. I'll need to reproduce the problem to see if this fixes it.

lptr commented 4 years ago

OK, so here's a theory: suppose that for whatever reason the file handle associated with the watch point became invalid by the time we get to unregisterPath(). Our status is LISTENING at this point.

We try to first CancelIoEx() the handle, but it returns false, because the invalid handle cannot be cancelled. So far so good. We then proceed to report the error value, but before doing so, we call close() to make sure we don't leave anything dangling around.

Now close() is supposed to call CloseHandle() and set the status to FINISHED _even if CloseHandle() fails. We check the return value and log a SEVERE message if there's a problem, but we set the status nevertheless. That is, unless this happens:

If the application is running under a debugger, the function will throw an exception if it receives either a handle value that is not valid or a pseudo-handle value. This can happen if you close a handle twice, or if you call CloseHandle on a handle returned by the FindFirstFile function instead of calling the FindClose function.

Okay, let's unpack this. First and foremost let's acknowledge the breaking of the principle of least astonishment here. I mean, this is the only Windows function that we've seen so far that throws. (BTW, what's meant here is structured exceptions (SEHs), not std::exceptions, just to complicate things a bit further. We'll get back to this later.)

Second, the docs seem to be incorrect here, because we've seen such an exception be thrown by our code:

Couldn't cancel watch point C:\tcagent1\work\a16b87e0a70f8c6e\subprojects\composite-builds\build\tmp\test files\SamplesCompositeBuildIntegrationTest\can_run_app_when_mo...otlin_dsl\k79fe\basic\kotlin\my-app: device or resource busy: device or resource busy

My suspicion is that it's not the presence of the debugger that enables the exception, but the fact that we are compiling the code with debug symbols enabled. Thanks for the clarity, Microsoft!

Anyway, the exception from CloseHandle() bubbles up from cancel() and gets caught and reported here: https://github.com/gradle/native-platform/blob/b8f27b864ff82621cb80e127579c1a75b682be91/src/file-events/cpp/win_fsnotifier.cpp#L62-L71

That means if the exception happens we skip calling SleepEx() (and also skip calling close() again), and we log a warning to Java instead. We also leave the status of the (now destructed) object in LISTENING.

Because we didn't call SleepEx(), there might still be unprocessed events for this now destroyed watch point, that when handled try to de-reference the destroyed path and produce the crashes we've seen above.

(Another weird thing is we run MSVC with exception handling explicitly set to /EHsc, which to my understanding shouldn't catch SEHs. So wow did the catch (const exception& ex) in ~WatchPoint() catch that SEH from CloseHandle()?)

lptr commented 4 years ago

The first step here is to eliminate the crash (see https://github.com/gradle/native-platform/pull/223). Once we have that we can attempt to figure out why we get the device or resource busy error in the first place.

lptr commented 4 years ago

Even after handling the problem with close() I get crashed daemons. :( See build here: https://builds.gradle.org/buildConfiguration/Gradle_Check_VfsRetention_29_bucket8/35931402?buildTab=overview&focusLine=0

Looked up one of the crashes using llvm-pdbutil and this is what I got (line numbers after //):

Stack: [0x000000001aaf0000,0x000000001abf0000],  sp=0x000000001abee850,  free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [native-platform-file-events.dll+0x23b79] // WatchPoint::handleEventsInBuffer() @ win_fsnotifier.cpp:134
C  [native-platform-file-events.dll+0x23d02] // handleEventCallback() @ win_fsnotifier.cpp:76
C  [ntdll.dll+0x9fcde]
C  [ntdll.dll+0x9c6f4]
C  [KERNELBASE.dll+0x46891]
C  [native-platform-file-events.dll+0x245c9] // Server::runLoop() @ win_fsnotifier.cpp:320 // calls SleepEx()
C  [native-platform-file-events.dll+0x3203e] // AbstractServer::executeRunLoop() @ generic_fsnotifier.cpp:122
C  [native-platform-file-events.dll+0x319a0] // Java_net_rubygrapefruit_platform_internal_jni_AbstractFileEventFunctions_00024NativeFileWatcher_executeRunLoop0() @ generic_fsnotifier.cpp:172
C  0x00000000018289a7

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  net.rubygrapefruit.platform.internal.jni.AbstractFileEventFunctions$NativeFileWatcher.executeRunLoop0(Ljava/lang/Object;)V+0
j  net.rubygrapefruit.platform.internal.jni.AbstractFileEventFunctions$NativeFileWatcher.access$100(Lnet/rubygrapefruit/platform/internal/jni/AbstractFileEventFunctions$NativeFileWatcher;Ljava/lang/Object;)V+2
j  net.rubygrapefruit.platform.internal.jni.AbstractFileEventFunctions$NativeFileWatcher$1.run()V+26
v  ~StubRoutines::call_stub

(One thing obvious here is that SleepEx() indeed just calls the callback directly.)

Based on the line numbers we get an access violation error on line 134:

https://github.com/gradle/native-platform/blob/acf85a3bed5836901fbb8497f7318ab699bacff2/src/file-events/cpp/win_fsnotifier.cpp#L127-L141

That's the line we first touch the memory of the WatchPoint itself, i.e. we try to read its status. I'm assuming the access violation is because the WatchPoint has already been destructed by this time.

lptr commented 4 years ago

Here's the daemon log before the crash:

2020-06-12T13:26:36.515+0200 [INFO] [org.gradle.launcher.daemon.server.exec.LogAndCheckHealth] Starting build in new daemon [memory: 477.6 MB]
2020-06-12T13:26:36.517+0200 [INFO] [org.gradle.launcher.daemon.server.exec.ForwardClientInput] Closing daemon's stdin at end of input.
2020-06-12T13:26:36.518+0200 [INFO] [org.gradle.launcher.daemon.server.exec.ForwardClientInput] The daemon will no longer process any standard input.
2020-06-12T13:26:36.531+0200 [DEBUG] [org.gradle.launcher.daemon.server.exec.ExecuteBuild] The daemon has started executing the build.
2020-06-12T13:26:36.531+0200 [DEBUG] [org.gradle.launcher.daemon.server.exec.ExecuteBuild] Executing build with daemon context: DefaultDaemonContext[uid=5ada4520-4799-4fb5-95ac-09b6c634b71a,javaHome=C:\Program Files\Java\jdk1.8,daemonRegistryDir=C:\tcagent1\work\a16b87e0a70f8c6e\subprojects\dependency-management\build\tmp\test files\FailOnDynamicVersionsResolveIntegrationTest\fails_to_resolve_a_...c_version\41icn\daemon,pid=1300,idleTimeout=120000,priority=NORMAL,daemonOpts=-XX:MaxMetaspaceSize=512m,-XX:+HeapDumpOnOutOfMemoryError,-XX:HeapDumpPath=C:\tcagent1\work\a16b87e0a70f8c6e\intTestHomeDir\worker-1,-Xms256m,-Xmx512m,-Dfile.encoding=UTF-8,-Djava.io.tmpdir=C:\tcagent1\work\a16b87e0a70f8c6e\subprojects\dependency-management\build\tmp,-Duser.country=US,-Duser.language=en,-Duser.variant,-ea]
Watching the file system is an incubating feature.
Spent 42 ms registering watches for file system events
Virtual file system retained information about 0 files, 0 directories and 0 missing files since last build

According to the crash log it happened ~4 seconds after the last timestamp in the daemon log:

time: Fri Jun 12 13:26:40 2020
elapsed time: 5 seconds (0d 0h 0m 5s)

lptr commented 4 years ago

The native stack traces for the three test crashes in this build are identical, they all seem to fail at win_fsnotifier.cpp:134.

lptr commented 4 years ago

From the logs it seems that very little happened in the build before it failed. We basically just started up, empty VFS, started watching the project root and then KA-BOOM!

This seems to contradict the idea somewhat that the crash happens after we unregister a path...

gradle / native-platform

Windows crash on unregistering watched directory #218