FTR, the PDB generated by @wolfs points to line 128 as the crash location, which probably means the line below:
We currently cancel/close watch roots on Windows as part of the `WatchPoint` destructor. This keeps the code simple and clean, but it relies on the assumption that we can fully close watch points before the destructor finishes. In theory this is guaranteed by the `SleepEx(0, true)` call in the destructor, which is supposed to process all outstanding APCs. The assumption seems well supported by all the tests we run as part of native-platform.
Also important here is that (unlike the Linux and macOS implementations) in the Windows implementation everything (including OS callbacks) should happen on the processing thread. This means we can rule out race conditions in general.
The crash here can be fully explained if we assume that we still get a callback from the OS about the `WatchPoint` after it has been destructed. In that case the `path` logged on line 129 would already be destructed, and dereferencing it for logging could lead to a crash like the ones we see here. However, this latter assumption conflicts with the previous one.
I see the following options:
1) calling `SleepEx(0, true)` does not always process every outstanding APC, or
2) we don't call `SleepEx()` properly in some case.

I would like to keep believing that `SleepEx()` works as we expect it to, as otherwise we'd need to add a lot of defensive code around closing `WatchPoint`s.
One explanation for why `SleepEx()` might not be called is if `cancel()` returns `false` in the destructor. I'm not sure why `cancel()` would return `false` (maybe there was an error?), but we could remove the `if` and always call `SleepEx()` regardless of `cancel()`'s result. I'll need to reproduce the problem to see if this fixes it.
OK, so here's a theory: suppose that for whatever reason the file handle associated with the watch point became invalid by the time we get to `unregisterPath()`. Our status is `LISTENING` at this point. We first try to `CancelIoEx()` the handle, but it returns `false`, because the invalid handle cannot be cancelled. So far so good. We then proceed to report the error value, but before doing so we call `close()` to make sure we don't leave anything dangling around.
Now `close()` is supposed to call `CloseHandle()` and set the status to `FINISHED` *even if* `CloseHandle()` fails. We check the return value and log a `SEVERE` message if there's a problem, but we set the status nevertheless. That is, unless this happens:
> If the application is running under a debugger, the function will throw an exception if it receives either a handle value that is not valid or a pseudo-handle value. This can happen if you close a handle twice, or if you call CloseHandle on a handle returned by the FindFirstFile function instead of calling the FindClose function.
Okay, let's unpack this. First and foremost, let's acknowledge the breaking of the principle of least astonishment here: this is the only Windows function we've seen so far that throws. (BTW, what's meant here is structured exceptions (SEH), not `std::exception`s, just to complicate things a bit further. We'll get back to this later.)
Second, the docs seem to be incorrect here, because we've seen such an exception thrown from our code:

```
Couldn't cancel watch point C:\tcagent1\work\a16b87e0a70f8c6e\subprojects\composite-builds\build\tmp\test files\SamplesCompositeBuildIntegrationTest\can_run_app_when_mo...otlin_dsl\k79fe\basic\kotlin\my-app: device or resource busy: device or resource busy
```
My suspicion is that it's not the presence of the debugger that enables the exception, but the fact that we are compiling the code with debug symbols enabled. Thanks for the clarity, Microsoft!
Anyway, the exception from `CloseHandle()` bubbles up from `cancel()` and gets caught and reported here: https://github.com/gradle/native-platform/blob/b8f27b864ff82621cb80e127579c1a75b682be91/src/file-events/cpp/win_fsnotifier.cpp#L62-L71

That means if the exception happens, we skip calling `SleepEx()` (and also skip calling `close()` again), and log a warning to Java instead. We also leave the status of the (now destructed) object as `LISTENING`.
Because we didn't call `SleepEx()`, there might still be unprocessed events for this now-destroyed watch point that, when handled, try to dereference the destroyed `path` and produce the crashes we've seen above.
(Another weird thing: we run MSVC with exception handling explicitly set to `/EHsc`, which to my understanding shouldn't catch SEH exceptions. So how did the `catch (const exception& ex)` in `~WatchPoint()` catch that SEH from `CloseHandle()`?)
The first step here is to eliminate the crash (see https://github.com/gradle/native-platform/pull/223). Once we have that, we can attempt to figure out why we get the `device or resource busy` error in the first place.
Even after handling the problem with `close()` I still get crashed daemons. :( See build here: https://builds.gradle.org/buildConfiguration/Gradle_Check_VfsRetention_29_bucket8/35931402?buildTab=overview&focusLine=0
Looked up one of the crashes using `llvm-pdbutil`, and this is what I got (line numbers after `//`):
```
Stack: [0x000000001aaf0000,0x000000001abf0000], sp=0x000000001abee850, free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [native-platform-file-events.dll+0x23b79] // WatchPoint::handleEventsInBuffer() @ win_fsnotifier.cpp:134
C [native-platform-file-events.dll+0x23d02] // handleEventCallback() @ win_fsnotifier.cpp:76
C [ntdll.dll+0x9fcde]
C [ntdll.dll+0x9c6f4]
C [KERNELBASE.dll+0x46891]
C [native-platform-file-events.dll+0x245c9] // Server::runLoop() @ win_fsnotifier.cpp:320 // calls SleepEx()
C [native-platform-file-events.dll+0x3203e] // AbstractServer::executeRunLoop() @ generic_fsnotifier.cpp:122
C [native-platform-file-events.dll+0x319a0] // Java_net_rubygrapefruit_platform_internal_jni_AbstractFileEventFunctions_00024NativeFileWatcher_executeRunLoop0() @ generic_fsnotifier.cpp:172
C 0x00000000018289a7
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j net.rubygrapefruit.platform.internal.jni.AbstractFileEventFunctions$NativeFileWatcher.executeRunLoop0(Ljava/lang/Object;)V+0
j net.rubygrapefruit.platform.internal.jni.AbstractFileEventFunctions$NativeFileWatcher.access$100(Lnet/rubygrapefruit/platform/internal/jni/AbstractFileEventFunctions$NativeFileWatcher;Ljava/lang/Object;)V+2
j net.rubygrapefruit.platform.internal.jni.AbstractFileEventFunctions$NativeFileWatcher$1.run()V+26
v ~StubRoutines::call_stub
```
(One thing obvious here is that `SleepEx()` indeed just calls the callback directly.)
Based on the line numbers we get an access violation error on line 134:
That's the first line where we touch the memory of the `WatchPoint` itself, i.e. we try to read its `status`. I'm assuming the access violation occurs because the `WatchPoint` has already been destructed by this time.
Here's the daemon log before the crash:
```
2020-06-12T13:26:36.515+0200 [INFO] [org.gradle.launcher.daemon.server.exec.LogAndCheckHealth] Starting build in new daemon [memory: 477.6 MB]
2020-06-12T13:26:36.517+0200 [INFO] [org.gradle.launcher.daemon.server.exec.ForwardClientInput] Closing daemon's stdin at end of input.
2020-06-12T13:26:36.518+0200 [INFO] [org.gradle.launcher.daemon.server.exec.ForwardClientInput] The daemon will no longer process any standard input.
2020-06-12T13:26:36.531+0200 [DEBUG] [org.gradle.launcher.daemon.server.exec.ExecuteBuild] The daemon has started executing the build.
2020-06-12T13:26:36.531+0200 [DEBUG] [org.gradle.launcher.daemon.server.exec.ExecuteBuild] Executing build with daemon context: DefaultDaemonContext[uid=5ada4520-4799-4fb5-95ac-09b6c634b71a,javaHome=C:\Program Files\Java\jdk1.8,daemonRegistryDir=C:\tcagent1\work\a16b87e0a70f8c6e\subprojects\dependency-management\build\tmp\test files\FailOnDynamicVersionsResolveIntegrationTest\fails_to_resolve_a_...c_version\41icn\daemon,pid=1300,idleTimeout=120000,priority=NORMAL,daemonOpts=-XX:MaxMetaspaceSize=512m,-XX:+HeapDumpOnOutOfMemoryError,-XX:HeapDumpPath=C:\tcagent1\work\a16b87e0a70f8c6e\intTestHomeDir\worker-1,-Xms256m,-Xmx512m,-Dfile.encoding=UTF-8,-Djava.io.tmpdir=C:\tcagent1\work\a16b87e0a70f8c6e\subprojects\dependency-management\build\tmp,-Duser.country=US,-Duser.language=en,-Duser.variant,-ea]
Watching the file system is an incubating feature.
Spent 42 ms registering watches for file system events
Virtual file system retained information about 0 files, 0 directories and 0 missing files since last build
```
According to the crash log it happened ~4 seconds after the last timestamp in the daemon log:
```
time: Fri Jun 12 13:26:40 2020
elapsed time: 5 seconds (0d 0h 0m 5s)
```
The native stack traces for the three test crashes in this build are identical; they all seem to fail at `win_fsnotifier.cpp:134`.
From the logs it seems that very little happened in the build before it failed: we basically just started up with an empty VFS, started watching the project root, and then KA-BOOM! This somewhat contradicts the idea that the crash happens after we unregister a path...
This is a followup on issue https://github.com/gradle/gradle/issues/13135.
Looks like Gradle is crashing on Windows sometimes after https://github.com/gradle/gradle/commit/09185e8fc34ff9ff3b6792dacd60963fd9c78c64. After this commit we started to potentially unregister paths at the start of a build on Windows.
Crash log