microsoft / vscode

Visual Studio Code
https://code.visualstudio.com
MIT License

Startup of insiders is hung when opening anything #142786

Closed rchiodo closed 2 years ago

rchiodo commented 2 years ago

Does this issue occur when all extensions are disabled?: Can't tell. Went away on restart

Steps to Reproduce:

  1. Open a folder
  2. Open a notebook

I get this:

[screenshot of the warning message attached]

roblourens commented 2 years ago

I hit this once too. That warning message implies that loading extensions is stuck, cc @deepak1556 in case this is related to the electron update?

deepak1556 commented 2 years ago

This seems to be the same as https://github.com/microsoft/vscode/issues/140114, but we didn't have a clear repro at that time. @alexdima are there any additional logs we can obtain to narrow down where exactly the extension host got stuck while loading extensions?

alexdima commented 2 years ago

Another occurrence in https://github.com/microsoft/vscode/issues/142766

alexdima commented 2 years ago

@bpasero @deepak1556 Do we plan to ship Electron 16 on stable? This might be a blocker.

deepak1556 commented 2 years ago

@alexdima yup the current plan is to ship this milestone.

Maybe we can add some additional logging to help narrow down where in the extension load code path things are getting stuck; that would help if we get similar reports once this ships to stable. So far we have had only a single occurrence of this issue, so there was not much to go on when diagnosing it. I didn't want to block the update because of that.

Also, if we revert to Electron 13 we will be on an EOL release line (https://www.electronjs.org/blog/electron-17-0#end-of-support-for-13xy). Anything > 13 will have render process reuse enabled by default, and the downside of reverting would be losing the self-hosting coverage we get from Insiders.

alexdima commented 2 years ago

@deepak1556 I was able to reproduce when running scripts\code.bat and doing a few Ctrl+R reloads on a Windows VM (i.e. a slow machine). It may need 10 reloads, but it repros eventually. I am using a TS file with a compilation error, and the hint is that the compilation error doesn't show; there is no other indication that something is wrong. I believe this might be the same weird behavior around named pipes / sockets that do not send data that we saw some time ago when trying out Playwright for Electron.

@bpasero @deepak1556 Could you please also try to reproduce?

deepak1556 commented 2 years ago

Oh was not aware of that, will give it a try. Thanks for the steps!

bpasero commented 2 years ago

I tried to reproduce in my Windows 11 ARM VM and was not successful. Tried a few window reloads on a small TypeScript project with an editor open that has a compile error.

deepak1556 commented 2 years ago

@bpasero I am also on a Windows 11 ARM VM and couldn't repro it at first, but the trick to repro it is to have some memory-intensive tasks running in the background (building Electron does the trick for me). I am now able to repro the issue.

deepak1556 commented 2 years ago

Created a minimal repro https://gist.github.com/deepak1556/59f981cd8a55cda92beadd6bf97330a9 that shows a ping/pong via Node.js sockets between the renderer process and a child process forked from the main process. After a couple of reloads, the server/renderer gets into a state where async tasks get queued up and are not processed until another explicit task is executed to flush the queue.
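The shape of that repro can be sketched portably (a simplified Python stand-in for the gist's Node.js code, not the actual gist; an in-process socketpair and threads replace the Electron renderer and the named pipe to the forked child):

```python
# Portable sketch of the repro's ping/pong shape. Assumption: plain Python
# threads over an in-process socketpair stand in for the Electron renderer
# and the Node child forked from the main process.
import socket
import threading

def serve(sock):
    """'Renderer' side: answer a ping with a pong."""
    data = sock.recv(1024)
    print("rec'd from client:", data.decode())
    sock.sendall(b"pong!")

def ping(sock):
    """'Forked process' side: send a ping and wait for the pong."""
    sock.sendall(b"ping 0")
    reply = sock.recv(1024)
    print("rec'd from server:", reply.decode())
    return reply

server_sock, client_sock = socket.socketpair()
t = threading.Thread(target=serve, args=(server_sock,))
t.start()
ping(client_sock)
t.join()
```

In the bad state described above, the server side's equivalent of serve() sits queued and unprocessed until some other task flushes the event loop.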

An example state when event loop is paused:

Forked process: Child process created
Forked process: Connected to named pipe as client
[4124:0223/200645.092:INFO:CONSOLE(35)] "Connected to named pipe as server", source: file:///C:/Users/demohan/github/electron-quick-start/renderer.js (35)
Forked process: rec'd from server: pong!
[4124:0223/200658.631:INFO:CONSOLE(48)] "rec'd from client: ping 0", source: file:///C:/Users/demohan/github/electron-quick-start/renderer.js (48)

At this point, executing setImmediate(() => { console.log('hello') }) resumes the loop for some time, but it gets back into the bad state after a few more tasks.

At first I thought it might be a side effect of the occlusion tracker on Windows (https://chromium.googlesource.com/chromium/src/+/master/docs/windows_native_window_occlusion_tracking.md) putting the renderer into a suspended state, but the bug repros even when launching with --disable-features=CalculateNativeWinOcclusion.

I am pretty confident that this bug is what we are seeing at larger scale with our extension host setup. Continuing to investigate why the Electron renderer gets into this state; it looks like a return of https://github.com/electron/electron/issues/1968 from the past, now that render process reuse is enabled.

zcbenz commented 2 years ago

It seems that after reloading the window, the IOCP will get into a weird state and event loop integration will stop functioning, so a workaround is to recreate the IOCP for each reload:

--- a/shell/common/node_bindings.cc
+++ b/shell/common/node_bindings.cc
@@ -576,7 +576,11 @@ void NodeBindings::LoadEnvironment(node::Environment* env) {
 }

 void NodeBindings::PrepareMessageLoop() {
-#if !BUILDFLAG(IS_WIN)
+#if BUILDFLAG(IS_WIN)
+  if (uv_loop_->iocp && uv_loop_->iocp != INVALID_HANDLE_VALUE)
+    CloseHandle(uv_loop_->iocp);
+  uv_loop_->iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 2);
+#else
   int handle = uv_backend_fd(uv_loop_);

   // If the backend fd hasn't changed, don't proceed.

I'm still trying to figure out the root cause.

deepak1556 commented 2 years ago

Currently testing the above workaround with vscode.

deepak1556 commented 2 years ago

After a bit of testing, the extension host does not become unresponsive when the IOCP is reinitialized on reload. However, we now get a prompt that says Reloading the window is taking longer than expected.., and it seems the issue arises from the pending operation join.closeStorage. I will have to investigate why that regressed.

@zcbenz instead of reinitializing the IOCP for the same-process reload case, can we instead close the default uv loop (and also perform the background thread cleanup that is usually done only in ~NodeBindings) and reinitialize after reload? Since we destroy the Node environment for reload, I feel this should be safe to do. Thoughts?

zcbenz commented 2 years ago

When navigating, Chromium may load the new page before the old page is destroyed, so it is possible for multiple environments to share the same loop in one process, and we probably don't have a good point at which to close the default uv loop. But reinitializing the IOCP has the same problem too.

In theory, destroying and initializing Node.js environments should not affect the uv loop at all; reading the code, I couldn't find anywhere that closes or initializes the uv loop. But somehow having multiple Node.js environments puts our node integration into weird states. There were some efforts to work around it (https://github.com/electron/electron/pull/25869 and https://github.com/electron/electron/pull/27582), but I think they did not fix the root cause.

zcbenz commented 2 years ago

It seems that the node integration only breaks when the first Node environment is destroyed; everything works fine if we just create and destroy subframes or child windows. So I wonder if we can work around the problem by creating a dummy Node environment in the renderer before any web frame is created and keeping it alive forever.

The root cause is still mysterious to me though.

deepak1556 commented 2 years ago

@zcbenz I looked a bit more into the root issue and have some updates.

a) First, let me walk through the basics of Electron's event loop integration on Windows and make sure I haven't misunderstood any of the components.

1) Electron polls the IOCP on a background thread via GetQueuedCompletionStatus to know when a completion packet is available for an async I/O operation posted to this IOCP by the libuv event loop.

2) Once a packet is available, Electron posts the packet back into the queue via PostQueuedCompletionStatus, interrupts the main thread via the Chromium task runner, and runs the libuv event loop on the main thread, which then processes the completion packet.

3) The IOCP is created by libuv, and only once, when the first call to uv_default_loop is made: https://github.com/libuv/libuv/blob/4296fec7f50145e2a307f3db7ae22984713976a7/src/win/core.c#L236

4) Even across reloads, when the process is reused, the IOCP created in the previous step remains the same, as the default loop never gets destroyed.
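The handshake in points 1 and 2 can be simulated with portable stand-ins (a toy Python model of the shape of the integration, not Electron's actual code; the queues are made-up stand-ins for the IOCP and the Chromium task runner):

```python
# Toy simulation of the polling handshake in points 1-2. Assumption: queues
# stand in for the IOCP and for the Chromium main-thread task runner; this
# mirrors the shape of the integration only, not Electron's actual code.
import queue
import threading

iocp = queue.Queue()        # stand-in for the I/O completion port
main_tasks = queue.Queue()  # stand-in for the Chromium task runner

def embed_thread_runner():
    # Point 1: block until a completion packet is available.
    packet = iocp.get()
    # Point 2: hand the packet over and interrupt the main thread,
    # which will run the libuv loop and process it.
    main_tasks.put(packet)

bg = threading.Thread(target=embed_thread_runner)
bg.start()
iocp.put("completion-packet")   # an async I/O operation completes
bg.join()
processed = main_tasks.get()    # "main thread" runs the loop once
print(processed)
```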

b) Next, a few principles of IOCP usage stated in https://docs.microsoft.com/en-us/windows/win32/fileio/i-o-completion-ports that are relevant to this bug:

1) A thread (either one created by the main thread or the main thread itself) uses the GetQueuedCompletionStatus function to wait for a completion packet to be queued to the I/O completion port, rather than waiting directly for the asynchronous I/O to complete. Threads that block their execution on an I/O completion port are released in last-in-first-out (LIFO) order, and the next completion packet is pulled from the I/O completion port's FIFO queue for that thread. This means that, when a completion packet is released to a thread, the system releases the last (most recent) thread associated with that port, passing it the completion information for the oldest I/O completion.

2) Although any number of threads can call GetQueuedCompletionStatus for a specified I/O completion port, when a specified thread calls GetQueuedCompletionStatus the first time, it becomes associated with the specified I/O completion port until one of three things occurs: The thread exits, specifies a different I/O completion port, or closes the I/O completion port. In other words, a single thread can be associated with, at most, one I/O completion port.

3) When a completion packet is queued to an I/O completion port, the system first checks how many threads associated with that port are running. If the number of threads running is less than the concurrency value (discussed in the next section), one of the waiting threads (the most recent one) is allowed to process the completion packet. When a running thread completes its processing, it typically calls GetQueuedCompletionStatus again, at which point it either returns with the next completion packet or waits if the queue is empty.

4) The most important property of an I/O completion port to consider carefully is the concurrency value. The concurrency value of a completion port is specified when it is created with CreateIoCompletionPort via the NumberOfConcurrentThreads parameter. This value limits the number of runnable threads associated with the completion port. When the total number of runnable threads associated with the completion port reaches the concurrency value, the system blocks the execution of any subsequent threads associated with that completion port until the number of runnable threads drops below the concurrency value.

5) The system also allows a thread waiting in GetQueuedCompletionStatus to process a completion packet if another running thread associated with the same I/O completion port enters a wait state for other reasons, for example the SuspendThread function. When the thread in the wait state begins running again, there may be a brief period when the number of active threads exceeds the concurrency value. However, the system quickly reduces this number by not allowing any new active threads until the number of active threads falls below the concurrency value.

c) Finally, here is what happens in the process-reuse scenario on Windows:

1) Libuv creates the IOCP with a concurrency value of 1, i.e. only one active running thread can process a completion packet at a time (point b.4).

2) Although the polling thread NodeBindings::EmbedThreadRunner makes the call to GetQueuedCompletionStatus, on creation it first waits on the semaphore that gets signalled from the main thread only when NodeBindings::UvRunOnce is called, which we trigger for the first time when the process is created, via NodeBindings::RunMessageLoop. This causes the main thread to call uv_run, which in turn calls GetQueuedCompletionStatus, and hence the main thread gets associated with the IOCP (point b.2).

3) But since the background thread is in a wait state on the semaphore, we still satisfy points b.3 and b.5, and hence are able to process the IOCP on this thread once the signal arrives from the main thread via NodeBindings::PollEvents.

4) Now on reload, when reinitializing the default Node environment, we make calls to NodeBindings::PrepareMessageLoop and NodeBindings::RunMessageLoop, which end up creating a new background thread, so the thread count increases with the number of reloads. In this situation more than one background thread can get to operate on the IOCP due to a race, and a thread ends up suspended in its GetQueuedCompletionStatus call due to point b.4.

At this point the background thread is unable to signal back to the main thread, and the user-facing bug we see is that none of the Node callbacks get executed, as in the repro. Another test that confirmed this: expose NodeBindings::UvRunOnce to user land, and when the test enters this error state, call UvRunOnce explicitly; you will see that the main thread is still able to complete its GetQueuedCompletionStatus calls and trigger the callbacks. This confirms that the IOCP is healthy and the problem is only that the background threads got suspended.

The workaround posted in https://github.com/microsoft/vscode/issues/142786#issuecomment-1050650994 helped because, although there are multiple background threads after a reload, only the main thread and the newly created background thread are associated with the new IOCP, so they continue to work based on the previously listed principles.

Based on the above results, I now maintain a single background thread across reloads into the same process, and this always works fine. I found that there was an attempt in Electron that would have achieved the same thing, https://github.com/electron/electron/pull/27582, but it was later reverted in https://github.com/electron/electron/pull/28175. The cause attributed to the hang before the revert was incorrect: the problem was not with the IOCP. Rather, with https://github.com/electron/electron/pull/27582, although we short-circuited NodeBindings::PrepareMessageLoop, we still ended up calling NodeBindings::RunMessageLoop on reload, which caused an incorrect semaphore count.

The semaphore dance works like this (https://github.com/libuv/libuv/blob/4296fec7f50145e2a307f3db7ae22984713976a7/src/win/thread.c#L309-L333):

In the main thread:
  PrepareMessageLoop creates embed_sem_ // initialized to 0; any value > 0 signals the thread waiting on this semaphore (https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-createsemaphorea)

In the background thread:
  EmbedThreadRunner waits on embed_sem_ // once this gets signalled, the count is reduced by 1, bringing it back to 0 (https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitforsingleobject)

In the main thread:
  UvRunOnce posts to embed_sem_ // increments the count by 1 and signals the waiting thread (https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-releasesemaphore)

With https://github.com/electron/electron/pull/27582 there can be a situation with active I/O tasks on reload where the semaphore has already been incremented by 1 on the main thread to signal the background thread, and another call to NodeBindings::RunMessageLoop increments it to 2. Even after the background thread decrements it in its wait call, the value is not 0, so the thread stays signalled forever inside its loop, and we end up with two threads operating on the same IOCP and things break. The fix here is simply to not call NodeBindings::RunMessageLoop along with NodeBindings::PrepareMessageLoop on reload; by maintaining a single polling thread, it fixes both the bug mentioned in this issue and https://github.com/electron/electron/issues/28154
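The broken count can be modeled with a plain integer (a simplified sketch of the description above, not Electron's actual code; the function name is made up):

```python
# Simplified counter model of embed_sem_ across a reload. Assumption: this
# mirrors the description above only; Electron's real code uses a Win32
# semaphore, and the function name here is made up for illustration.
def embed_sem_after_reload(pending_io_signal, run_message_loop_on_reload):
    sem = 0                  # embed_sem_ starts at 0
    if pending_io_signal:
        sem += 1             # UvRunOnce already posted for an in-flight I/O task
    if run_message_loop_on_reload:
        sem += 1             # the extra post from calling RunMessageLoop on reload
    sem -= 1                 # the background thread's wait decrements once
    return sem               # > 0 means the thread stays signalled forever

# Buggy case: active I/O plus the extra RunMessageLoop post on reload.
print(embed_sem_after_reload(True, True))    # 1 -> stuck signalled
# Fixed case: no RunMessageLoop call on reload.
print(embed_sem_after_reload(True, False))   # 0 -> balanced
```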

deepak1556 commented 2 years ago

I will put up a PR in Electron to further discuss this solution.

rchiodo commented 2 years ago

/verified

Definitely working for me, although I haven't repro'd this in a while.