microsoft / terminal

The new Windows Terminal and the original Windows console host, all in the same place!
MIT License
95.97k stars 8.35k forks source link

Conhost hangs when a lot of content is output in a single write #11794

Open j4james opened 3 years ago

j4james commented 3 years ago

Windows Terminal version

n/a

Windows build number

10.0.19041.1348

Other Software

No response

Steps to reproduce

  1. Open a cmd shell in conhost.
  2. Run the C++ example further down in the comments: link.

The Python example below does not reproduce the issue reliably on all systems.

  1. Open a WSL shell in conhost.
  2. Execute the Python script below.
import math
import os
import sys

size = os.popen('stty size', 'r').read().split()
h,w = tuple(int(n) for n in size)
mx = w//2
my = h//2 

def frame(i):
  s = '\033[H'
  for y in range(h):
    for x in range(w):
      dy,dx = y-my,x-mx
      a = math.atan2(dy*2,dx) + math.pi
      c = (int(a/math.pi*127)+i)%256
      s += '\033[38;2;%d;%d;%dm*' % (c,c,c)
  return s

sys.stdout.write('Generating content...\n')
s = '\033[?25l'
for i in range(512):
  s += frame(i)
s += '\033[?25h\033[m\033[H'

sys.stdout.write('Starting animation...\n')
sys.stdout.write(s)

This constructs a little VT animation which it outputs with a single write call.

Expected Behavior

You should see a rotating pattern, a bit like a radar scan. This is what it looks like in MinTTY (using wsltty):

https://user-images.githubusercontent.com/4181424/142741894-fb9637da-6046-4c7b-b4cb-6e0e2efc583f.mp4

You'll see the same sort of thing in XTerm, VTE, Alacritty, Kitty, etc.

Actual Behavior

Conhost displays the first couple of lines of the first frame (if anything), then hangs for a couple of seconds, and eventually just shows the final frame.

Windows Terminal is a little better, in that it doesn't hang, but you still don't see the animation - just the final frame.

I believe the problem is that the buffer is locked for the entire time that the VT parser is processing a write operation, and the render thread can't render anything when it can't obtain the lock. If the write buffer is particularly large (as is the case here), then the renderer may be blocked for quite a long time.

j4james commented 3 years ago

This may seem like a contrived example, and it is, but it's based on a real issue I had when testing my sixel branch with the sixel-bench benchmark. Much like the above animation, it caused the console to hang for a couple of seconds, and eventually just displayed the final frame of the video.

I suppose sixel-bench is really a contrived example too, so not an urgent problem (especially since we don't have sixel yet), but this definitely does feel like a bug - I don't think it should be possible to hang the console like that.

elsaco commented 3 years ago

@j4james this issue could be Windows version specific. You're using Windows 10. On version 10.0.2200 (Windows 11) after setting Windows Console Host as the default terminal application in SUI, and WT restart, this is the output:

https://user-images.githubusercontent.com/3933920/142770292-84d4fdd3-45d9-4fb5-99be-7a8cbc723e6b.mp4

WT version 1.11.2921.0. Same result when WT is the default terminal application. This post is for reference only!

j4james commented 3 years ago

Huh. That's weird. I wouldn't expect the Windows version to have any effect, but I wonder if it has something to do with the version of WSL you're using (I'm on v1). This test relies on Python passing the write call through to conhost as a single output operation. I've tested both Python 2 and 3, so the Python version shouldn't be a factor, but it's possible the WSL version may affect things.

I'll try and rewrite the test case in C, with a direct call to WriteConsole, which should guarantee the buffer won't be split up by some intermediate code.

Thanks for attempting to replicate the issue though.

j4james commented 3 years ago

Here's the equivalent test case rewritten in C++ with a WriteConsole call. But it'll now need to be run from a Windows shell like cmd or PowerShell.

#include <windows.h>
#include <string>
#include <cmath>

void main()
{
  HANDLE handle = GetStdHandle(STD_OUTPUT_HANDLE);

  DWORD mode;
  GetConsoleMode(handle, &mode);
  mode |= ENABLE_VIRTUAL_TERMINAL_PROCESSING;
  SetConsoleMode(handle, mode);

  CONSOLE_SCREEN_BUFFER_INFOEX csbie;
  csbie.cbSize = sizeof(csbie);
  GetConsoleScreenBufferInfoEx(handle, &csbie);

  const auto w = csbie.dwSize.X;
  const auto h = csbie.srWindow.Bottom - csbie.srWindow.Top + 1;
  const auto mx = w/2;
  const auto my = h/2;
  const auto pi = 3.14159265358979323846;

  printf("Generating content...\n");
  std::string s = "\033[?25l";
  for (auto i = 0; i < 512; i++) {
    s += "\033[H";
    for (auto y = 0; y < h; y++) {
      for (auto x = 0; x < w; x++) {
        auto dy = y-my;
        auto dx = x-mx;
        auto a = std::atan2(dy*2,dx) + pi;
        auto c = std::to_string((int(a/pi*127)+i)%256);
        s += "\033[38;2;"+c+";"+c+";"+c+"m*";
      }
    }
  }
  s += "\033[?25h\033[m\033[H";

  printf("Starting animation...\n");
  DWORD written = 0;
  WriteConsoleA(handle, s.c_str(), s.length(), &written, NULL);
}
elsaco commented 3 years ago

@j4james this is the output with the exe version of rotate in 1st tab, and python/wsl in 2nd tab:

https://user-images.githubusercontent.com/3933920/142777127-03c87210-451f-48e9-b9f8-8bd59653d9a9.mp4

Changing default terminal application setting did not make any difference in any tab.

elsaco commented 3 years ago

@j4james when running the python script in pwsh, it works but rendering seems slower compared to wsl test:

https://user-images.githubusercontent.com/3933920/142780410-9fd0496c-7775-4455-8f77-b6b2ae83b0cc.mp4

The same size = ['30', '120'] was used as in the wsl script.

j4james commented 3 years ago

I should have mentioned I didn't expect to see the problem on the Windows version of Python, because I know it handles I/O differently to Linux, and the output gets split into chunks. But judging from your results, it looks like the same thing can apply to Linux.

At least the C++ version confirms the problem, although your system is clearly a lot faster than mine. so you're not seeing much of a delay before it renders the final frame.

elsaco commented 3 years ago

@j4james what is the difference between printf and writeconsole? This is what I've changed after line 38:

  printf("Starting animation...\n");
  printf("%s", s.c_str());

and this is the output:

https://user-images.githubusercontent.com/3933920/142925001-9f36aa29-3356-4642-9323-994fe58a86d9.mp4

zadjii-msft commented 3 years ago

You know, I think @lhecker actually fixed this recently, but I'd assume the fix isn't in 10.0.19041.1348

zadjii-msft commented 3 years ago

gh-11794

lhecker commented 3 years ago

Yeah I had regular freezes in WT of up to 30 seconds, when running the application at high frame rates.

Since it was a heisenbug and only occurred when you weren't observing it (with a debugger) I could never actually proof the reason for the issue. But I believe I could determine it regardless: Locks like SRWLOCK are unfair locks and under certain circumstances it can prevent threads from progressing for a long time.

Based on this theory I changed the buffer lock in WT into a til::ticket_lock and the issue hasn't occurred for me since. Integrating til::ticket_lock into conhost could be slightly more difficult as it uses critical sections and might rely on its support for recursive locking.

j4james commented 3 years ago

You need to test with the C++ example - the Python version doesn't seem to work for everyone (I'm guessing because you're all on a later version of WSL to me). I can reproduce this with a recent version of conhost built from source and a recent version of Windows Terminal built from source.

j4james commented 3 years ago

what is the difference between printf and writeconsole?

@elsaco printf is going to build up the output string in a temporary buffer, which it will periodically flush when full. That results in a bunch of small WriteConsole calls instead of one big one, which defeats the purpose of the test.

j4james commented 3 years ago

Btw, this is what it looks like for me in OpenConsole and Windows Terminal, both built from commit a68b0d4f08e1e8ffd87d73776ed83d45fec16042.

https://user-images.githubusercontent.com/4181424/142938681-57f0cace-f78e-4767-9955-6d67c7232970.mp4

elsaco commented 3 years ago

@j4james if the same rotate.exe is run from a WSL prompt the delay is clearly visible. Also, please notice how the last frame looks. There's about 4-5 sec between Starting animation... and end of test:

https://user-images.githubusercontent.com/3933920/142979310-4bc162e6-6e3a-423b-9c53-0a82cc80f849.mp4

lhecker commented 3 years ago

After moving LockConsole / UnlockConsole into WriteConsoleWImplHelper and limiting the chunk size passed to DoWriteConsole to 2kB:

https://user-images.githubusercontent.com/2256941/143775617-ba4cd95c-579f-4ccc-accb-6b31e98beeef.mp4

After additionally replacing the _csConsoleLock critical section with:

static thread_local ULONG recursionCount = 0;
static til::ticket_lock lock;

...and increasing the chunk size to a more reasonable 128kB (since the ticket lock is now being "fair"):

https://user-images.githubusercontent.com/2256941/143775679-60d160de-9089-44ab-8b93-648661b2c678.mp4

Unfortunately however, chunking the input is likely not a viable option, as some applications rely on the implicit promise that a single write() with VT is processed synchronously. For instance:

https://user-images.githubusercontent.com/2256941/143776195-fc721a4b-816e-4f5c-a1ad-50a210982a85.mp4

j4james commented 3 years ago

Unfortunately however, chunking the input is likely not a viable option, as some applications rely on the implicit promise that a single write() with VT is processed synchronously.

That's a good point. So maybe this is just the caller's responsibility then, and the issue could be closed as "won't-fix". My main concern was the failure of sixel-bench, but it looks like I'm the only own that can reproduce the issue in Python (which is what sixel-bench is using), so it's likely that could just be resolved by an update of Windows or WSL.

lhecker commented 3 years ago

@j4james But at least one improvement will result out of this: As you can see in my video above, even 2kB chunks of data can cause the output to lag in conhost (and ConPTY) if an application is writing VT sufficiently fast. By switching to a fair mutex (til::ticket_lock) we can improve the user experience. I'm going to submit a PR for that soon. 🙂 The performance hit of that seems to be less than 1.5% in my benchmarks.

j4james commented 3 years ago

By switching to a fair mutex (til::ticket_lock)

Just be aware that there might be a few places in conhost where the lock is acquired recursively (I'm fairly certain I've come across some in the past). I think I remember you saying that the ticket_lock wasn't re-entrant, which I assume would be problematic.

I don't think it would be that difficult to refactor the code to avoid recursive locks, but it might be a bit of work trying to track down all the cases that need to be dealt with.

lhecker commented 3 years ago

I don't believe it's all too difficult to make a mutex save for reentrancy. My approach is basically:

static thread_local ULONG recursionCount = 0;
static til::ticket_lock lock;

void LockConsole() {
    const auto rc = ++recursionCount;
    if (rc == 1) {
        lock.lock();
    }
}

void UnlockConsole()
{
    const auto rc = --recursionCount;
    if (rc == 0) {
        lock.unlock();
    }
}

It's missing safety-checks for over- and underflow, but otherwise it works fine.

lhecker commented 3 months ago

I just randomly came across this issue while searching for sixels of all things. I think we can close this now, in particular with the new passthrough support. BTW the animation from the C++ snippet now works perfect (again?)!

j4james commented 3 months ago

@lhecker This still hangs for me in conhost, which it what the issue was originally about. And while it's improved in Windows Terminal, it still has a significant delay before you see anything (assumedly because conhost has to process the buffer first). This is most obvious when running the sixel-bench test.