jupyter / terminado

Terminals served by tornado websockets
http://terminado.readthedocs.org/en/latest/
BSD 2-Clause "Simplified" License

Terminado hangs on stdin messages over the PTY buffer size for the server OS #183

Closed: KoopaKing closed this issue 1 year ago

KoopaKing commented 2 years ago

We are dealing with an issue where large copy/pastes into Terminado (through open terminals created in the Jupyter UI) cause the Jupyter server to hang.

We diagnosed the issue, and it appears that Terminado calls PtyProcess.write with the entirety of each stdin message, which for copy/pasted text is the entire pasted text. PtyProcess uses blocking writes to a PTY file descriptor, so when the size of a write exceeds the maximum PTY buffer size (which I believe is set at the OS level), the write blocks indefinitely. Because this happens inside the Tornado event loop, it hangs the server.
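A minimal reproduction sketch of that blocking behavior, assuming the `ptyprocess` package and using `cat` as a stand-in child process (both choices are illustrative, not terminado's actual code path); running it is expected to stall at the final write:

```python
from ptyprocess import PtyProcessUnicode

# Spawn a child on a PTY; `cat` simply echoes stdin back, which is enough to
# exercise the buffers in both directions.
proc = PtyProcessUnicode.spawn(["cat"])

# Far larger than a typical PTY buffer (often on the order of 64 KiB).
big_paste = "x" * (1 << 20)

# PtyProcess.write() is a plain blocking write to the PTY master fd. Once the
# kernel buffer is full, this call stalls until the child drains it -- and if
# it is made on Tornado's event loop thread, the whole server hangs with it.
proc.write(big_paste)
```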

It's not clear to me if this should be considered a bug in Terminado or PtyProcess. I don't see PtyProcess offering a non-blocking write mechanism, so I'm not sure what the best immediate solution is. It seems like PtyProcess could offer non-blocking writes which could be leveraged by Terminado, or Terminado could split input into some reasonably small chunks before passing them through.
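As a sketch of the second option (chunked writes on the server side), something like the helper below could be used; the function name and chunk size are placeholders, not terminado API, and as the later comments in this thread show, chunking alone turned out not to be sufficient:

```python
# Hypothetical helper: break one large stdin payload into smaller writes so
# that no single call exceeds the PTY buffer size. (Sketch only.)
def write_in_chunks(ptyproc, data, chunk_size=1024):
    for start in range(0, len(data), chunk_size):
        ptyproc.write(data[start:start + chunk_size])
```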

KoopaKing commented 2 years ago

I suppose another option is to split the data in a stdin message into multiple messages when it exceeds a certain size. This would likely prevent other issues associated with extremely large websocket messages, but it would not address the underlying server issue, which could still be triggered by specially crafted connections from someone attempting a denial of service against a Jupyter server they have access to. That is probably not a real concern, though, since such an attacker already has arbitrary code execution on the server host through Terminado itself.

KoopaKing commented 2 years ago

I've continued investigating this a bit. The problem, while triggered when the PTY buffer size is exceeded, is not caused simply by Terminado writing everything as a single message: I created a branch that splits the writes on the server side and still ran into the issue.

This seems to indicate that the subprocess spawned by PtyProcess isn't reading from, and thereby draining, the PTY buffer being written to.

blink1073 commented 2 years ago

Thanks for digging into this @KoopaKing! I'm happy to review any PRs you open here or in PtyProcess. I don't have bandwidth to dive into the problem myself at the moment.

KoopaKing commented 2 years ago

Sounds good, I actually just found https://github.com/jupyterlab/jupyterlab/issues/12952 which seems to be the same issue.

Currently testing a frontend fix in terminado.js by chunking the messages.

KoopaKing commented 2 years ago

So I verified that frontend chunking does not work. It seems we need to do something to tell the OS to deliver the data sitting in the buffer. Still researching this, but I think that will either 1) happen automatically if we ensure chunking breaks on newlines, or 2) happen manually if we send the VEOL special character telling the OS to flush the buffer, although I'm still trying to figure out when that is legal (it may need to be done in conjunction with point 1).
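For reference, the terminal's special characters (including the VEOF/VEOL mentioned above) can be inspected through the standard `termios` module; this is only a sketch of how to look them up, not a claim that sending them resolves the hang:

```python
import termios

def control_chars(fd):
    """Return the EOF/EOL control characters configured on a terminal fd."""
    attrs = termios.tcgetattr(fd)
    cc = attrs[6]  # the c_cc control-character array
    return {"VEOF": cc[termios.VEOF], "VEOL": cc[termios.VEOL]}
```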

KoopaKing commented 2 years ago

So I have a working fix for this. After all that, the issue is resolved by moving the stdin write to the PtyProcess onto its own thread; the working fix uses Tornado coroutines to do so.

It seems the PTY buffer that was filling up was sometimes the stdin write buffer and sometimes the stdout read buffer, both of which are serviced by the same single-threaded Tornado event loop. When a sufficiently large blocking write from Terminado to the subprocess coincides with the subprocess sending data back to Terminado, buffers are filling in both directions while Terminado uses the same event loop thread to both write and read, so neither side can make progress.
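A minimal sketch of that idea (an assumption about the shape of the fix, not the actual PR): run the blocking PtyProcess.write() in a worker thread via the event loop's executor, so the loop stays free to keep reading the child's output.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# A single worker thread dedicated to blocking PTY writes (name is illustrative).
_stdin_executor = ThreadPoolExecutor(max_workers=1)

async def write_stdin(ptyproc, data):
    loop = asyncio.get_running_loop()
    # The blocking write happens off the event loop thread; while we await it,
    # the loop can keep servicing reads from the PTY, letting the child drain
    # its output and, in turn, drain the input buffer we are writing into.
    await loop.run_in_executor(_stdin_executor, ptyproc.write, data)
```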

This whole thing seems highly related to #162

Trying to clean up the fix here soon, but I'm not a Python programmer, so it might be pretty rough.

KoopaKing commented 1 year ago

Sorry about ghosting this for a while; I was unexpectedly busy, and figuring out the right testing mechanism evaded me for a bit (again, I'm not normally a Python programmer).

I've created a PR that should fix this issue, along with a regression test. Feel free to let me know if there's anything you'd like changed about the approach.

KoopaKing commented 1 year ago

Fixed by #189