shell-pool / shpool

Think tmux, then aim... lower
Apache License 2.0
1.15k stars 20 forks source link

`shpool list` hangs #17

Open ethanpailes opened 4 months ago

ethanpailes commented 4 months ago

I'm moving an issue filed on the google internal bug tracker here. The problem seems to be that shpool list can sometimes hang. Reproduction steps are unknown, but we do have a log file (attached).

shpool.log

ethanpailes commented 4 months ago

I think the relevant bit is

May 13 10:17:51 tower.lon.corp.google.com shpool[2894224]: 2024-05-13T09:17:51.286942Z  INFO ThreadId(01) run:serve: socket got a new connection
May 13 10:17:51 tower.lon.corp.google.com shpool[2894224]: 2024-05-13T09:17:51.287057Z  INFO ThreadId(371) handle_conn{cid=212}: new
May 13 10:17:51 tower.lon.corp.google.com shpool[2894224]: 2024-05-13T09:17:51.287077Z  INFO ThreadId(371) handle_conn{cid=212}:parse_connect_header: new
May 13 10:17:51 tower.lon.corp.google.com shpool[2894224]: 2024-05-13T09:17:51.287108Z  INFO ThreadId(371) handle_conn{cid=212}:parse_connect_header: close time.busy=24.8µs time.idle=5.71µs
May 13 10:17:51 tower.lon.corp.google.com shpool[2894224]: 2024-05-13T09:17:51.287143Z  INFO ThreadId(371) handle_conn{cid=212}:handle_session_message: new
May 13 10:17:51 tower.lon.corp.google.com shpool[2894224]: 2024-05-13T09:17:51.393112Z  INFO ThreadId(01) run:serve: socket got a new connection
May 13 10:17:51 tower.lon.corp.google.com shpool[2894224]: 2024-05-13T09:17:51.393218Z  INFO ThreadId(372) handle_conn{cid=213}: new
May 13 10:17:51 tower.lon.corp.google.com shpool[2894224]: 2024-05-13T09:17:51.393241Z  INFO ThreadId(372) handle_conn{cid=213}:parse_connect_header: new
May 13 10:17:51 tower.lon.corp.google.com shpool[2894224]: 2024-05-13T09:17:51.393254Z  INFO ThreadId(372) handle_conn{cid=213}:parse_connect_header: close time.busy=8.01µs time.idle=5.38µs
May 13 10:17:51 tower.lon.corp.google.com shpool[2894224]: 2024-05-13T09:17:51.393289Z  INFO ThreadId(372) handle_conn{cid=213}:handle_list: new

All the previous shpool list commands completed as normal, and from this point on in the log they start hanging. shpool list needs to take a lock on the global sessions table, and there are some other handlers which also grab that lock. One of them is the handle_session_message handler, which is an internal plumbing handler that handles out-of-band detach as well as resize messages.

Both of those session messages are handled by talking to the shell->client reader thread via some crossbeam channels, so likely what is happening is that we are taking the lock on the shells table then blocking on a channel send or receive with that lock held. This will pretty much entirely lock up the daemon.