Better reconnection for nonvnc on challenge restart

zardus commented 1 month ago

We generate a new VNC password when we restart a challenge, and as a result, novnc can't automatically reconnect across challenge starts. This annoyance will get more noticeable as we streamline the UX (e.g., https://github.com/pwncollege/dojo/pull/473). We should probably patch novnc to refresh the page when a reconnection fails (which will also fix the DOS issue).

ConnorNelson commented 1 month ago

This should have already fixed DOS: https://github.com/pwncollege/dojo/blob/89dbadaea108ef4da75006067f8cfde5e7acbbcb/workspace/services/desktop.nix#L7-L18

I think what you are proposing does not make sense as a "noVNC patch". Instead what we should do is put javascript on the top level page that decides where the iframe will point at. Somehow, we need the ability for that javascript to be informed that it should change the iframe location.

For example:

I receive an event that a new workspace has started and initialized; this JS knows the user is interested in the Desktop (because we are on /workspace/desktop)
Start a sweet spinning "loading" animation
Tell the backend to start the desktop service
When that has finished, kill the "loading" animation and set the iframe location

We should do the same thing for all top-level /workspace/* routes.

The tricky thing is receiving that event. We could for instance poll every 30s to see if I have a new workspace (e.g. container id changed and is in an initialized state). But that's gonna be a decent amount of increases requests and also be slow to do what we want: we want the new desktop to switch over as fast as possible. What we probably need to figure out is websockets or SSE. The critical thing is making sure this plays nice by not holding an available flask worker hostage and not holding db thread connections hostage. I believe CTFd actually has some support for this already that we've disabled in the docker-compose: https://github.com/CTFd/CTFd/blob/bbe1fce4f27a290056614241d74d821625aef602/CTFd/api/v1/notifications.py#L151

spencerpogo commented 1 month ago

I would lean towards SSE as it has way less overhead: it's a normal HTTP request that stays open and streams a text-based protocol. Websocket requires an upgrade handshake between the client and server and every message has to be XORed.

I took a look at the CTFd notification logic. A couple important files are the route the clients connect to and the events_manager classes. Here's my understanding of how it works. Redis has a built in pub/sub mechanism. The event manager's job is to spawn a gevent task (basically a thread) to subscribe to redis. When an HTTP request comes through to the SSE endpoint, it returns a generator which is how flask streams stuff. The generator hooks into the event manager, encodes the events in the SSE format, and forwards them to the client.

It seems like this would lock up a flask worker, but I'd have to test to be sure. Maybe the only solution is to add more flask workers.

ConnorNelson commented 1 month ago

I think I also lean towards SSE, especially since we don't have a need for bidirectional communication, and probably even more importantly, CTFd already has logic for this!

I see a gevent.spawn(_listen) here: https://github.com/CTFd/CTFd/blob/8ef0cdd916eab4226956d9d2d70adeccdc89f441/CTFd/utils/events/__init__.py#L100

My surface level understanding is that CTFd is using gevent by default, and we continue to use this in the DOJO, a "lightweight thread", as you mention. We specify 8 workers, but each worker actually also spins up real (linux) threads:

root@ctfd:/opt/CTFd# ps aux | grep gunicorn | grep -v grep | wc -l
9
root@ctfd:/opt/CTFd# ps aux -T | grep gunicorn | grep -v grep | wc -l
57

I think that 9th process is a a supervisor process that manages the 8 workers, maybe? The number of threads fluctuates a bit. My best guess would be that these gevent threads are backed by real threads to some extent; and that regardless by using spawn, we aren't completely taking over the worker's ability to process more requests, but this should probably be researched/tested to confirm.

spencerpogo commented 1 month ago

listen is only called once in init_events. This single lightweight thread that is spawned is responsible for forwarding messages from redis to the entries of the clients dict. The clients dict entries are populated by the generator returned by the request handler.

The question is where flask iterates the generator from.

ConnorNelson commented 1 month ago

I think I see what you're saying.

I think it should be iterating here in the /events endpoint: https://github.com/CTFd/CTFd/blob/8ef0cdd916eab4226956d9d2d70adeccdc89f441/CTFd/events/__init__.py#L16

Which pulls from: https://github.com/CTFd/CTFd/blob/8ef0cdd916eab4226956d9d2d70adeccdc89f441/CTFd/utils/events/__init__.py#L111-L112

I think gevent should be taking care of making sure the worker process is context switching when it gets blocked on that .get().

I can see that gevent is willing to monkey patch SimpleQueues (https://www.gevent.org/api/gevent.monkey.html#gevent.monkey.patch_queue), but this is a regular Queue. Not sure how it allows that to function, but I can't imagine something so core isn't supported, it must be context switching somehow (maybe simply with the assistance of regular threads).

ConnorNelson commented 1 month ago

Just throwing it out there, but I think this idea will also have nice implications for #426, we can get some sweet interfacing if we can cleanly stream solves/awards/etc.

JensHouses commented 1 month ago

we could also use jquery: working: $("iframe").contents().find("#noVNC_status").attr("class") -> "noVNC_status_normal" vs not working: $("iframe").contents().find("#noVNC_status").attr("class") -> "noVNC_status_error noVNC_open"

we could check every 2 seconds on the client if the status is "noVNC_status_error noVNC_open" and if so trigger a reload of the page or get the new iframe src from a specific endpoint.

pwncollege / dojo

Better reconnection for nonvnc on challenge restart #478