skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.82k stars 514 forks source link

Refactor `server.controller`: Replace Process Pool with Thread Pool for Better Performance #4137

Open andylizf opened 1 month ago

andylizf commented 1 month ago

Background

Initially, a process pool was used in server.controller to handle replica launching and termination due to limitations with redirecting sys.stdout and sys.stderr in multi-threaded environments. The global nature of sys.stdout caused output conflicts when multiple threads attempted to redirect it simultaneously.

Proposal

With a new approach to manage thread-local output, it's now feasible to use a thread pool. Each thread can independently handle its logging using thread-local storage, avoiding the previous issues of conflicting sys.stdout redirection.

Benefits

Considerations

Feedback is welcome, especially on testing plans and potential edge cases.

          If there are multiple threads being executed at the same time, they will manipulate the same `system.stdout`.
          I found a way to redirect `stdout` for individual threads, and you're absolutely right — `stdout` is indeed global. This means that if each thread tries to modify the `stdout` reference, it won't work because they all end up modifying the same global reference. 
          The solution I came up with keeps the same `stdout` reference, but uses a custom object to track which thread is responsible for the output. This allows each thread to write to its own output stream without affecting others.

Originally posted by @andylizf in https://github.com/skypilot-org/skypilot/issues/4128#issuecomment-2427961898

andylizf commented 1 month ago

@Michaelvll Can you take a look and give feedback on this approach?

cblmemo commented 1 month ago

Lets also test it out in #4128 ;) Making sure everything works fine.