skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.19k stars, 426 forks

Central coordination for multiple skypilot cli users #3342

Open MattCWheeler opened 3 months ago

MattCWheeler commented 3 months ago

I'm interested in using SkyPilot, but I'd like each member of our team to be able to interact with each cluster regardless of who spun it up.

**Is there any plan to support storing state (the stuff in ~/.sky/.db, etc) in some central place such as an S3 bucket or Postgres database?**

To my mind this would be akin to how Terraform is used with multiple team members: keep state files centrally (S3 is a common choice) and use locks (often kept in DynamoDB) to serialize the actions that need to be mutually exclusive.
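
To make the idea concrete, here is a minimal sketch of the "central state plus lock" pattern, assuming the shared state is a plain JSON file on a filesystem every team member can reach. The names `state_lock` and `update_state`, and the JSON layout, are hypothetical and only illustrate the pattern; this is not how SkyPilot stores its state, and a real setup on S3 or DynamoDB would need that service's own locking primitives.

```python
import contextlib
import json
import os
import time


@contextlib.contextmanager
def state_lock(lock_path, timeout=10.0, poll=0.1):
    """Hold an exclusive lock by atomically creating lock_path (hypothetical helper)."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only
            # one process at a time can create the lock file.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def update_state(state_path, lock_path, cluster, status):
    """Read-modify-write the shared state file while holding the lock."""
    with state_lock(lock_path):
        state = {}
        if os.path.exists(state_path):
            with open(state_path) as f:
                state = json.load(f)
        state[cluster] = status
        with open(state_path, "w") as f:
            json.dump(state, f)
        return state
```

A crashed process would leave the lock file behind in this sketch, which is why Terraform-style setups usually also provide a way to force-release a stale lock.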

concretevitamin commented 3 months ago

Hello @MattCWheeler, thanks for raising this. There are a few answers to this:

Happy to chat more about any of these. Feel free to ping the dev team on the Slack: https://slack.skypilot.co/

fozziethebeat commented 3 weeks ago

How crazy would it be if I were to set up a fixed server with SkyPilot and any other important configs and use that to trigger jobs within my cloud? That way, as long as all users who log in have shared access to the sky directory, they'll all end up seeing the same state of resources.

This would be a pretty simple short-term hack to approximate a client-server architecture.

concretevitamin commented 3 weeks ago

@fozziethebeat That's actually a recommended workaround for now! We've seen quite a few deployments using this pattern. You can have users log into that node (zero lift) or build a lightweight HTTP server to wrap commands.
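
For illustration, a lightweight HTTP wrapper on the shared node could look roughly like the sketch below. It uses only the Python standard library and exposes a single allowlisted command, `sky status`. The `/status` route, the `ALLOWED` table, the `build_command` and `serve` helpers, and the port are all assumptions made for this example, not part of SkyPilot, and the sketch has no authentication, so it should only listen on a trusted interface.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical allowlist mapping URL paths to argv lists.
# Only `sky status` is exposed here; nothing else can be run.
ALLOWED = {"/status": ["sky", "status"]}


def build_command(path):
    """Return a copy of the argv for an allowlisted path, or None."""
    argv = ALLOWED.get(path)
    return list(argv) if argv else None


class SkyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        argv = build_command(self.path)
        if argv is None:
            self.send_error(404, "unknown command")
            return
        # Run the CLI on the shared node, so every caller sees the
        # same local SkyPilot state.
        result = subprocess.run(argv, capture_output=True, text=True)
        body = json.dumps({
            "returncode": result.returncode,
            "stdout": result.stdout,
            "stderr": result.stderr,
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8080):
    """Start the wrapper; blocks until interrupted."""
    HTTPServer((host, port), SkyHandler).serve_forever()
```

Calling `serve()` on the shared node and then requesting `/status` would return the command's output as JSON. Mapping paths to fixed argv lists, rather than passing request input to a shell, keeps callers from running arbitrary commands.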

fozziethebeat commented 3 weeks ago

Great! I'll probably give that a shot