skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.47k stars 460 forks

[clouds] Support Flux Framework as a backend #3751

Open vsoch opened 1 month ago

vsoch commented 1 month ago

Hi! I'm looking at this guide and trying to figure out where Flux fits. It could traditionally be thought of as an HPC scheduler, but we also have deployments for AWS and Google Cloud that go to raw VMs (e.g., with Terraform). I was looking at this issue https://github.com/skypilot-org/skypilot/issues/3072 and it seems to suggest a good starting point: running under Kubernetes. We already have the Flux Operator (I am the lead developer and could do development here) that deploys an entire Flux cluster on Kubernetes.

Given these different contexts, I'd like to have a quick discussion about which path to pursue first, and, once we pick one, get some starting tips from you. I'm guessing that if this is to be something under Kubernetes, it wouldn't be considered a new cloud, but some kind of entity that runs under any cloud that supports Kubernetes. Was the start of the Slurm work made public anywhere? That could be a good guide for the overall structure. I can guess where Slurm might have run into issues.

Thanks for the tips and looking forward to discussion!

romilbhardwaj commented 1 month ago

Hi @vsoch - welcome to SkyPilot! We would love to see community support for Flux.

From my quick read of Flux, looks like the best way forward here would be to natively support Flux as a "cloud" which can provision resources and execute SkyPilot tasks. Some design may be required to figure out how to work with tasks with different setup running on the same cluster (run in containers?) and interactive components in SkyPilot (logs, ssh setup, etc.).

Looking at the flux k8s operator, seems like it might not fit directly with SkyPilot's Kubernetes integration since it relies on the MiniCluster CRD, whereas the SkyPilot integration directly manipulates pods and services. Flux users running on Kubernetes may still be able to use SkyPilot through the "cloud" implementation above?

vsoch commented 1 month ago

@romilbhardwaj gotcha - so would the assumption be that Flux is already running somewhere, and then SkyPilot would submit jobs to it? Would SkyPilot also be deploying the Flux cluster? And if so - what methods are typically used to do that (we have been using a combination of operators, cloud SDKs, and Terraform)? Some of those require custom base image builds (ours are currently private and would need to be made public).

romilbhardwaj commented 1 month ago

@vsoch - yes, exactly. Flux is already provisioned and running, and SkyPilot deploys "tasks" to Flux (flux exec?). This would probably involve invoking the flux python API (if any), or using the flux CLI in the new provisioner that we write for SkyPilot.
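As a rough illustration of the CLI route, here is a minimal sketch that builds a `flux submit` command line for a task. The helper name `build_flux_submit_argv` and its parameters are hypothetical and not part of SkyPilot or Flux; the sketch assumes a flux-core recent enough to provide `flux submit` with `--nodes` and `--ntasks` (older releases spell this `flux mini submit`).

```python
from typing import List, Optional


def build_flux_submit_argv(command: str,
                           num_nodes: int = 1,
                           num_tasks: Optional[int] = None) -> List[str]:
    """Build an argv list that submits `command` to a running Flux instance.

    Hypothetical helper: it only constructs the command line and does not
    invoke Flux itself.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    argv = ["flux", "submit", f"--nodes={num_nodes}"]
    if num_tasks is not None:
        argv.append(f"--ntasks={num_tasks}")
    # Run the task through a shell so multi-statement run commands work.
    argv += ["bash", "-c", command]
    return argv
```

The resulting list could be passed to `subprocess.run(argv, check=True, capture_output=True, text=True)` on a node where the `flux` CLI can reach the instance; `flux submit` prints the job ID on success.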

Not sure if this is relevant, but for custom images in private we could do something like what we do for Kubernetes.

vsoch commented 1 month ago

Gotcha. So should I first write a provisioner for Flux for SkyPilot? Or should I assume Flux is already provisioned and start with executing tasks on it? (I am thinking both would be desired.)

romilbhardwaj commented 1 month ago

I think the first step here would be to implement the Flux Provisioner following the guide. Implementing the bootstrap_instances, run_instances and terminate_instances methods should be a good start.

vsoch commented 1 month ago

Hey @romilbhardwaj! I'm pretty far along and working on getting ssh to work. It's saying there is an invalid SSH identification string, which seems to be Debian-specific:

$ ssh -tt -i /home/vanessa/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W 10.244.0.67:22 sky@localhost -o ProxyCommand='/home/vanessa/.sky/kubernetes-port-forward-proxy-command.sh sky-84b0-vanessa-7a57-0-gkrng'
Warning: Permanently added 'localhost' (ED25519) to the list of known hosts.
SSH-2.0-OpenSSH_8.4p1 Debian-5+deb11u3

Invalid SSH identification string.

Higher level question - why are we setting up ssh for kubernetes (and flux) when we can use kubectl exec out of the box?

romilbhardwaj commented 1 month ago

Hmm, which container image are you using? Trying to see if I can replicate this. If it helps, here's how we install and set up SSH on pods:

https://github.com/skypilot-org/skypilot/blob/465d36cabd6771d349af0efc3036ca7070d2d0fb/sky/provision/kubernetes/instance.py#L327-L355

Reason for installing SSH is to keep the developer experience on SkyPilot consistent across clouds and Kubernetes. With SSH our users can easily hook up tooling reliant on SSH (e.g., VSCode remote, rsync) with their SkyPilot clusters.

vsoch commented 1 month ago

My question is more general - if we have a Kubernetes cluster (that could use either a service or kubectl), why is ssh needed at all? As long as we provide the same abstractions (the skycloud ssh) on top of "something else that works like ssh", I think the setup would work OK.

> Hmm, which container image are you using? Trying to see if I can replicate this. If it helps, here's how we install and setup SSH on pods:

I am running all of these steps to get the same setup. It's just that when I get to the wait_for_ssh function (and manually test the generated command), it seems to want to work but spits out that error message. I am using the same SkyPilot image you would provide to Kubernetes, which has an Ubuntu or Debian base (I see apt-get in there).

romilbhardwaj commented 1 month ago

SkyPilot itself does not need SSH (e.g., for Kubernetes we use kubectl to send control signals to the pods we provision, see KubernetesCommandRunner).
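For readers unfamiliar with the kubectl path, the sketch below shows roughly what sending a command to a pod without SSH looks like. It is not the KubernetesCommandRunner implementation; `build_kubectl_exec_argv` is a hypothetical helper that only assembles a standard `kubectl exec` invocation.

```python
from typing import List


def build_kubectl_exec_argv(pod_name: str,
                            namespace: str,
                            command: str) -> List[str]:
    """Build an argv list that runs `command` inside a pod via kubectl.

    Hypothetical helper for illustration; it does not call kubectl.
    """
    return [
        "kubectl", "exec",
        "-n", namespace,
        pod_name,
        "--",  # everything after this is executed inside the container
        "/bin/bash", "-c", command,
    ]
```

Passing the list to `subprocess.run` would execute the command in the pod's default container, assuming the image ships `/bin/bash`.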

In the Kubernetes case, we still provide SSH as a nice-to-have for a better developer experience. We could implement an SSH-like interface that can be invoked through the CLI, but that would have limited compatibility with other dev tooling (e.g., VSCode remote uses SSH targets, port forwarding runs over SSH, etc.).

For Flux, if SSH is hard to get to work, we could perhaps do without it for now. You could special-case Flux in wait_for_ssh by querying cluster_info.provider_name and substituting the SSH check with some other readiness check.
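A minimal sketch of that special case follows. Only `provider_name` is taken from the comment above; the `ClusterInfo` stand-in, the `wait_until_ready` name, the `"flux"` provider string, and the callable-based checks are assumptions, and the real wait_for_ssh has a different signature.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ClusterInfo:
    """Hypothetical stand-in for the object passed to wait_for_ssh."""
    provider_name: str


def wait_until_ready(cluster_info: ClusterInfo,
                     ssh_check: Callable[[], bool],
                     flux_check: Callable[[], bool],
                     retries: int = 5,
                     delay_seconds: float = 1.0) -> bool:
    """Poll a provider-specific readiness check until it passes.

    Flux clusters skip the SSH probe and use `flux_check` instead; every
    other provider keeps the SSH-based check.
    """
    check = flux_check if cluster_info.provider_name == "flux" else ssh_check
    for attempt in range(retries):
        if check():
            return True
        if attempt < retries - 1:
            time.sleep(delay_seconds)
    return False
```

A Flux readiness check could be anything that proves the instance accepts work, for example a command run through `kubectl exec` that exits successfully.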

vsoch commented 1 month ago

That's essentially what I've done - it seems like it's an issue with the ssh install in the container. I think likely there is a fix, and I'll open a PR soon or reference a diff so we can at least discuss the design that I have!

vsoch commented 1 month ago

@romilbhardwaj I opened a PR #3777 with some things to discuss! Let me know the best venue for this - I am thinking a combination of the PR discussion plus (if it's allowed) a small amount of time for the next batch-wg meeting, e.g., if we want to talk more about the idea to generalize a "Kubernetes Cloud." I ask now because I'd want to put it on the agenda.

romilbhardwaj commented 1 month ago

Thanks @vsoch - I'll take a look at the PR soon. To start let's discuss the design and implementation on the PR thread. We can then schedule some time in batch-wg (if it's relevant to their agenda) to talk more.