ray-project / kuberay

A toolkit to run Ray applications on Kubernetes

[Feature] Ray remote kernels #300

Open Nintorac opened 2 years ago

Nintorac commented 2 years ago

Search before asking

Description

Allow kuberay.ray.io/v1alpha1.Notebook instances to launch remote kernels that run in Ray itself, allowing resource requests to be specified in the usual Ray way.
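For context, "the usual Ray way" here means declaring resources on the task or actor itself. A minimal, generic Ray example (not a KubeRay API, and the `Kernel` actor name is purely illustrative):

```python
import ray

# Resources are declared on the task/actor and the scheduler (or autoscaler)
# finds or provisions a node that can satisfy them.
@ray.remote(num_cpus=4, num_gpus=1)
class Kernel:
    def execute(self, code: str) -> str:
        ...
```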

Use case

I want to make as efficient use of compute resources as possible. In our current situation using SageMaker notebooks there is a lot of waste: even if all kernels are shut down, the instance keeps consuming valuable resources that could be better utilized.

Conversely, if the instance is shut down and the user wants access to resources again, the warm-up time can be upwards of minutes, which wastes expensive dev time.

Ideally I could start a remote kernel and have it running instantly (preempting batch jobs if necessary), and as soon as I close it the resources are returned to the cluster.

There should be strong isolation guarantees between different users.

Related issues

#103

Are you willing to submit a PR?

Nintorac commented 2 years ago

Enabling remote kernels looks fairly straightforward. I think it should be possible to use a Ray remote actor to proxy the necessary ports from the Ray client to the notebook host.
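As a rough sketch of that idea (nothing here is an existing KubeRay or Ray API; `PortProxy` and its arguments are made up), one such actor could forward a single TCP port from the node it is scheduled on, with one forwarder per ZMQ channel the kernel exposes (shell, iopub, stdin, control, heartbeat):

```python
import asyncio

import ray


@ray.remote
class PortProxy:
    """Forwards one TCP port on the node this actor runs on to a target host/port."""

    def __init__(self, listen_port: int, target_host: str, target_port: int):
        self.listen_port = listen_port
        self.target_host = target_host
        self.target_port = target_port

    async def serve(self):
        async def pipe(src, dst):
            # Copy bytes in one direction until the source closes.
            try:
                while data := await src.read(4096):
                    dst.write(data)
                    await dst.drain()
            finally:
                dst.close()

        async def handle(reader, writer):
            remote_reader, remote_writer = await asyncio.open_connection(
                self.target_host, self.target_port)
            await asyncio.gather(pipe(reader, remote_writer),
                                 pipe(remote_reader, writer))

        server = await asyncio.start_server(handle, "0.0.0.0", self.listen_port)
        async with server:
            await server.serve_forever()
```

Usage would be roughly one actor per channel: create the handle with `PortProxy.remote(local_port, kernel_ip, kernel_port)` and then call `serve.remote()` on it.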

Initiating the Ray proxy will then require wrapping the IPython.kernel call in kernel.json and generating the connection file on the fly; resource requests can then just be part of the args for the IPython.kernel wrapper (a rough sketch of such a wrapper is below the kernel-spec link).

https://ipython.org/ipython-doc/dev/development/kernels.html#kernel-specs
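A very rough sketch of what that wrapper could look like (the `RemoteKernel` actor, the CLI flags, and the module layout are all made up for illustration): a custom kernel.json would point its `argv` at this wrapper, passing the resource args, instead of launching `ipykernel_launcher` directly.

```python
import argparse
import json
import os
import subprocess
import tempfile
import time

import ray


@ray.remote
class RemoteKernel:
    """Runs an IPython kernel on whichever Ray node satisfies the resource request."""

    def start(self):
        conn_file = os.path.join(tempfile.mkdtemp(), "kernel.json")
        self.proc = subprocess.Popen(
            ["python", "-m", "ipykernel_launcher", "-f", conn_file])
        # Naive wait for the kernel to write its connection info.
        while True:
            try:
                with open(conn_file) as f:
                    return json.load(f)
            except (FileNotFoundError, json.JSONDecodeError):
                time.sleep(0.2)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", "--connection-file", required=True)
    parser.add_argument("--num-cpus", type=float, default=1)
    parser.add_argument("--num-gpus", type=float, default=0)
    args = parser.parse_args()

    ray.init(address="auto")
    kernel = RemoteKernel.options(
        num_cpus=args.num_cpus, num_gpus=args.num_gpus).remote()
    info = ray.get(kernel.start.remote())

    # The ip/ports in `info` are local to the Ray node; they would need to be
    # rewritten to point at the proxied ports on the notebook host.
    with open(args.connection_file, "w") as f:
        json.dump(info, f)


if __name__ == "__main__":
    main()
```

A real wrapper would also have to stay alive for the lifetime of the kernel (Jupyter monitors the process it launched) and rewrite the connection info to point at the proxied ports.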

Should kernels be specified as part of the kuberay.ray.io/v1alpha1.Notebook spec?

Nintorac commented 2 years ago

Isolation is trickier, I think. Ray has namespaces, but as far as I know these still share the filesystem, so anything written can be read by other Ray jobs.

I think the best way may be to rely on k8s and run kernels as their own pods. Then add a custom Ray resource, isolated, and have the autoscaler create a pod whose lifetime is tied to the Ray function. Maybe this should be broken out into another issue, though?
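A sketch of the custom-resource part (the `isolated` resource name and `run_user_kernel` function are hypothetical; the idea is that the worker group advertising this resource would be single-use pods created by the autoscaler):

```python
import subprocess

import ray


# Schedule the kernel onto a pod that advertises one unit of the custom
# "isolated" resource, so at most one user's kernel lands on that pod and
# isolation is enforced at the k8s/pod level rather than inside Ray.
@ray.remote(resources={"isolated": 1}, num_cpus=2)
def run_user_kernel(kernel_cmd):
    return subprocess.run(kernel_cmd).returncode
```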