Support for SageMaker training and inference instances

skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

https://skypilot.readthedocs.io

Apache License 2.0

Support for SageMaker training and inference instances #2338

Open varunvummadi opened 1 year ago

varunvummadi commented 1 year ago

EC2 p4de instances are very hard to get and launch on AWS. It would be great if SkyPilot supported SageMaker training and inference instances.

github-actions[bot] commented 11 months ago

This issue was closed because it has been stalled for 10 days with no activity.

mojivalipour commented 7 months ago

Any update on this?

Michaelvll commented 7 months ago

We haven't had the bandwidth to work on this yet, but our recent refactoring of the job API (#3419) and the command runner (#3157) should make SageMaker support easier. :)

mojivalipour commented 7 months ago

Thanks for the prompt reply, @Michaelvll. Is there anything we can do to help speed this up? It would be helpful if you could explain what the technical barrier is and what needs to be done.

asaiacai commented 7 months ago

I've taken a look at this before. I think one big challenge is that SageMaker makes it very hard to access instances directly via SSH, to the point where there is a separate library for setting up SSH on SageMaker instances. I imagine you would need to set up SSH through sagemaker-ssh-helper so that SkyPilot can do its normal thing.

mojivalipour commented 7 months ago

@asaiacai Thank you for your input. Yeah, that can be challenging. I'll ask the AWS team and let you know if there is a solution for that.

yangcheng commented 1 month ago

@mojivalipour Have you gotten a response on this? We're in a similar boat: we have quota on SageMaker, but no GPUs on EC2.

mojivalipour commented 1 month ago

@yangcheng This is the answer that I've got:

"Sagemaker training is a managed service unlike regular EC2 or Hyperpods. https://github.com/aws-samples/sagemaker-ssh-helper - this is the way to do ssh there. No direct access is allowed"

Let me know if you find a workaround.

yangcheng commented 1 month ago

@mojivalipour We ended up using SageMaker training jobs. Initially we had some issues with Hugging Face Accelerate, but we were able to figure them out.
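For readers following the training-job route mentioned above, here is a minimal sketch of what a request to SageMaker's `CreateTrainingJob` API might look like. It is an illustration, not the setup used in this thread: the helper name `build_training_job_request` is hypothetical, and the image URI, role ARN, and S3 path are placeholders that would need to be replaced with real values.

```python
def build_training_job_request(
    job_name,
    image_uri,
    role_arn,
    output_s3_uri,
    instance_type="ml.p4de.24xlarge",
    instance_count=1,
    volume_size_gb=100,
    max_runtime_seconds=3600,
):
    """Build the keyword arguments for SageMaker's CreateTrainingJob call.

    Hypothetical helper for illustration only; every argument value
    passed in below is a placeholder, not a value from this thread.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,    # container image holding the training code
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,               # IAM role that SageMaker assumes
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": volume_size_gb,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


request = build_training_job_request(
    job_name="example-training-job",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<image>:<tag>",  # placeholder
    role_arn="arn:aws:iam::<account>:role/<role-name>",                   # placeholder
    output_s3_uri="s3://<bucket>/output/",                                # placeholder
)

# With AWS credentials configured, the request could be submitted with boto3:
#   import boto3
#   boto3.client("sagemaker").create_training_job(**request)
```

Because the job runs inside a managed container, there is no SSH step here, which is why this path sidesteps the access problem discussed earlier in the thread.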

chenzikun commented 2 weeks ago

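AWS wants us to use their SageMaker. They offer a way to deploy fine-tuned LLMs as endpoints, and we have in fact done that, but we also have other compute resources that we could bring in. Does SkyPilot have any plans to support SageMaker going forward? If possible, I could also contribute.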