Open varunvummadi opened 1 year ago
This issue was closed because it has been stalled for 10 days with no activity.
Any update on this?
We haven't had the bandwidth to work on this yet, but our recent refactoring of the job API (#3419) and the command runner (#3157) should make it easier to support SageMaker. : )
Thanks for being prompt @Michaelvll. Is there anything we can do to help accelerate this? It would be helpful if you could explain what the technical barrier is and what needs to be done.
I've taken a look at this before. I think one big challenge is that SageMaker makes it very obscure how to access instances directly via SSH. It's to the point where they have a separate library for setting up SSH on SageMaker instances. I imagine you would need to set up SSH through sagemaker-ssh-helper so SkyPilot can do its normal thing.
@asaiacai Thank you for your input. Yeah, that can be challenging. I will ask the AWS team and let you know if there is a solution for that.
@mojivalipour have you gotten a response on this? We are in a similar boat: we have quota on SageMaker, but no GPUs on EC2.
@yangcheng This is the answer that I've got:
"Sagemaker training is a managed service unlike regular EC2 or Hyperpods. https://github.com/aws-samples/sagemaker-ssh-helper - this is the way to do ssh there. No direct access is allowed"
Let me know if you find a workaround.
@mojivalipour We ended up using SageMaker training jobs. Initially it had some issues with Hugging Face Accelerate, but we were able to figure them out.
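AWS wants us to use their SageMaker. They provide a way to deploy fine-tuned LLMs as endpoints, and we have indeed done that, but we also have other compute resources that we could plug in. Does SkyPilot have any plans to support SageMaker going forward? If possible, I could also help with this.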
p4de EC2 instances are super hard to get and launch on AWS. If SkyPilot supported SageMaker training and inference instances, that would be great.