microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[REQUEST] Run autotuning task with Kubernetes #2908

Open topsy404 opened 1 year ago

topsy404 commented 1 year ago

Can I launch an autotuning task with Kubernetes?

I read the Resource Configuration (multi-node) document. It seems that during autotuning DeepSpeed itself specifies the hosts for each job, which conflicts with Kubernetes.

The reason I want to use Kubernetes is that the environment requirements (some packages) of my tasks conflict with each other.

Thanks!

loadams commented 1 year ago

Hi @topsy404 - we know other users are using Kubernetes, can you be more specific about the issue you are hitting?

silverlining21 commented 11 months ago

Hi, if possible, could you give me some hints or details on how to run a DeepSpeed task on k8s? Here is some progress I have made:

But I am not sure how to run the task on a k8s cluster. Should I start one pod for each node? How do I set up the hostfile in a k8s situation?

Any suggestions would be appreciated. Thanks!

loadams commented 11 months ago

Hi @silverlining21 -

I haven't done this myself, but a few other issues include samples of what other users have done for launching and k8s setup:

https://github.com/microsoft/DeepSpeed/issues/274
https://github.com/microsoft/DeepSpeed/issues/4098
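
On the hostfile question specifically, here is a minimal sketch that has not been tested on a real cluster. It assumes one pod per node, with the pods created by a StatefulSet behind a headless Service so that each pod gets a stable DNS name of the form `<statefulset>-<index>.<service>`. The names `ds-worker` and `ds-workers`, the slot count, and both helper functions are hypothetical and only illustrate the idea. The only DeepSpeed-specific part is the hostfile format, which is one `<hostname> slots=<gpu count>` line per host.

```python
def build_hostfile(hostnames, slots_per_host):
    """Return DeepSpeed hostfile text: one '<hostname> slots=<n>' line per host."""
    if slots_per_host < 1:
        raise ValueError("slots_per_host must be at least 1")
    lines = [f"{name} slots={slots_per_host}" for name in hostnames]
    return "\n".join(lines) + "\n"


def statefulset_pod_hosts(statefulset, service, replicas):
    """Hypothetical helper: DNS names of StatefulSet pods behind a headless Service.

    Assumes the launcher pod runs in the same namespace, so the short
    '<pod>.<service>' form resolves.
    """
    return [f"{statefulset}-{i}.{service}" for i in range(replicas)]


# Assumed layout: 2 pods (one per node), 8 GPUs each.
hosts = statefulset_pod_hosts("ds-worker", "ds-workers", replicas=2)
hostfile_text = build_hostfile(hosts, slots_per_host=8)
print(hostfile_text, end="")
```

The resulting text could be written to a file and passed to the launcher with `--hostfile`. Note that the default multi-node launcher reaches the other hosts over SSH, so the pods would probably also need passwordless SSH between them; that part is not covered here.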