kubernetes-sigs / jobset

JobSet: a k8s native API for distributed ML training and HPC workloads
https://jobset.sigs.k8s.io/
Apache License 2.0

Comprehensive example task running training workload on GPUs using JobSet #429

Open danielvegamyhre opened 7 months ago

danielvegamyhre commented 7 months ago

What would you like to be added: A comprehensive example showing how to run a training workload on GPUs using JobSet. We could have one example per major cloud provider.

Why is this needed: We need more concrete examples to reduce friction in user onboarding. Right now we mostly have toy examples that use sleep containers to demonstrate the functionality of different features.

uroy-personal commented 7 months ago

/assign

danielvegamyhre commented 6 months ago

@uroy-personal are you still working on these? If not I am going to unassign them so someone else can work on them.

uroy-personal commented 6 months ago

Hi @danielvegamyhre, yes, I am on it. I need to add the example here, right?
https://github.com/kubernetes-sigs/jobset/blob/main/docs/concepts/README.md

Also, please help me with what content (example YAML) to put there. I hope to finish all the open tasks assigned to me by this weekend.

danielvegamyhre commented 6 months ago

> Hi @danielvegamyhre, Yes I am on it. I need to add the example here right? https://github.com/kubernetes-sigs/jobset/blob/main/docs/concepts/README.md
>
> Also please help me on what content ( example yaml ) to put there. I hope to finish all the open tasks ( assigned to me ) by this week-end.

Yes, you can reference some examples in the examples/ directory to help you get started.

danielvegamyhre commented 6 months ago

Also note it would be nice in the provisioning step to show example commands for all 3 major cloud providers (AWS, GCP, Azure)

uroy-personal commented 6 months ago

> Also note it would be nice in the provisioning step to show example commands for all 3 major cloud providers (AWS, GCP, Azure)

Thanks. I am working on it. Hope to raise the PR in the next few days.

danielvegamyhre commented 5 months ago

@uroy-personal Just following up, are you still working on this?

uroy-personal commented 5 months ago

Yes @danielvegamyhre, I am on it. I made the changes but found that the README page above has been removed. I will complete it within this week for sure! Thanks.

uroy-personal commented 5 months ago

Good morning @danielvegamyhre, I have started the ball rolling here. So far I have added the examples from examples/ to the site's concepts page. Where can I get the example commands for the cloud providers (GCP, AWS, and Azure)? Please help. I will update the PR again.

uroy-personal commented 5 months ago

It seems this issue needs GPU access. Is there a way to get GPU access, @danielvegamyhre?

uroy-personal commented 5 months ago

/unassign

danielvegamyhre commented 5 months ago

@uroy-personal To make this easier, let's not include the steps to provision GPU nodes on each Cloud Provider. Instead, let's just use a generic/placeholder nodeSelector (e.g. your.cloud.provider.com/gpu-type) to indicate to the user this should be replaced.
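
For illustration, a minimal sketch of what a manifest with such a placeholder might look like. This is not taken from the JobSet repository: the JobSet name, replicated job name, image, command, and label value are all hypothetical, the `apiVersion` is an assumption that should be checked against the installed JobSet version, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is running on the GPU nodes.

```yaml
# Illustrative sketch only; names, image, and label value are placeholders.
apiVersion: jobset.x-k8s.io/v1alpha2    # assumption: verify against your installed JobSet version
kind: JobSet
metadata:
  name: gpu-training-example            # hypothetical name
spec:
  replicatedJobs:
  - name: workers                       # hypothetical name
    replicas: 1
    template:                           # a standard batch/v1 Job template
      spec:
        parallelism: 2
        completions: 2
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            nodeSelector:
              # Placeholder: replace the key and value with your provider's GPU node label.
              your.cloud.provider.com/gpu-type: your-gpu-type
            containers:
            - name: trainer
              image: your-registry/your-training-image:tag   # placeholder image
              command: ["python", "train.py"]                 # hypothetical entrypoint
              resources:
                limits:
                  nvidia.com/gpu: 1     # assumes the NVIDIA device plugin is installed
```

The only provider-specific piece is the `nodeSelector` entry, so a reader on any cloud would swap in the GPU node label their cluster actually uses and leave the rest unchanged.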

uroy-personal commented 5 months ago

Thanks @danielvegamyhre, I will have a look and get back to you as soon as possible!

googs1025 commented 4 months ago

/assign

I currently have a GPU environment. The GPU card is not up to date, but I can try it and see.

googs1025 commented 4 months ago

/assign