project-codeflare / codeflare-sdk

An intuitive, easy-to-use python interface for batch resource requesting, access, job submission, and observation. Simplifying the developer's life while enabling access to high-performance compute resources, either in the cloud or on-prem.
Apache License 2.0
22 stars, 39 forks

[RFE] Change num_gpus to a dict to support arbitrary accelerators #467

Open Xaenalt opened 4 months ago

Xaenalt commented 4 months ago

Name of Feature or Improvement

I'd like to replace the hardcoded nvidia.com/gpu resource with a dict (or similar) of resources. There are other accelerators, and it would be nice to specify them as arbitrary key/value pairs rather than hardcoding nvidia.com/gpu.

Description of Problem the Feature Should Solve

Hardcoding nvidia.com/gpu is suboptimal because other accelerators exist, habana.ai/gaudi to name one, along with other potential resources and accelerators, some possibly not even public. Being able to specify these additional resources without editing the template would improve usability.

Describe the Solution You Would Like to See

I'd like to see a constructor something like:

cluster = Cluster(ClusterConfiguration(
    name='raytest',
    namespace='ray-demo',
    num_workers=2,
    min_cpus=8,
    max_cpus=8,
    min_memory=12,
    max_memory=12,
    resources={"habana.ai/gaudi": 1},  # proposed parameter: resource name -> count
    image="quay.io/spryor/ray:synapseai-1.13-torch",
    instascale=False
))

This would simply add the keys/values from the resources argument to the resources requests/limits section. Perhaps there could also be an option to set requests and limits separately, but for a first pass it's totally fine if requests == limits, since for hardware devices they are required to be equal.
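As a rough illustration of the "requests == limits" first pass described above, the sketch below copies each entry of a resources dict into both sections of a container spec. It works on a plain dict shaped like a Kubernetes container entry. The helper name `apply_extended_resources` and the sample `worker` dict are hypothetical and are not taken from the SDK or its template.

```python
from copy import deepcopy


def apply_extended_resources(container: dict, resources: dict) -> dict:
    """Return a copy of `container` with every entry of `resources` written
    to both requests and limits.

    Hypothetical helper for illustration only; not part of codeflare-sdk.
    """
    updated = deepcopy(container)  # leave the caller's dict untouched
    spec = updated.setdefault("resources", {})
    requests = spec.setdefault("requests", {})
    limits = spec.setdefault("limits", {})

    for name, count in resources.items():
        # Device counts are whole, non-negative numbers (bool is excluded
        # explicitly because it is a subclass of int).
        if isinstance(count, bool) or not isinstance(count, int) or count < 0:
            raise ValueError(f"invalid count for {name!r}: {count!r}")
        requests[name] = count
        limits[name] = count  # first pass: requests == limits

    return updated


# Assumed shape of a worker container entry, for demonstration only.
worker = {
    "name": "ray-worker",
    "resources": {
        "requests": {"cpu": 8, "memory": "12G"},
        "limits": {"cpu": 8, "memory": "12G"},
    },
}

patched = apply_extended_resources(worker, {"habana.ai/gaudi": 1})
```

Returning a copy rather than mutating the input keeps the base template reusable across clusters that request different accelerators.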

Describe Alternatives You Have Considered

Alternative formats could include separate min_resources and max_resources arguments, or a string format such as "someresource": "1/2" meaning request 1, limit 2.
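To make the two alternatives concrete, here is a minimal sketch that parses the "request/limit" string form and splits a resources dict into separate minimum and maximum dicts. The function names `parse_resource_spec` and `split_resources` are hypothetical, and the return values only mirror the min_resources/max_resources idea; none of this is an existing SDK API.

```python
def parse_resource_spec(value):
    """Parse 1, "1", or "1/2" into a (request, limit) tuple.

    Hypothetical helper: a bare count means request == limit.
    """
    if isinstance(value, int) and not isinstance(value, bool):
        request = limit = value
    else:
        head, sep, tail = str(value).partition("/")
        request = int(head)
        limit = int(tail) if sep else request

    if request < 0 or limit < request:
        raise ValueError(f"invalid request/limit pair: {value!r}")
    return request, limit


def split_resources(resources: dict):
    """Split {"name": spec} into (min_resources, max_resources) dicts."""
    min_resources, max_resources = {}, {}
    for name, spec in resources.items():
        request, limit = parse_resource_spec(spec)
        min_resources[name] = request
        max_resources[name] = limit
    return min_resources, max_resources


mins, maxes = split_resources({"someresource": "1/2", "habana.ai/gaudi": 1})
```

For hardware devices, where the request and limit must be equal, a caller would still need to use the bare-count form or reject differing values.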

Additional Context

In this case the request is about Habana Gaudi devices, but the need extends beyond them.

anishasthana commented 4 months ago

cc @Bobbins228

Bobbins228 commented 4 months ago

This sounds like a useful change 👍