nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev

[BUG] - GCP GPUs only enabled when set via `guest_accelerator` #1674

Open iameskild opened 1 year ago

iameskild commented 1 year ago

Describe the bug

For GCP deployments that want to use A100 (or similar) GPUs, there is currently no way to ensure that the nvidia drivers are installed (via this daemonset).

This is because we currently check whether the profile has guest_accelerator set and only then deploy the daemonset mentioned above (see here). Most GPUs on GCP are attached to regular CPU instances, and guest_accelerator is how the desired GPU is specified. For GPUs like the A100 this is not the case: they come bundled with dedicated machine types (e.g. a2-highgpu-1g), so there is no guest_accelerator entry, the check never passes, and the drivers are never installed.
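
For contrast, here is a rough sketch of the kind of node group the current check does handle, i.e. a GPU attached to a CPU instance. The node-group name is made up, and the exact spelling of the accelerator keys (guest_accelerator vs. guest_accelerators, name/count) may differ from the current Nebari schema, so treat it as illustrative only. Because this node group carries an explicit accelerator entry, the existing check finds it and deploys the driver daemonset; the A100 node group in the reproduction below has no such entry.

google_cloud_platform:
  ...
  node_groups:
    gpu-tesla-t4-x1:                  # hypothetical node group name
      instance: n1-standard-8         # CPU instance; the GPU is attached on top of it
      min_nodes: 0
      max_nodes: 1
      guest_accelerators:             # the field the current check looks for
        - name: nvidia-tesla-t4
          count: 1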

Expected behavior

We need to make sure the nvidia drivers are installed even for GPUs like the A100.

OS and architecture in which you are running Nebari

Ubuntu

How to Reproduce the problem?

On a Nebari cluster running on GCP, add the following profile and try to launch a server:


google_cloud_platform:
  ...
  node_groups:
    gpu-ampere-a100-x1:
      instance: a2-highgpu-1g      # 1x 40 GB HBM2: Nvidia Ampere A100
      min_nodes: 0
      max_nodes: 1

profiles:
  jupyterlab:
  - display_name: A100 GPU Instance 1x
    description: GPU instance with 12cpu 85GB / 1 Nvidia A100 GPU
    kubespawner_override:
      ...
      node_selector:
        "cloud.google.com/gke-nodepool": "gpu-ampere-a100-x1"

Command output

No response

Versions and dependencies used.

No response

Compute environment

None

Integrations

No response

Anything else?

No response

viniciusdc commented 1 year ago

@iameskild did we try passing the guest_accelerator field together with the node_selector?

viniciusdc commented 1 year ago

Another option would be to disable the check for guest_accelerator when a certain flag is passed (we could use the node_selector for this as well). Another direction would be to refactor both the validation logic and the way we pass the GPU config over. What if we decoupled the GPUs from the profile section and gave them their own logic?

gpu:
  enabled: true
  profiles:
    - profile: ......    # target profile to use GPU
      family_type: a2-highgpu-1g|gpu-ampere-a100-x1    # or we can pass a node_selector value instead and do the logic before sending to terraform

iameskild commented 1 year ago

Thanks for the feedback @viniciusdc!

I haven't tried adding guest_accelerator partly because it wasn't necessary and partly because adding it might have undesired effects.

I think the guest_accelerator check makes sense but perhaps having another section, as you mentioned, would be helpful as well. Since we only need the nvidia-driver daemonset to be applied once, what if we just added:

google_cloud_platform:
  gpu:
    enabled: true
  node_groups:
    ...

As far as I can tell, AWS needs to know the node-group name but GCP just needs the single daemonset applied.

And as a final thought, why not just have the daemonset applied for all deployments?
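
For reference, here is a minimal sketch (written from memory rather than copied from the upstream GoogleCloudPlatform manifest, so exact fields may differ) of the scheduling-relevant parts of that nvidia-driver-installer daemonset. Its node affinity only matches nodes carrying the cloud.google.com/gke-accelerator label, which GKE puts on GPU nodes (including A2/A100 node pools), so applying it to every deployment should effectively be a no-op on clusters without GPU nodes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: nvidia-driver-installer
  template:
    metadata:
      labels:
        k8s-app: nvidia-driver-installer
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.google.com/gke-accelerator
                    operator: Exists          # only GPU nodes carry this label
      tolerations:
        - operator: Exists                    # run despite GPU taints
      # initContainers (the actual driver installer) and the pause container
      # are omitted from this sketch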

cc @costrouc

dcmcand commented 8 months ago

@Adam-D-Lewis Do you know if this has been fixed?

Adam-D-Lewis commented 8 months ago

We were able to run A100s on Nebari by updating the nvidia drivers (that change was merged into Nebari).

Adam-D-Lewis commented 8 months ago

I don't remember having the issue with guest accelerators that they describe.