Open iameskild opened 1 year ago
@iameskild did we try passing the guest_accelerator field together with the node_selector?
Another way would be to disable the check for guest_accelerator when a certain flag is passed (we could use the node_selector for this as well). Another direction would be to refactor both the validation logic and the way we pass the GPU config over.
What if we decoupled the GPUs from the profile section and gave them their own logic?
GPU:
  enabled: true
  - profile: ...... # target profile to use GPU
    family_type: a2-highgpu-1g|gpu-ampere-a100-x1 # or we can pass a node_selector value instead and do the logic before sending to terraform
Thanks for the feedback @viniciusdc!
I haven't tried adding guest_accelerator, partly because it wasn't necessary and partly because adding it might have undesired effects.
I think the guest_accelerator check makes sense, but perhaps having another section, as you mentioned, would be helpful as well. Since we only need the nvidia-driver daemonset to be applied once, what if we just added:
google_cloud_platform:
  gpu:
    enabled: true
  node_groups:
    ...
As far as I can tell, AWS needs to know the node-group name but GCP just needs the single daemonset applied.
And as a final thought, why not just have the daemonset applied for all deployments?
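To make the flag idea concrete, here is a minimal sketch of how an explicit `gpu.enabled` switch could sit next to the existing guest_accelerator check. The function name `should_deploy_driver_daemonset`, the dict layout, and the `guest_accelerators` key are assumptions for illustration only, not Nebari's actual code or schema.

```python
from typing import Any, Mapping


def should_deploy_driver_daemonset(gcp_config: Mapping[str, Any]) -> bool:
    """Hypothetical helper: decide whether the nvidia-driver daemonset is needed.

    Assumes `gcp_config` is the parsed `google_cloud_platform` section.
    """
    # Proposed explicit opt-in: google_cloud_platform.gpu.enabled
    gpu_section = gcp_config.get("gpu") or {}
    if gpu_section.get("enabled", False):
        return True

    # Fallback to the current behaviour: any node group that declares
    # guest accelerators (key name is an assumption).
    node_groups = gcp_config.get("node_groups") or {}
    return any(
        (group or {}).get("guest_accelerators")
        for group in node_groups.values()
    )
```

With this shape, an A100 deployment would only need to set the flag, while existing configs that rely on guest accelerators keep working unchanged.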
cc @costrouc
@Adam-D-Lewis Do you know if this has been fixed?
We were able to run A100s on Nebari by updating the nvidia drivers (that change was merged into Nebari).
I don't remember having the issue with guest accelerators that they describe.
Describe the bug
For GCP deployments that want to use A100 (or similar) GPUs, there is no way of making sure the nvidia drivers are installed (via this daemonset).
This is because we currently check whether the profile has guest_accelerator and only then deploy the daemonset mentioned above (see here). Most GPUs on GCP are attached to CPU instances and use this guest_accelerator to specify the desired GPU. For GPUs like A100, this is not the case.
Expected behavior
We need to make sure the nvidia drivers are installed even for GPUs like A100.
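One possible way to cover both cases, sketched under assumptions: treat a node group as GPU-backed if it either declares guest accelerators or uses a machine type that comes with GPUs built in. The helper name `needs_nvidia_driver`, the `instance` and `guest_accelerators` keys, and the prefix list are hypothetical and would need to be checked against the real config schema.

```python
from typing import Any, Mapping

# Assumption: machine-type prefixes whose instances include GPUs without a
# separate guest_accelerator entry. Only the A2 (A100) family from this
# issue is listed; other families would have to be added deliberately.
GPU_MACHINE_PREFIXES = ("a2-",)


def needs_nvidia_driver(node_group: Mapping[str, Any]) -> bool:
    """Hypothetical check: does this node group need the nvidia drivers?"""
    # Case 1: GPU attached to a CPU instance via guest accelerators.
    if node_group.get("guest_accelerators"):
        return True
    # Case 2: GPU implied by the machine type itself (e.g. a2-highgpu-1g).
    instance = str(node_group.get("instance", ""))
    return instance.startswith(GPU_MACHINE_PREFIXES)
```

For example, `needs_nvidia_driver({"instance": "a2-highgpu-1g"})` returns `True` even though no guest accelerator is declared, while a plain `n1-standard-8` node group returns `False`.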
OS and architecture in which you are running Nebari
Ubuntu
How to Reproduce the problem?
On a Nebari cluster running on GCP, add the following profile and try to launch a server:
Command output
No response
Versions and dependencies used.
No response
Compute environment
None
Integrations
No response
Anything else?
No response