Closed frezbo closed 1 year ago
An example usage with a var file:
{
  "cluster_name": "talos-nvidia-test",
  "num_control_planes": 1,
  "num_workers": 0,
  "ami_id": "ami-034f35c36088696a8",
  "instance_type_control_plane": "t3.medium",
  "config_patch_files_worker": [
    "patch.yaml"
  ],
  "extra_tags": {
    "Project": "talos-nvidia-test",
    "Environment": "ci test",
    "Owner": "frezbo"
  },
  "node_groups": [
    {
      "name": "nvidia-t4",
      "num_instances": 2,
      "instance_type": "g4dn.xlarge",
      "tags": {
        "Type": "nvidia-t4"
      }
    },
    {
      "name": "nvidia-a100",
      "num_instances": 1,
      "instance_type": "p4d.24xlarge",
      "tags": {
        "Type": "nvidia-a100"
      },
      "config_patch_files": [
        "patch-a100.yaml"
      ]
    }
  ]
}
patch.yaml:
machine:
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
  sysctls:
    net.core.bpf_jit_harden: 1
  install:
    extensions:
      - image: ghcr.io/frezbo/nvidia-container-toolkit:535.54.03-v1.13.5
      - image: ghcr.io/frezbo/nvidia-open-gpu-kernel-modules:535.54.03-v1.5.0-alpha.3-2-gc59245d-dirty
patch-a100.yaml:
machine:
  install:
    extensions:
      - image: ghcr.io/frezbo/nvidia-fabricmanager:535.54.03
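For reference, a variable declaration that accepts the `node_groups` shape from the var file could look roughly like the sketch below. It is an assumption for illustration, not the module's actual definition: the attribute names are taken from the example above, while the description text and the `optional(...)` defaults (which need Terraform 1.3 or later) are guesses.

```hcl
# Hypothetical declaration; the attribute names mirror the example var file,
# but the real variable block in the module may differ.
variable "node_groups" {
  description = "Additional groups of worker nodes, each with its own instance type."

  type = list(object({
    name          = string
    num_instances = number
    instance_type = string

    # Per-group tags; assumed to default to an empty map when omitted.
    tags = optional(map(string), {})

    # Per-group config patches; assumed to default to an empty list,
    # as in the "nvidia-t4" group above, which sets none.
    config_patch_files = optional(list(string), [])
  }))

  default = []
}
```

With a declaration along these lines, a group that omits `config_patch_files` would still type-check, which matches how the two groups in the var file differ.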
Generally fine with this, but I think calling them "node groups" is a bit confusing, since that's an EKS construct. "worker groups" feels more appropriate imo.
Makes sense, I'll update.
/m
Make the AWS code more modular so that it can be used in CI.