unionai-oss / deploy-flyte

A set of IaC artifacts to automatically configure the infrastructure resources needed by a Flyte deployment
Apache License 2.0

Nodes - Pods Tolerations #4

Closed uriafranko closed 9 months ago

uriafranko commented 9 months ago

After getting a few things to work, it seems that the nodes are tainted with flyte.org/node-role = worker, but the pods scheduled by Flyte do not include a matching toleration. Any idea how to fix this?

[Screenshot: CleanShot 2023-09-12 at 11 50 29]
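For reference, a NoSchedule taint on a node only repels pods that do not declare a matching toleration, so the task pods Flyte launches need that toleration in their spec. A minimal sketch of what such a pod spec would contain (the pod name, image, and command are placeholders, not Flyte defaults):

apiVersion: v1
kind: Pod
metadata:
  name: flyte-task-example  # placeholder name
spec:
  tolerations:
    - key: 'flyte.org/node-role'
      operator: 'Equal'
      value: 'worker'
      effect: 'NoSchedule'
  containers:
    - name: task
      image: busybox  # placeholder image
      command: ['sleep', '3600']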

uriafranko commented 9 months ago

Adding the following lines helped...

default-tolerations:
  - key: 'flyte.org/node-role'
    operator: 'Equal'
    value: 'worker'
    effect: 'NoSchedule'
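For context, default-tolerations is part of FlytePropeller's k8s plugin configuration, so these lines belong in the Flyte Helm values rather than on the nodes. With the flyte-core chart the snippet would typically sit under the k8s plugin block, roughly like this (the exact key path is an assumption and depends on the chart and version you deploy):

configmap:
  k8s:
    plugins:
      k8s:
        default-tolerations:
          - key: 'flyte.org/node-role'
            operator: 'Equal'
            value: 'worker'
            effect: 'NoSchedule'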

Ended up with:

[Screenshot: CleanShot 2023-09-12 at 12 27 32@2x]

uriafranko commented 9 months ago

Fixed after manually installing nvidia-device-plugin in the cluster with those settings:

tolerations:
  - key: 'nvidia.com/gpu'
    effect: 'NoSchedule'
    value: 'present'
  - key: 'flyte.org/node-role'
    operator: 'Equal'
    value: 'worker'
    effect: 'NoSchedule'

Now the nvidia.com/gpu resource is exposed, but for some reason the nvidia-device-plugin pods try to start on non-GPU nodes as well...

davidmirror-ops commented 9 months ago

@uriafranko Fixed. I changed the taints map in the EKS module. The labels are still there (they may be useful for filtering), and the GPU taints are still there, but there is no longer a flyte.org/node-role=worker taint.
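To double-check what ended up on the nodes after that change, listing the taints directly is usually enough (plain kubectl, nothing repo-specific):

kubectl describe nodes | grep -E '^Name:|^Taints:'
# GPU nodes should still show nvidia.com/gpu=present:NoSchedule,
# and no node should show flyte.org/node-role=worker:NoSchedule anymore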

Thanks

qchenevier commented 9 months ago

(Quoting @uriafranko's comment above about installing the nvidia-device-plugin and the nvidia pods trying to start on non-GPU nodes.)

For people facing the same issue: the nvidia-device-plugin is an additional helm install in the k8s cluster (on top of the Flyte installation with helm). Here is how to install it with helm.
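In case the linked instructions move, the upstream chart is normally added from NVIDIA's Helm repository; the nvdp alias below matches the install command further down (a sketch of the usual steps):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update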

And to avoid the nvidia-device-plugin being deployed on non-GPU nodes (where its pods end up in CrashLoopBackOff), you can add a nodeSelector to the nvidia-device-plugin configuration like this:

nodeSelector: 
  k8s.amazonaws.com/accelerator: nvidia-tesla-t4  # pick a label which is specific to your GPU nodes, to select them
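If you are not sure which label your GPU nodes carry, you can list a candidate label across all nodes before picking one (the k8s.amazonaws.com/accelerator key here is taken from the config above; label keys vary by provisioner, so check what your GPU nodes actually have):

kubectl get nodes -L k8s.amazonaws.com/accelerator
# nodes with a value in the ACCELERATOR column are the ones the nodeSelector will match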

In the end, my config file (called nvidia-device-plugin-values.yaml) looks like this:

nodeSelector: 
  k8s.amazonaws.com/accelerator: nvidia-tesla-t4
tolerations:
  - key: 'nvidia.com/gpu'
    effect: 'NoSchedule'
    value: 'present'
  - key: 'flyte.org/node-role'
    operator: 'Equal'
    value: 'worker'
    effect: 'NoSchedule'

And I triggered the plugin install with that file:

helm install nvdp nvdp/nvidia-device-plugin --version=0.14.1 --namespace nvidia-device-plugin --create-namespace --values=nvidia-device-plugin-values.yaml
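After the install, a quick way to confirm the plugin only landed on GPU nodes and that the GPU resource is advertised (the node name is a placeholder):

kubectl get pods -n nvidia-device-plugin -o wide
# the listed pods should only be running on GPU nodes
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
# nvidia.com/gpu should appear under Capacity and Allocatable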

davidmirror-ops commented 9 months ago

Thank you @qchenevier! I was wondering if you'd like to add these instructions to the tutorial?