rstudio / helm

Helm Resources for RStudio Products
MIT License

NVIDIA GPU with Posit Connect apps #355

Closed bjfletcher closed 1 year ago

bjfletcher commented 1 year ago

Ey up!

We're trying to get GPU to work with Posit Connect. This is of the Kubernetes flavour, with AWS's EKS.

So we've developed an app that uses CUDA (NVIDIA) code through PyTorch.

The machine type on AWS EKS is g4dn, which has been set up with an AMI that supports both EKS and GPUs.

The Posit documentation isn't 100% clear, but my understanding is that the way to ask Posit Connect to use a GPU is through this configuration in the Helm chart:

launcher:
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1
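
For context, my understanding is that the chart renders that launcherKubernetesProfilesConf block into the launcher's INI-style profiles file, so I'd expect the end result to look roughly like this (a sketch of what I expected, not what I actually found):

[*]
default-nvidia-gpus=1
max-nvidia-gpus=1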

However, when we deploy the app, we see this error on the pod (in its scheduling events):

0/4 nodes are available: 4 Insufficient amd.com/gpu.

which perplexed me, because we asked for NVIDIA, not AMD. So I checked the pod's manifest and saw this:

    Limits:
      amd.com/gpu:     1
      nvidia.com/gpu:  1
    Requests:
      amd.com/gpu:     1
      nvidia.com/gpu:  1

I "SSHed" into the pod and looked in /etc/rstudio-connect/launcher/launcher.kubernetes.profiles.conf:

[*]
default-amd-gpus=0
default-nvidia-gpus=1

I tried explicitly asking that AMD be excluded with:

  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1
      default-amd-gpus: 0 # also tried 0.0
      max-amd-gpus: 0 # also tried 0.0

However, they still appeared as 1 and not 0:

    Limits:
      amd.com/gpu:     1
      nvidia.com/gpu:  1
    Requests:
      amd.com/gpu:     1
      nvidia.com/gpu:  1
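
For what it's worth, our g4dn nodes only advertise nvidia.com/gpu (no amd.com/gpu), which presumably is why the scheduler can never satisfy that extra request. A quick way to check what each node actually exposes, if useful:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable}{"\n"}{end}'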

I tested outside of Posit Connect by creating a pod manifest myself:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: benfletcherft/rstudio-content-base:r4.1.0-py3.9.2-ubuntu1804-cuda11.4.3
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

and it worked great.
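
By "worked great" I mean the pod scheduled onto one of the GPU nodes and started. The usual checks are enough to confirm that:

kubectl get pod gpu-pod -o wide
kubectl describe pod gpu-pod

(the NODE column and the Events section are the interesting bits).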

I tried to find where the manifest gets generated, but it seems that this particular piece isn't open source, so I was at a loss as to how to get this working. :( I wonder whether there is a bug somewhere in the manifest generator that accidentally produces the AMD config alongside the NVIDIA one?

All the best,

Ben

atheriel commented 1 year ago

Can you provide the Helm chart and Connect versions you're seeing this with? Edit: are you using the templating feature?

bjfletcher commented 1 year ago

Helm chart: 0.3.18
Posit Connect: v2023.03.0 Build v2023.03.0-0-g927f384

As to your latter question, I believe that is how it works - values from the Helm chart eventually end up in /etc/rstudio-connect/launcher/launcher.kubernetes.profiles.conf, but with an incorrect outcome (AMD added when it shouldn't be).

msarahan commented 1 year ago

Sorry, we have all of our Kubernetes experts out at the moment. I'm looking into the issue, but it's taking longer than I would like to learn the ropes.

When @atheriel says "are you using the templating feature", I believe he means a section like this in your helm chart values YAML file:

launcher:
  useTemplates: True
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1

The default value is True, so I am pretty sure the answer is yes.
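
If you want to double-check what's actually set on your release, helm can dump both the user-supplied and the computed values (the release name and namespace here are placeholders; substitute your own):

helm get values rstudio-connect -n rsc
helm get values rstudio-connect -n rsc --all

The second command includes chart defaults, so useTemplates will show up there even if you never set it explicitly.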

It looks to me like a launcher bug, rather than a problem with the helm charts, although the launcher code in question is much older than the helm charts. The template code that is in use has not changed in roughly 2 years.

Sorry for the delay. Hopefully we can get you more info early next week.

AlexMapley commented 1 year ago

I wasn't able to reproduce this on the 0.3.18 Connect helm chart, or on some other versions. I've been setting:

launcher:
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1

in my Connect values, and then tunneling into my Connect pods, where I find the following seemingly correct launcher.kubernetes.profiles.conf contents:

# cat /etc/rstudio-connect/launcher/launcher.kubernetes.profiles.conf
[*]
default-nvidia-gpus=1
max-nvidia-gpus=1

So, no reference to AMD GPUs. I'm still trying to track down where this might be coming from; I suspect it's happening at the early Helm templating layer. The templating code in question has been unchanged for a while now, from our oldest to our newest launcher job template YAMLs.

I'm not sure why specifying NVIDIA GPUs would cause AMD GPU definitions to appear - or why those AMD GPU definitions wouldn't appear in all other cases, if they show up even when not specified. I'll keep at it though; we have some more testing we'd like to conduct.
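
For anyone following along who wants to see what the chart alone renders for a given values file, something like this should work (assuming the rstudio Helm repo has been added via helm repo add rstudio https://helm.rstudio.com, and using an arbitrary release name):

helm template my-connect rstudio/rstudio-connect --version 0.3.18 -f values.yaml | grep -i -B2 -A4 gpu

which shows where the gpu settings end up in the rendered manifests.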

AlexMapley commented 1 year ago

Hey @bjfletcher, for that launcher values block:

launcher:
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1

did you specify any other values, or was this the full launcher: block in your values.yaml file? Just wondering for my own investigation, in case I'm missing something. If the issue is coming from the helm charts, I suspect it lies somewhere in that values block.

bjfletcher commented 1 year ago

Ey up @AlexMapley! Thanks so much for looking into this.

I've got some more information that may help.

I've upgraded our Helm chart to 0.4.0. The sidebar in the Connect UI now says:

Posit Connect v2023.03.0
Build v2023.03.0-0-g927f384

The values.yml launcher block:

launcher:
  DataDirPVCName: rsc-pvc
  enabled: true
  namespace: rsc
  templateValues:
    pod:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: connect.rstudio.com/content-guid
                  operator: Exists
              topologyKey: kubernetes.io/hostname
            weight: 100
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1

That is the full launcher block; I've not omitted anything from it. You can refer to support ticket number 88353 for the background on the affinity section, along with the full values.yml.

I SSHed into the Connect pod:

$ kubectl exec -it -n rsc rstudio-connect-6cb5748f46-v4fmd -- bash
Defaulted container "connect" out of: connect, exporter
root@rstudio-connect-6cb5748f46-v4fmd:/# cat /etc/rstudio-connect/launcher/launcher.kubernetes.profiles.conf

[*]
default-nvidia-gpus=1
max-nvidia-gpus=1

which surprised me - in a good way, as I expected to see the AMD lines I mentioned previously. So it looks like that part of the problem is now gone.

HOWEVER...

The API content items are still not starting up. So I checked one of the API pods:

$ kubectl -n rsc describe pod run-python-application-7xhdx-w44mj
Name:           run-python-application-7xhdx-w44mj
Namespace:      rsc
Priority:       0
Node:           <none>
Labels:         app.kubernetes.io/component=k8s-launcher
                app.kubernetes.io/name=rstudio_connect
                app.kubernetes.io/version=2023.03.0
                connect.rstudio.com/bundle-id=199
                connect.rstudio.com/content-guid=17a68aef-1da6-4f52-afba-feb166d1bd4c
                connect.rstudio.com/content-id=3
                connect.rstudio.com/job-key=HPI1K4Jt57SZzxwV
                connect.rstudio.com/job-tag=run_fastapi_app
                connect.rstudio.com/python-version=3.9.2
                controller-uid=d2930c98-a810-45df-ae4b-fc0c8bf23169
                job-name=run-python-application-7xhdx
Annotations:    connect.rstudio.com/server-address: https://connect.in.ft.com/
                kubernetes.io/psp: eks.privileged
                name: run Python application
                service_ports: [{"targetPort":3939,"protocol":"TCP"},{"targetPort":50734,"protocol":"TCP"}]
                stdin:
                  {"Environment":[{"name":"CONNECT_SERVER","value":"hCLF3HVz98xwWzGie0J/ncyIQ+5FLVFVmz1fv0zDthxazI4uoXM5FEcCJdll3bOHSEtirKZMgVYjSxlTvdqa8pcJ...
                user: rstudio-connect
                user_metadata:
                  {"job":{"annotations":{"connect.rstudio.com/server-address":"https://connect.in.ft.com/"},"labels":{"app.kubernetes.io/component":"k8s-lau...
Status:         Pending
IP:
IPs:            <none>
Controlled By:  Job/run-python-application-7xhdx
Init Containers:
  init:
    Image:        ghcr.io/rstudio/rstudio-connect-content-init:bionic-2023.03.0
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /mnt/rstudio-connect-runtime/ from rsc-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kgvpr (ro)
Containers:
  rs-launcher-container:
    Image:      ghcr.io/rstudio/content-base:r4.0.5-py3.9.2-bionic
    Port:       <none>
    Host Port:  <none>
    Command:
      /opt/rstudio-connect/ext/env-manager
    Args:
      /opt/rstudio-connect/scripts/with-locale.sh
      /opt/rstudio-connect/ext/rsc-session
      -d
      /opt/rstudio-connect/mnt/job
      -a
      0.0.0.0:50734
      -i
      600
      -g
      2
      -b
      content-guid:17a68aef-1da6-4f52-afba-feb166d1bd4c
      -b
      content-id:3
      -b
      bundle-id:199
      /opt/python/3.9.2/bin/python3.9
      /opt/rstudio-connect/python/run_app.py
    Limits:
      amd.com/gpu:     1
      nvidia.com/gpu:  1
    Requests:
      amd.com/gpu:     1
      nvidia.com/gpu:  1
    Environment:
      USER:      rstudio-connect
      USERNAME:  rstudio-connect
      LOGNAME:   rstudio-connect
      HOME:      /tmp
      TMPDIR:    /tmp
    Mounts:
      /opt/rstudio-connect from rsc-volume (rw)
      /opt/rstudio-connect/mnt/app from mount0 (rw,path="apps/3/199")
      /opt/rstudio-connect/mnt/job from mount0 (rw,path="jobs/3/HPI1K4Jt57SZzxwV")
      /opt/rstudio-connect/mnt/python-environments from mount0 (ro,path="python-environments")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kgvpr (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  mount0:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  rsc-pvc
    ReadOnly:   false
  rsc-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-kgvpr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 amd.com/gpu:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason             Age                  From                Message
  ----     ------             ----                 ----                -------
  Normal   NotTriggerScaleUp  2m48s                cluster-autoscaler  pod didn't trigger scale-up:
  Warning  FailedScheduling   15s (x3 over 2m50s)  default-scheduler   0/4 nodes are available: 4 Insufficient amd.com/gpu.

That tells me a few things.

  1. the Events section suggests that, again, the scheduler was looking for AMD capacity when we have only NVIDIA nodes
  2. the Limits, Requests and Tolerations sections suggest that the generated configuration included requirements for AMD GPUs (see the jsonpath one-liner below for pulling just those fields)
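
In case the full describe output is unwieldy, the same fields can be pulled directly with a jsonpath query along these lines:

kubectl -n rsc get pod run-python-application-7xhdx-w44mj \
  -o jsonpath='{.spec.containers[0].resources}{"\n"}{.spec.tolerations}{"\n"}'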

One thing that may be useful: the AMI we're using is:

name: amazon-eks-gpu-node-1.23-v20230304
ID: ami-014452f3ad6f7d021

(it's a public image, so you should be able to access it)
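
If it helps, the AMI details can be looked up with the AWS CLI (the region below is a placeholder; use whichever region your cluster runs in):

aws ec2 describe-images --image-ids ami-014452f3ad6f7d021 --region us-east-1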

The machine type we're using is g4dn.xlarge.

If you & Posit use a different machine type (for GPU) and/or a different AMI, let me know and we'll switch :)

AlexMapley commented 1 year ago

Thanks @bjfletcher, that's very helpful! I've been updating my team's environment to follow your setup; I'll post some more notes here soon.

AlexMapley commented 1 year ago

I ran a similar setup, @bjfletcher, and always ended up with:

    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1

as my launcher container requests/limits - I haven't been able to reproduce the issue.

I've traced through the Helm and lower-level generic templating code and couldn't find any bugs there yet. What I suspect is that Connect is somehow submitting a bad manifest to the launcher API, one that includes requests for both AMD and NVIDIA GPUs.

It is possible AMD GPUs are being specified elsewhere, outside our Helm values and not via /etc/rstudio-connect/launcher/launcher.kubernetes.profiles.conf. If you have the JSON/YAML configuration of that run-python-application-7xhdx-w44mj app, that might hold the key; I'm wondering if we're somehow getting an extra AMD GPU requested at that layer.

bjfletcher commented 1 year ago

Ey up @AlexMapley! When you said JSON/YAML configuration of the pod, did you mean the output from kubectl -n rsc describe pod run-python-application-7xhdx-w44mj? (The full output is in my last comment.) But I think you're after something different, right? Do you have the command you'd like me to run? Cheers, Ben

dbkegley commented 1 year ago

I have tested this in the latest release and can confirm that this is a bug in the job-launcher. Using the following values in Connect's values file:

launcher:
  enabled: true
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1

And running any piece of content produces a pod with the following requests/limits:

    Limits:
      amd.com/gpu:     1
      nvidia.com/gpu:  1
    Requests:
      amd.com/gpu:     1
      nvidia.com/gpu:  1

The values in the config file are correct though:

root@connect-rstudio-connect-7fdbc68544-l7r4n:/opt/rstudio-connect# cat /etc/rstudio-connect/launcher/launcher.kubernetes.profiles.conf
[*]
default-nvidia-gpus=1
max-nvidia-gpus=1

kfeinauer commented 1 year ago

@dbkegley I see the cause of this on the Launcher side. Do you want me to ship you a patch build to try?

dbkegley commented 1 year ago

Yes please, thanks @kfeinauer. I'll go ahead and confirm the fix, but it's unlikely this will make it into the 2023.05 release of Connect. I'll make a note that this is something we could address if there's a patch release for 2023.05.

dbkegley commented 1 year ago

I have just verified that this will be fixed in the next release of Connect v2023.05.0.

The following config:

launcher:
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1

yields the expected requests and limits for content pods:

    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1

dbkegley commented 1 year ago

This is fixed by https://github.com/rstudio/helm/releases/tag/rstudio-connect-0.5.0-rc01

bjfletcher commented 1 year ago

I can confirm this has indeed been fixed. Nice one @dbkegley & team 🎉