Closed: bjfletcher closed this issue 1 year ago.
Can you provide the Helm chart and Connect versions you're seeing this with? Edit: are you using the templating feature?
Helm chart: 0.3.18
Posit Connect: v2023.03.0 Build v2023.03.0-0-g927f384
As to your latter question, I believe that is how it works - values from the Helm chart eventually go into /etc/rstudio-connect/launcher/launcher.kubernetes.profiles.conf, but with an incorrect outcome (AMD GPUs are added when they shouldn't be).
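A rough sketch of that mapping, based on the values and config file quoted later in this thread:

# Helm values.yaml excerpt
launcher:
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1

# ...which the chart renders into /etc/rstudio-connect/launcher/launcher.kubernetes.profiles.conf as:
[*]
default-nvidia-gpus=1
max-nvidia-gpus=1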
Sorry, we have all of our Kubernetes experts out at the moment. I'm looking into the issue, but it's taking me longer than I would like to learn the ropes.
When @atheriel says "are you using the templating feature", I believe he means a section like this in your helm chart values YAML file:
launcher:
  useTemplates: True
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1
The default value is True, so I am pretty sure the answer is yes.
It looks to me like a launcher bug, rather than a problem with the helm charts, although the launcher code in question is much older than the helm charts. The template code that is in use has not changed in roughly 2 years.
Sorry for the delay. Hopefully we can get you more info early next week.
I wasn't able to reproduce this on the 0.3.18 connect helm chart, or on some other versions.
I've been setting:
launcher:
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1
in my connect values, and then tunneling into my connect pods to find the following, seemingly correct, launcher.kubernetes.profiles.conf contents:
# cat /etc/rstudio-connect/launcher/launcher.kubernetes.profiles.conf
[*]
default-nvidia-gpus=1
max-nvidia-gpus=1
So no reference to amd gpus. I'm still trying to track down where this might have been coming from; I suspect this is happening at the early helm templating layer. The templating code in question has been unchanged for a while now, from our oldest to our newest launcher job template yamls:
I'm not sure why specifying nvidia gpus would cause amd gpu definitions to appear - or why these amd gpu definitions wouldn't appear in all other cases, if they seemingly appear even when not specified. I'll keep at it though; we have some more testing we'd like to conduct.
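One way to check whether the chart itself injects the amd entries is to render it locally and grep the output - a sketch, assuming the chart is installed from the rstudio Helm repo and the values above live in values.yaml:

# render the chart without installing it and look for GPU resource references
helm template connect rstudio/rstudio-connect --version 0.3.18 -f values.yaml | grep -i gpu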
Hey @bjfletcher, for that launcher values block:
launcher:
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1
did you specify any other values, or was this the full block for launcher: in your values.yaml file?
Just wondering for my own investigation, in case I'm missing something. I suspect the issue may lie somewhere related to that values block, if the issue is coming from the helm charts.
Ey up @AlexMapley! Thanks so much for looking into this.
I've got some more information that may help.
I've upgraded our Helm chart to 0.4.0. The sidebar in the Connect UI now says:
Posit Connect v2023.03.0
Build v2023.03.0-0-g927f384
The values.yml launcher block:
launcher:
  DataDirPVCName: rsc-pvc
  enabled: true
  namespace: rsc
  templateValues:
    pod:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: connect.rstudio.com/content-guid
                  operator: Exists
              topologyKey: kubernetes.io/hostname
            weight: 100
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1
That is the full launcher block; I've not omitted anything from here. You can refer to support ticket number 88353 regarding the affinity section, along with the full values.yml.
I SSHed into the Connect pod:
$ kubectl exec -it -n rsc rstudio-connect-6cb5748f46-v4fmd -- bash
Defaulted container "connect" out of: connect, exporter
root@rstudio-connect-6cb5748f46-v4fmd:/# cat /etc/rstudio-connect/launcher/launcher.kubernetes.profiles.conf
[*]
default-nvidia-gpus=1
max-nvidia-gpus=1
which surprised me - it's good news, as I expected to see the AMD lines I mentioned previously, so it looks like that problem is now gone.
HOWEVER...
The API content items are still not starting up. So I checked one of the API pods:
$ kubectl -n rsc describe pod run-python-application-7xhdx-w44mj
Name: run-python-application-7xhdx-w44mj
Namespace: rsc
Priority: 0
Node: <none>
Labels: app.kubernetes.io/component=k8s-launcher
app.kubernetes.io/name=rstudio_connect
app.kubernetes.io/version=2023.03.0
connect.rstudio.com/bundle-id=199
connect.rstudio.com/content-guid=17a68aef-1da6-4f52-afba-feb166d1bd4c
connect.rstudio.com/content-id=3
connect.rstudio.com/job-key=HPI1K4Jt57SZzxwV
connect.rstudio.com/job-tag=run_fastapi_app
connect.rstudio.com/python-version=3.9.2
controller-uid=d2930c98-a810-45df-ae4b-fc0c8bf23169
job-name=run-python-application-7xhdx
Annotations: connect.rstudio.com/server-address: https://connect.in.ft.com/
kubernetes.io/psp: eks.privileged
name: run Python application
service_ports: [{"targetPort":3939,"protocol":"TCP"},{"targetPort":50734,"protocol":"TCP"}]
stdin:
{"Environment":[{"name":"CONNECT_SERVER","value":"hCLF3HVz98xwWzGie0J/ncyIQ+5FLVFVmz1fv0zDthxazI4uoXM5FEcCJdll3bOHSEtirKZMgVYjSxlTvdqa8pcJ...
user: rstudio-connect
user_metadata:
{"job":{"annotations":{"connect.rstudio.com/server-address":"https://connect.in.ft.com/"},"labels":{"app.kubernetes.io/component":"k8s-lau...
Status: Pending
IP:
IPs: <none>
Controlled By: Job/run-python-application-7xhdx
Init Containers:
init:
Image: ghcr.io/rstudio/rstudio-connect-content-init:bionic-2023.03.0
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/mnt/rstudio-connect-runtime/ from rsc-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kgvpr (ro)
Containers:
rs-launcher-container:
Image: ghcr.io/rstudio/content-base:r4.0.5-py3.9.2-bionic
Port: <none>
Host Port: <none>
Command:
/opt/rstudio-connect/ext/env-manager
Args:
/opt/rstudio-connect/scripts/with-locale.sh
/opt/rstudio-connect/ext/rsc-session
-d
/opt/rstudio-connect/mnt/job
-a
0.0.0.0:50734
-i
600
-g
2
-b
content-guid:17a68aef-1da6-4f52-afba-feb166d1bd4c
-b
content-id:3
-b
bundle-id:199
/opt/python/3.9.2/bin/python3.9
/opt/rstudio-connect/python/run_app.py
Limits:
amd.com/gpu: 1
nvidia.com/gpu: 1
Requests:
amd.com/gpu: 1
nvidia.com/gpu: 1
Environment:
USER: rstudio-connect
USERNAME: rstudio-connect
LOGNAME: rstudio-connect
HOME: /tmp
TMPDIR: /tmp
Mounts:
/opt/rstudio-connect from rsc-volume (rw)
/opt/rstudio-connect/mnt/app from mount0 (rw,path="apps/3/199")
/opt/rstudio-connect/mnt/job from mount0 (rw,path="jobs/3/HPI1K4Jt57SZzxwV")
/opt/rstudio-connect/mnt/python-environments from mount0 (ro,path="python-environments")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kgvpr (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
mount0:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: rsc-pvc
ReadOnly: false
rsc-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-kgvpr:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: amd.com/gpu:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 2m48s cluster-autoscaler pod didn't trigger scale-up:
Warning FailedScheduling 15s (x3 over 2m50s) default-scheduler 0/4 nodes are available: 4 Insufficient amd.com/gpu.
That tells me a few things. The Limits, Requests and Tolerations sections suggest that the configuration included requirements for AMD nodes.
One thing that may be useful is the AMI we're using:
name: amazon-eks-gpu-node-1.23-v20230304
ID: ami-014452f3ad6f7d021
(it's a public image, so you should be able to access it)
The machine type we're using is g4dn.xlarge.
If you & Posit use a different machine type (for GPU) and/or a different AMI, let me know and we'll switch :)
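As an aside, a quick way to see which GPU resources the nodes actually advertise (relevant to the "Insufficient amd.com/gpu" scheduling error above) is something along these lines:

# list the extended GPU resources reported by each node
kubectl describe nodes | grep -E 'nvidia.com/gpu|amd.com/gpu'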
Thanks @bjfletcher, that's very helpful! I've been updating my team's environment to follow your setup; I'll post some more notes here soon.
I ran a similar setup @bjfletcher and always ended up with:
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
as my launcher container requests/limits - I haven't been able to reproduce the issue.
I've traced through the helm and lower-level generic templating code - I couldn't find any bugs there yet. What I suspect is that connect is somehow submitting a bad manifest to the launcher api, one that includes requests for both amd and nvidia gpus.
It is possible amd gpus are being specified elsewhere, outside our helm values, and not from /etc/rstudio-connect/launcher/launcher.kubernetes.profiles.conf. If you have the json/yaml configuration of that run-python-application-7xhdx-w44mj app, that might hold the key - I'm wondering if we're somehow getting an extra amd gpu requested at that layer.
Ey up @AlexMapley! When you said JSON/YAML configuration of the pod, did you mean the output from kubectl -n rsc describe pod run-python-application-7xhdx-w44mj? (The full output is in my last comment.) But I think you're after something different, right? Do you have the command you'd like me to run? Cheers, Ben
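For what it's worth, one way to dump the pod's full manifest (as opposed to the describe output) would be something like the following - whether that is what was being asked for here is an assumption:

# print the full pod spec and status as YAML
kubectl -n rsc get pod run-python-application-7xhdx-w44mj -o yaml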
I have tested this in the latest release and can confirm that this is a bug in the job-launcher. Using the following values in Connect's values file:
launcher:
  enabled: true
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1
And running any piece of content produces a pod with the following requests/limits:
Limits:
amd.com/gpu: 1
nvidia.com/gpu: 1
Requests:
amd.com/gpu: 1
nvidia.com/gpu: 1
The values in the config file are correct though:
root@connect-rstudio-connect-7fdbc68544-l7r4n:/opt/rstudio-connect# cat /etc/rstudio-connect/launcher/launcher.kubernetes.profiles.conf
[*]
default-nvidia-gpus=1
max-nvidia-gpus=1
@dbkegley I see the cause of this on the Launcher side. Do you want me to ship you a patch build to try?
Yes please, thanks @kfeinauer. I'll go ahead and confirm the fix, but it's unlikely this will make it into the 2023.05 release of Connect. I'll make a note that this is something we could address if there's a patch release for 2023.05.
I have just verified that this will be fixed in the next release of Connect, v2023.05.0.
The following config:
launcher:
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1
yields the expected requests and limits for content pods:
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
I can confirm this has indeed been fixed. Nice one @dbkegley & team 🎉
Ey up!
We're trying to get GPU to work with Posit Connect. This is of the Kubernetes flavour, with AWS's EKS.
So we've developed an app that uses CUDA (NVIDIA) code through PyTorch.
The machine type with the AWS EKS is g4dn, which has been set up with an AMI that supports both EKS and GPU. The Posit documentation isn't 100% clear, but my understanding is that the way to ask Posit Connect to use GPU is through this configuration in the Helm chart:
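(Presumably the same launcherKubernetesProfilesConf block quoted throughout this thread:)

launcher:
  launcherKubernetesProfilesConf:
    "*":
      default-nvidia-gpus: 1
      max-nvidia-gpus: 1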
However with the app deployment, we're seeing this error in the pod logs:
0/4 nodes are available: 4 Insufficient amd.com/gpu.
which perplexed me, because we've not asked for AMD but NVIDIA, so I checked the manifest for the pod, and saw this:
I "SSHed" into the pod and looked in
/etc/rstudio-connect/launcher/launcher.kubernetes.profiles.conf
:I tried explicitly asking that AMD be excluded with:
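(Presumably something along these lines - the exact block is an assumption, with the AMD settings zeroed:)

launcher:
  launcherKubernetesProfilesConf:
    "*":
      default-amd-gpus: 0   # assumption: explicitly zeroing the AMD GPU settings
      max-amd-gpus: 0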
however they still appeared with 1 and not 0:
I tested outside of Posit Connect by creating a pod manifest of our own:
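(For illustration, a minimal test manifest of that kind might look like the following - the name and image here are placeholders, not the manifest actually used:)

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                              # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: nvidia/cuda:11.8.0-base-ubuntu20.04      # placeholder CUDA image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1                           # request only an NVIDIA GPU, no amd.com/gpu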
and it worked great.
I tried to find where the manifest gets generated; however, it seemed to me that this particular piece wasn't open source, and therefore I was at a loss as to how to get this working. :( I wonder whether there is a bug somewhere in the manifest generator that accidentally produces the AMD config alongside the NVIDIA one?
All the best,
Ben