bcha opened this issue 10 months ago
It looks like the errors you're getting all seem to be from kubernetes? I can't see in the output anything specific to buildkit - there's no metrics that buildkit exposes that should be interfering with this kind of thing.
If this appeared during a kubernetes upgrade, it's likely to have been something to do with that, instead of an issue internal to buildkit?
Thanks, that might very well be. The curious thing is that metrics from all other applications and components except buildkit continue working just fine.
Ya, pod metrics are entirely handled by Kubernetes controllers. The whole point of pod metrics is that the services running on Kubernetes know nothing about them. As for why it's not working for you, the place to start would be to find out what controller you're using for collecting pod metrics (likely metrics-server) and then check the logs of that controller.
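A quick way to do that (a sketch; the label and namespace assume the standard metrics-server deployment in kube-system, so adjust them for your installation):

```shell
# Locate the metrics-server pods (the k8s-app label is what the upstream
# manifests use; your distribution may label it differently).
kubectl -n kube-system get pods -l k8s-app=metrics-server

# Tail the controller's logs and look for scrape errors mentioning the
# node or pod whose metrics are missing.
kubectl -n kube-system logs deploy/metrics-server --tail=100 | grep -i error
```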
I'm going to close this issue then I think, since it's confirmed not to be a buildkit-specific issue (thanks @nicks!).
@bcha if you find any more details that make it clear that it is actually a buildkit issue, then we can re-open :tada:
@jedevc Yeah so I spent some more time debugging this.
On Bottlerocket nodes, when I downgraded to buildkit 0.11.6 the metrics started working fine. It should be easily reproducible. The image tag is the only difference between these two examples:
buildkit v0.11.6 on bottlerocket:
➜ k top pod
NAME CPU(cores) MEMORY(bytes)
buildkit-helmeded-buildkit-service-7b6cdcddb5-mg5dm 3m 10Mi
buildkit v0.12.0 on bottlerocket:
➜ k top pod
error: Metrics not available for pod buildkit-helmeded/buildkit-helmeded-buildkit-service-5bdf4d9664-cpwxg, age: 3m48.164019s
On regular Amazon Linux nodes buildkit >=0.12.0 works fine, so this seems to be some combination of issues between buildkit, k8s >=1.26 and Bottlerocket security hardening.
I can't seem to find anything relevant in the buildkit logs.
Weiiird. Any chance you have a pod spec you could share?
I think it's worth re-opening then, since in your example it looks like you're just changing the buildkit version, and nothing else and then seeing the issue.
I wonder if this could be related to the cgroupsv2 related things we worked on for v0.12, specifically https://github.com/moby/buildkit/pull/4003 or https://github.com/moby/buildkit/pull/3860 (cc @tonistiigi @AkihiroSuda).
Yeah, tell me about it 😁 I suspected cgroupsv2 a bit myself earlier too, but it was just a hunch and I didn't look into it.
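Since the cgroupsv2 work is a suspect, one way to confirm which cgroup version a node is actually running (a general debugging sketch, not specific to buildkit) is to stat the cgroup mount from a shell on the node:

```shell
# Prints "cgroup2fs" on a cgroupsv2 (unified hierarchy) host, which is
# what Bottlerocket uses, and "tmpfs" on a legacy cgroupsv1 host.
stat -fc %T /sys/fs/cgroup
```

Comparing the output on the Bottlerocket and Amazon Linux nodes would show whether the failing nodes are the cgroupsv2 ones.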
Of course, here are pod specs:
v0.12.3:
apiVersion: v1
items:
- apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2023-11-13T10:21:55Z"
generateName: buildkit-helmeded-buildkit-service-886cb8656-
labels:
app.kubernetes.io/instance: buildkit-helmeded
app.kubernetes.io/name: buildkit-service
pod-template-hash: 886cb8656
name: buildkit-helmeded-buildkit-service-886cb8656-dlfr4
namespace: buildkit-helmeded
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: buildkit-helmeded-buildkit-service-886cb8656
uid: 5c31c42d-ad33-40d9-b0d8-678896b7e113
resourceVersion: "1227031494"
uid: 17d558a9-faab-4cc0-b5a6-5cf7ac55cd5f
spec:
containers:
- args:
- --addr
- unix:///run//buildkit/buildkitd.sock
- --addr
- tcp://0.0.0.0:1234
- --debug
image: moby/buildkit:v0.12.3
imagePullPolicy: IfNotPresent
livenessProbe:
exec:
command:
- buildctl
- debug
- workers
failureThreshold: 3
initialDelaySeconds: 5
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 1
name: buildkit-service
ports:
- containerPort: 1234
name: tcp
protocol: TCP
readinessProbe:
exec:
command:
- buildctl
- debug
- workers
failureThreshold: 3
initialDelaySeconds: 5
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 1
resources: {}
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-58mm8
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: ip-10-0-4-163.eu-north-1.compute.internal
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: kube-api-access-58mm8
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2023-11-13T10:21:55Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2023-11-13T10:22:25Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2023-11-13T10:22:25Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2023-11-13T10:21:55Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://be12e385be22615ce91b12565a34a8e1a663404c4c8efb35fb9de8421883758c
image: docker.io/moby/buildkit:v0.12.3
imageID: docker.io/moby/buildkit@sha256:d4187a7326f20d04fafd075f80ccc5d3f8cfd4f665c6e03d158a78e4f64bf3db
lastState: {}
name: buildkit-service
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2023-11-13T10:22:00Z"
hostIP: 10.0.4.163
phase: Running
podIP: 10.0.1.250
podIPs:
- ip: 10.0.1.250
qosClass: BestEffort
startTime: "2023-11-13T10:21:55Z"
kind: List
metadata:
resourceVersion: ""
v0.11.6:
apiVersion: v1
items:
- apiVersion: v1
kind: Pod
metadata:
creationTimestamp: "2023-11-13T11:15:45Z"
generateName: buildkit-helmeded-buildkit-service-5cdf6b4d78-
labels:
app.kubernetes.io/instance: buildkit-helmeded
app.kubernetes.io/name: buildkit-service
pod-template-hash: 5cdf6b4d78
name: buildkit-helmeded-buildkit-service-5cdf6b4d78-bfp5c
namespace: buildkit-helmeded
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: buildkit-helmeded-buildkit-service-5cdf6b4d78
uid: 6e72ec99-03d9-49bc-9c7b-aef59b1f8696
resourceVersion: "1227075325"
uid: caff026d-94d7-405a-8003-21d7192c39c5
spec:
containers:
- args:
- --addr
- unix:///run//buildkit/buildkitd.sock
- --addr
- tcp://0.0.0.0:1234
- --debug
image: moby/buildkit:v0.11.6
imagePullPolicy: IfNotPresent
livenessProbe:
exec:
command:
- buildctl
- debug
- workers
failureThreshold: 3
initialDelaySeconds: 5
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 1
name: buildkit-service
ports:
- containerPort: 1234
name: tcp
protocol: TCP
readinessProbe:
exec:
command:
- buildctl
- debug
- workers
failureThreshold: 3
initialDelaySeconds: 5
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 1
resources: {}
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-cdd5v
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: ip-10-0-16-4.eu-north-1.compute.internal
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: kube-api-access-cdd5v
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2023-11-13T11:15:45Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2023-11-13T11:16:16Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2023-11-13T11:16:16Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2023-11-13T11:15:45Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://89409881904dbc75ebaaa7c03519f30ec3f214c2620c4f7e3aaadfc072b602af
image: docker.io/moby/buildkit:v0.11.6
imageID: docker.io/moby/buildkit@sha256:d6fa89830c26919acba23c5cafa09df0c3ec1fbde20bb2a15ff349e0795241f4
lastState: {}
name: buildkit-service
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2023-11-13T11:15:51Z"
hostIP: 10.0.16.4
phase: Running
podIP: 10.0.28.251
podIPs:
- ip: 10.0.28.251
qosClass: BestEffort
startTime: "2023-11-13T11:15:45Z"
kind: List
metadata:
resourceVersion: ""
I have exactly the same problem with buildkit v0.12.3 and k8s v1.27.8. All other namespaces work fine, but buildkit has no pod metrics.
apiVersion: v1
kind: Pod
metadata:
generateName: buildkit-amd64-57fcbc8c94-
labels:
app: buildkitd
pod-template-hash: 57fcbc8c94
name: buildkit-amd64-57fcbc8c94-2m9ch
namespace: buildkit
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: buildkit-amd64-57fcbc8c94
uid: b6172e11-590c-4839-a7fb-eca0d708064b
resourceVersion: "56285397"
uid: 2c42dad3-31fa-45c8-a196-18bf552d604b
spec:
containers:
- args:
- --addr
- unix:///run/buildkit/buildkitd.sock
- --addr
- tcp://0.0.0.0:1234
image: docker.io/moby/buildkit:buildx-stable-1@sha256:d4187a7326f20d04fafd075f80ccc5d3f8cfd4f665c6e03d158a78e4f64bf3db
imagePullPolicy: IfNotPresent
livenessProbe:
exec:
command:
- buildctl
- debug
- workers
failureThreshold: 3
initialDelaySeconds: 5
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 1
name: buildkitd
ports:
- containerPort: 1234
protocol: TCP
readinessProbe:
exec:
command:
- buildctl
- debug
- workers
failureThreshold: 3
initialDelaySeconds: 5
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 1
resources:
requests:
cpu: "6"
memory: 14Gi
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/lib/buildkit
name: buildkit
- mountPath: /etc/buildkit
name: config
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-58vlt
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: nodes-fsn1-6a1f21b83b0e8a35
preemptionPolicy: Never
priority: 1
priorityClassName: normal
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- configMap:
defaultMode: 420
name: buildkit-amd64
name: config
- emptyDir: {}
name: buildkit
- name: kube-api-access-58vlt
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
Update: applying the status/needs-investigation tag until the exact bug is identified.
Hi, hopefully this gets some attention; it's still happening with buildkit v0.13.2 and k8s v1.29.3.
We ended up pinning the version to v0.11.6. Now I've re-checked this, as that pinned version is getting pretty old and has a bunch of vulnerabilities. Upgraded buildkit to the latest v0.14.1. Still getting the same metrics issue. Nowadays running k8s v1.30.
Any update on this?
Same issue here
Same issue here
Solved with v0.11.2 😢
Same issue. All pods show metrics just fine, but buildkit pods show no metrics. Not sure if it's useful, but here are the relevant metrics from a pod on the same node as buildkit. Notice there's a metric with an "image:" label present here.
Same query for the buildkit pod. Notice the glaring absence of a metric with the moby/buildkit:0.15.2 image label. Only the pause image is present.
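One way to see what the kubelet itself reports for that node (a sketch; the node name is the example one from the pod spec earlier in the thread, substitute your own) is to query its Summary API directly and check whether the buildkit container's stats are missing there as well:

```shell
# Example node name taken from the pod spec above; replace with the node
# actually hosting the buildkit pod.
NODE="ip-10-0-4-163.eu-north-1.compute.internal"

# The Summary API is what metrics-server scrapes; if the buildkit
# container is absent here, the problem is below metrics-server.
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/stats/summary" \
  | grep -A2 buildkit
```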
Did some more investigation.
k8s version: EKS 1.29
OS version: AL2023 (latest)
arch: AMD64
Using this basic example: https://github.com/moby/buildkit/blob/master/examples/kubernetes/pod.privileged.yaml
Original example:
apiVersion: v1
kind: Pod
metadata:
name: buildkitd
spec:
containers:
- name: buildkitd
image: moby/buildkit:master
readinessProbe:
exec:
command:
- buildctl
- debug
- workers
initialDelaySeconds: 5
periodSeconds: 30
livenessProbe:
exec:
command:
- buildctl
- debug
- workers
initialDelaySeconds: 5
periodSeconds: 30
securityContext:
privileged: true
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ kubectl get pods
NAME READY STATUS RESTARTS AGE
buildkitd 1/1 Running 0 8m11s
buildkitd-011 1/1 Running 0 2m37s
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ kubectl top pods buildkitd
Error from server (NotFound): podmetrics.metrics.k8s.io "default/buildkitd" not found
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$
Slightly adjusted example (the only thing changed is the buildkit version):
apiVersion: v1
kind: Pod
metadata:
name: buildkitd-011
spec:
containers:
- name: buildkitd
image: moby/buildkit:v0.11.6
readinessProbe:
exec:
command:
- buildctl
- debug
- workers
initialDelaySeconds: 5
periodSeconds: 30
livenessProbe:
exec:
command:
- buildctl
- debug
- workers
initialDelaySeconds: 5
periodSeconds: 30
securityContext:
privileged: true
What do you know, it works!
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ kubectl top pods buildkitd-011
NAME CPU(cores) MEMORY(bytes)
buildkitd-011 3m 8Mi
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$
But that's not all. Get this. Rootless works just fine https://github.com/moby/buildkit/blob/master/examples/kubernetes/pod.rootless.yaml
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ kubectl top pods
NAME CPU(cores) MEMORY(bytes)
buildkitd-011 3m 8Mi
buildkitd-rootless 5m 10Mi
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$
But that's not all. Get this. As soon as securityContext is removed from the pod.privileged.yaml it works as well:
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ diff pod.privileged.yaml pod-test.yaml
4c4,5
< name: buildkitd
---
> name: buildkitd-test
>
7a9
>
8a11
>
25,26d27
< securityContext:
< privileged: true
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ kubectl top pods
NAME CPU(cores) MEMORY(bytes)
buildkitd-rootless 2m 13Mi
buildkitd-test 0m 8Mi
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$
Why removing the privileged securityContext makes it work, I have no idea. Other pods with the same setting seem to work just fine:
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ kubectl get pods -n kube-system -o yaml ebs-csi-node-45hwq | grep privil -B2
memory: 40Mi
securityContext:
privileged: true
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ kubectl top pods -n kube-system ebs-csi-node-45hwq
NAME CPU(cores) MEMORY(bytes)
ebs-csi-node-45hwq 1m 24Mi
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$
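For completeness, the PodMetrics object can also be fetched straight from the metrics API (a sketch; the pod name and default namespace match the buildkitd example above), which shows whether metrics-server has recorded any containers for the pod at all:

```shell
# Fetches the raw PodMetrics resource that `kubectl top` formats; a
# NotFound error here means metrics-server never produced the object.
kubectl get --raw \
  "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/buildkitd"
```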
Last working version was indeed v0.11.6. On v0.12.0-rc1 the metrics are not showing.
@jedevc bump on this. I hope the supplied information is enough to get this issue going? It's less than ideal, because we cannot see the CPU and memory usage of our builder pods to fine-tune our spend. We ended up provisioning giant buildkit pods so the builders have enough CPU/RAM.
We're having some issues with buildkit pod metrics, roughly ever since the k8s 1.25 -> 1.26 upgrade (though I'm not 100% sure this isn't just a coincidence). Basically, both our Datadog agents and metrics-server are having trouble getting pod metrics from buildkit (running v0.12.3).
Output from kubectl top pod:
Not quite sure how to continue debugging this, to be honest. Every other pod seems to output metrics just fine, even ones on the same nodes as buildkit, so I doubt it's any kind of security group issue.
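One further debugging step (a sketch; substitute the node name hosting your buildkit pod) is to pull the kubelet's cAdvisor endpoint for that node and grep for the buildkit container, which shows whether cAdvisor is tracking its cgroup at all:

```shell
# Replace with a node that runs a buildkit pod.
NODE="ip-10-0-4-163.eu-north-1.compute.internal"

# cAdvisor feeds the kubelet's container stats; if no
# container_memory_working_set_bytes series exists for the buildkit
# container, cAdvisor itself is not seeing its cgroup.
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics/cadvisor" \
  | grep container_memory_working_set_bytes \
  | grep buildkit
```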