moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit
https://github.com/moby/moby/issues/34227
Apache License 2.0

podmetrics.metrics.k8s.io from buildkit not found #4407

Open bcha opened 10 months ago

bcha commented 10 months ago

We're having some issues with buildkit pod metrics, roughly ever since our k8s 1.25 -> 1.26 upgrade (though I'm not 100% sure whether that is just a coincidence). Basically, both our Datadog agents and metrics-server are having trouble getting pod metrics from buildkit (running v0.12.3).

Output from kubectl top pod:

➜ k top pod buildkit-deployment-575959cf77-rb94w --v=10 -n buildkit
I1108 12:19:10.699666   12557 round_trippers.go:466] curl -v -XGET  -H "Accept: application/json;g=apidiscovery.k8s.io;v=v2beta1;as=APIGroupDiscoveryList,application/json" -H "User-Agent: kubectl/v1.28.3 (darwin/arm64) kubernetes/a8a1abc" 'https://6F753C08CB5B073408D87E9B6A225BB4.yl4.eu-north-1.eks.amazonaws.com/api'
I1108 12:19:11.455471   12557 round_trippers.go:495] HTTP Trace: DNS Lookup for 6F753C08CB5B073408D87E9B6A225BB4.yl4.eu-north-1.eks.amazonaws.com resolved to [{13.48.241.241 } {13.48.231.68 }]
I1108 12:19:11.467312   12557 round_trippers.go:510] HTTP Trace: Dial to tcp:13.48.241.241:443 succeed
I1108 12:19:11.531419   12557 round_trippers.go:553] GET https://6F753C08CB5B073408D87E9B6A225BB4.yl4.eu-north-1.eks.amazonaws.com/api 200 OK in 831 milliseconds
I1108 12:19:11.531434   12557 round_trippers.go:570] HTTP Statistics: DNSLookup 3 ms Dial 11 ms TLSHandshake 27 ms ServerProcessing 36 ms Duration 831 ms
I1108 12:19:11.531439   12557 round_trippers.go:577] Response Headers:
I1108 12:19:11.531444   12557 round_trippers.go:580]     Cache-Control: no-cache, private
I1108 12:19:11.531448   12557 round_trippers.go:580]     Content-Type: application/json
I1108 12:19:11.531453   12557 round_trippers.go:580]     X-Kubernetes-Pf-Flowschema-Uid: dbb9ff33-f0ad-4827-ae96-c2bbc640e12b
I1108 12:19:11.531456   12557 round_trippers.go:580]     X-Kubernetes-Pf-Prioritylevel-Uid: e2207259-ae05-4d7f-9139-813d664e3a84
I1108 12:19:11.531460   12557 round_trippers.go:580]     Content-Length: 167
I1108 12:19:11.531465   12557 round_trippers.go:580]     Date: Wed, 08 Nov 2023 10:19:11 GMT
I1108 12:19:11.531468   12557 round_trippers.go:580]     Audit-Id: fc4066db-894a-4c08-91b0-ebb3ab3b668a
I1108 12:19:11.531483   12557 request.go:1212] Response Body: {"kind":"APIVersions","versions":["v1"],"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"ip-172-16-110-156.eu-north-1.compute.internal:443"}]}
I1108 12:19:11.531668   12557 round_trippers.go:466] curl -v -XGET  -H "Accept: application/json;g=apidiscovery.k8s.io;v=v2beta1;as=APIGroupDiscoveryList,application/json" -H "User-Agent: kubectl/v1.28.3 (darwin/arm64) kubernetes/a8a1abc" 'https://6F753C08CB5B073408D87E9B6A225BB4.yl4.eu-north-1.eks.amazonaws.com/apis'
I1108 12:19:11.572713   12557 round_trippers.go:553] GET https://6F753C08CB5B073408D87E9B6A225BB4.yl4.eu-north-1.eks.amazonaws.com/apis 200 OK in 40 milliseconds
I1108 12:19:11.572752   12557 round_trippers.go:570] HTTP Statistics: GetConnection 0 ms ServerProcessing 40 ms Duration 40 ms
I1108 12:19:11.572757   12557 round_trippers.go:577] Response Headers:
I1108 12:19:11.572763   12557 round_trippers.go:580]     Audit-Id: 37e7d8c3-3b56-4014-ba00-ec2fd98b77a7
I1108 12:19:11.572768   12557 round_trippers.go:580]     Cache-Control: no-cache, private
I1108 12:19:11.572773   12557 round_trippers.go:580]     Content-Type: application/json
I1108 12:19:11.572777   12557 round_trippers.go:580]     X-Kubernetes-Pf-Flowschema-Uid: dbb9ff33-f0ad-4827-ae96-c2bbc640e12b
I1108 12:19:11.572781   12557 round_trippers.go:580]     X-Kubernetes-Pf-Prioritylevel-Uid: e2207259-ae05-4d7f-9139-813d664e3a84
I1108 12:19:11.572785   12557 round_trippers.go:580]     Date: Wed, 08 Nov 2023 10:19:11 GMT
I1108 12:19:11.572922   12557 request.go:1212] Response Body: {"kind":"APIGroupList","apiVersion":"v1","groups":[{"name":"apiregistration.k8s.io","versions":[{"groupVersion":"apiregistration.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"apiregistration.k8s.io/v1","version":"v1"}},{"name":"apps","versions":[{"groupVersion":"apps/v1","version":"v1"}],"preferredVersion":{"groupVersion":"apps/v1","version":"v1"}},{"name":"events.k8s.io","versions":[{"groupVersion":"events.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"events.k8s.io/v1","version":"v1"}},{"name":"authentication.k8s.io","versions":[{"groupVersion":"authentication.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"authentication.k8s.io/v1","version":"v1"}},{"name":"authorization.k8s.io","versions":[{"groupVersion":"authorization.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"authorization.k8s.io/v1","version":"v1"}},{"name":"autoscaling","versions":[{"groupVersion":"autoscaling/v2","version":"v2"},{"groupVersion":"autoscaling/v1","version":"v1"}],"preferredVersion":{"groupVersion":"autoscaling/v2","version":"v2"}},{"name":"batch","versions":[{"groupVersion":"batch/v1","version":"v1"}],"preferredVersion":{"groupVersion":"batch/v1","version":"v1"}},{"name":"certificates.k8s.io","versions":[{"groupVersion":"certificates.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"certificates.k8s.io/v1","version":"v1"}},{"name":"networking.k8s.io","versions":[{"groupVersion":"networking.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"networking.k8s.io/v1","version":"v1"}},{"name":"policy","versions":[{"groupVersion":"policy/v1","version":"v1"}],"preferredVersion":{"groupVersion":"policy/v1","version":"v1"}},{"name":"rbac.authorization.k8s.io","versions":[{"groupVersion":"rbac.authorization.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"rbac.authorization.k8s.io/v1","version":"v1"}},{"name":"storage.k8s.io","ver
sions":[{"groupVersion":"storage.k8s.io/v1","version":"v1"},{"groupVersion":"storage.k8s.io/v1beta1","version":"v1beta1"}],"preferredVersion":{"groupVersion":"storage.k8s.io/v1","version":"v1"}},{"name":"admissionregistration.k8s.io","versions":[{"groupVersion":"admissionregistration.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"admissionregistration.k8s.io/v1","version":"v1"}},{"name":"apiextensions.k8s.io","versions":[{"groupVersion":"apiextensions.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"apiextensions.k8s.io/v1","version":"v1"}},{"name":"scheduling.k8s.io","versions":[{"groupVersion":"scheduling.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"scheduling.k8s.io/v1","version":"v1"}},{"name":"coordination.k8s.io","versions":[{"groupVersion":"coordination.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"coordination.k8s.io/v1","version":"v1"}},{"name":"node.k8s.io","versions":[{"groupVersion":"node.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"node.k8s.io/v1","version":"v1"}},{"name":"discovery.k8s.io","versions":[{"groupVersion":"discovery.k8s.io/v1","version":"v1"}],"preferredVersion":{"groupVersion":"discovery.k8s.io/v1","version":"v1"}},{"name":"flowcontrol.apiserver.k8s.io","versions":[{"groupVersion":"flowcontrol.apiserver.k8s.io/v1beta3","version":"v1beta3"},{"groupVersion":"flowcontrol.apiserver.k8s.io/v1beta2","version":"v1beta2"}],"preferredVersion":{"groupVersion":"flowcontrol.apiserver.k8s.io/v1beta3","version":"v1beta3"}},{"name":"getambassador.io","versions":[{"groupVersion":"getambassador.io/v2","version":"v2"},{"groupVersion":"getambassador.io/v1","version":"v1"},{"groupVersion":"getambassador.io/v1beta2","version":"v1beta2"},{"groupVersion":"getambassador.io/v1beta1","version":"v1beta1"},{"groupVersion":"getambassador.io/v3alpha1","version":"v3alpha1"}],"preferredVersion":{"groupVersion":"getambassador.io/v2","version":"v2"}},{"name":"kyverno.io","versions":[{"g
roupVersion":"kyverno.io/v1","version":"v1"},{"groupVersion":"kyverno.io/v2beta1","version":"v2beta1"},{"groupVersion":"kyverno.io/v1beta1","version":"v1beta1"},{"groupVersion":"kyverno.io/v2alpha1","version":"v2alpha1"},{"groupVersion":"kyverno.io/v1alpha2","version":"v1alpha2"}],"preferredVersion":{"groupVersion":"kyverno.io/v1","version":"v1"}},{"name":"argoproj.io","versions":[{"groupVersion":"argoproj.io/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"argoproj.io/v1alpha1","version":"v1alpha1"}},{"name":"crd.k8s.amazonaws.com","versions":[{"groupVersion":"crd.k8s.amazonaws.com/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"crd.k8s.amazonaws.com/v1alpha1","version":"v1alpha1"}},{"name":"datadoghq.com","versions":[{"groupVersion":"datadoghq.com/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"datadoghq.com/v1alpha1","version":"v1alpha1"}},{"name":"dynatrace.com","versions":[{"groupVersion":"dynatrace.com/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"dynatrace.com/v1alpha1","version":"v1alpha1"}},{"name":"external-secrets.io","versions":[{"groupVersion":"external-secrets.io/v1beta1","version":"v1beta1"},{"groupVersion":"external-secrets.io/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"external-secrets.io/v1beta1","version":"v1beta1"}},{"name":"generators.external-secrets.io","versions":[{"groupVersion":"generators.external-secrets.io/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"generators.external-secrets.io/v1alpha1","version":"v1alpha1"}},{"name":"karpenter.k8s.aws","versions":[{"groupVersion":"karpenter.k8s.aws/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"karpenter.k8s.aws/v1alpha1","version":"v1alpha1"}},{"name":"networking.k8s.aws","versions":[{"groupVersion":"networking.k8s.aws/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"networking.k8s.aws/v1alpha1","version":"v1alpha1"}},{"name"
:"traefik.containo.us","versions":[{"groupVersion":"traefik.containo.us/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"traefik.containo.us/v1alpha1","version":"v1alpha1"}},{"name":"traefik.io","versions":[{"groupVersion":"traefik.io/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"traefik.io/v1alpha1","version":"v1alpha1"}},{"name":"vpcresources.k8s.aws","versions":[{"groupVersion":"vpcresources.k8s.aws/v1beta1","version":"v1beta1"},{"groupVersion":"vpcresources.k8s.aws/v1alpha1","version":"v1alpha1"}],"preferredVersion":{"groupVersion":"vpcresources.k8s.aws/v1beta1","version":"v1beta1"}},{"name":"wgpolicyk8s.io","versions":[{"groupVersion":"wgpolicyk8s.io/v1alpha2","version":"v1alpha2"}],"preferredVersion":{"groupVersion":"wgpolicyk8s.io/v1alpha2","version":"v1alpha2"}},{"name":"karpenter.sh","versions":[{"groupVersion":"karpenter.sh/v1alpha5","version":"v1alpha5"}],"preferredVersion":{"groupVersion":"karpenter.sh/v1alpha5","version":"v1alpha5"}},{"name":"rbacmanager.reactiveops.io","versions":[{"groupVersion":"rbacmanager.reactiveops.io/v1beta1","version":"v1beta1"}],"preferredVersion":{"groupVersion":"rbacmanager.reactiveops.io/v1beta1","version":"v1beta1"}},{"name":"external.metrics.k8s.io","versions":[{"groupVersion":"external.metrics.k8s.io/v1beta1","version":"v1beta1"}],"preferredVersion":{"groupVersion":"external.metrics.k8s.io/v1beta1","version":"v1beta1"}},{"name":"metrics.k8s.io","versions":[{"groupVersion":"metrics.k8s.io/v1beta1","version":"v1beta1"}],"preferredVersion":{"groupVersion":"metrics.k8s.io/v1beta1","version":"v1beta1"}}]}
I1108 12:19:11.573282   12557 round_trippers.go:466] curl -v -XGET  -H "Accept: application/vnd.kubernetes.protobuf, */*" -H "User-Agent: kubectl/v1.28.3 (darwin/arm64) kubernetes/a8a1abc" 'https://6F753C08CB5B073408D87E9B6A225BB4.yl4.eu-north-1.eks.amazonaws.com/apis/metrics.k8s.io/v1beta1/namespaces/buildkit/pods/buildkit-deployment-575959cf77-rb94w'
I1108 12:19:11.639674   12557 round_trippers.go:553] GET https://6F753C08CB5B073408D87E9B6A225BB4.yl4.eu-north-1.eks.amazonaws.com/apis/metrics.k8s.io/v1beta1/namespaces/buildkit/pods/buildkit-deployment-575959cf77-rb94w 404 Not Found in 66 milliseconds
I1108 12:19:11.639690   12557 round_trippers.go:570] HTTP Statistics: GetConnection 0 ms ServerProcessing 66 ms Duration 66 ms
I1108 12:19:11.639695   12557 round_trippers.go:577] Response Headers:
I1108 12:19:11.639702   12557 round_trippers.go:580]     Date: Wed, 08 Nov 2023 10:19:11 GMT
I1108 12:19:11.639708   12557 round_trippers.go:580]     X-Kubernetes-Pf-Flowschema-Uid: dbb9ff33-f0ad-4827-ae96-c2bbc640e12b
I1108 12:19:11.639714   12557 round_trippers.go:580]     Content-Type: application/vnd.kubernetes.protobuf
I1108 12:19:11.639719   12557 round_trippers.go:580]     Cache-Control: no-cache, private
I1108 12:19:11.639724   12557 round_trippers.go:580]     Cache-Control: no-cache, private
I1108 12:19:11.639729   12557 round_trippers.go:580]     X-Kubernetes-Pf-Prioritylevel-Uid: e2207259-ae05-4d7f-9139-813d664e3a84
I1108 12:19:11.639734   12557 round_trippers.go:580]     Content-Length: 221
I1108 12:19:11.639740   12557 round_trippers.go:580]     Audit-Id: 20c5f75a-c173-4bcd-8547-9e964422f0f6
I1108 12:19:11.639745   12557 round_trippers.go:580]     Audit-Id: 20c5f75a-c173-4bcd-8547-9e964422f0f6
I1108 12:19:11.639773   12557 request.go:1210] Response Body:
00000000  6b 38 73 00 0a 0c 0a 02  76 31 12 06 53 74 61 74  |k8s.....v1..Stat|
00000010  75 73 12 c4 01 0a 06 0a  00 12 00 1a 00 12 07 46  |us.............F|
00000020  61 69 6c 75 72 65 1a 53  70 6f 64 6d 65 74 72 69  |ailure.Spodmetri|
00000030  63 73 2e 6d 65 74 72 69  63 73 2e 6b 38 73 2e 69  |cs.metrics.k8s.i|
00000040  6f 20 22 62 75 69 6c 64  6b 69 74 2f 62 75 69 6c  |o "buildkit/buil|
00000050  64 6b 69 74 2d 64 65 70  6c 6f 79 6d 65 6e 74 2d  |dkit-deployment-|
00000060  35 37 35 39 35 39 63 66  37 37 2d 72 62 39 34 77  |575959cf77-rb94w|
00000070  22 20 6e 6f 74 20 66 6f  75 6e 64 22 08 4e 6f 74  |" not found".Not|
00000080  46 6f 75 6e 64 2a 4f 0a  2d 62 75 69 6c 64 6b 69  |Found*O.-buildki|
00000090  74 2f 62 75 69 6c 64 6b  69 74 2d 64 65 70 6c 6f  |t/buildkit-deplo|
000000a0  79 6d 65 6e 74 2d 35 37  35 39 35 39 63 66 37 37  |yment-575959cf77|
000000b0  2d 72 62 39 34 77 12 0e  6d 65 74 72 69 63 73 2e  |-rb94w..metrics.|
000000c0  6b 38 73 2e 69 6f 1a 0a  70 6f 64 6d 65 74 72 69  |k8s.io..podmetri|
000000d0  63 73 28 00 32 00 30 94  03 1a 00 22 00           |cs(.2.0....".|
I1108 12:19:11.639993   12557 helpers.go:246] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "podmetrics.metrics.k8s.io \"buildkit/buildkit-deployment-575959cf77-rb94w\" not found",
  "reason": "NotFound",
  "details": {
    "name": "buildkit/buildkit-deployment-575959cf77-rb94w",
    "group": "metrics.k8s.io",
    "kind": "podmetrics"
  },
  "code": 404
}]
Error from server (NotFound): podmetrics.metrics.k8s.io "buildkit/buildkit-deployment-575959cf77-rb94w" not found

Not quite sure how to continue debugging this tbh. Every other pod seems to output metrics just fine, even ones on the same nodes as buildkit, so I doubt it's any kind of security group issue etc.
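
For anyone reproducing this: the request that `kubectl top` ends up making can also be sent directly, which makes the 404 easier to poke at without the verbose client logging. A minimal sketch; the helper name `podmetrics_path` is made up for illustration, and the path layout is simply the one visible in the trace above.

```shell
# Hypothetical helper: build the metrics.k8s.io path that `kubectl top pod` queries.
# The path layout is copied from the GET request in the trace above.
podmetrics_path() {
  ns="$1"
  pod="$2"
  printf '/apis/metrics.k8s.io/v1beta1/namespaces/%s/pods/%s\n' "$ns" "$pod"
}
```

Usage would look like `kubectl get --raw "$(podmetrics_path buildkit buildkit-deployment-575959cf77-rb94w)"`. For a pod that has metrics this should return a PodMetrics object; for the affected pod it should presumably return the same NotFound status as above.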

jedevc commented 10 months ago

The errors you're getting all seem to come from Kubernetes? I can't see anything in the output that is specific to buildkit: there are no metrics that buildkit exposes that should interfere with this kind of thing.

If this appeared during a kubernetes upgrade, it's likely to have been something to do with that, instead of an issue internal to buildkit?

bcha commented 10 months ago

Thanks, that might very well be. The curious thing is that metrics from all other applications and components except buildkit continue working just fine.

nicks commented 10 months ago

ya, podmetrics are entirely handled by kubernetes controllers. the whole point of podmetrics is that the services running on kubernetes know nothing about them. as for why it's not working for you, the place to start would be to find out what controller you're using for collecting podmetrics (likely metrics-server) and then checking the logs of that controller
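
A rough sketch of that log check, assuming the collector is metrics-server running as a deployment named `metrics-server` in `kube-system` (a common default, but not confirmed anywhere in this thread). The function name `metrics_server_log_lines` is hypothetical.

```shell
# KUBECTL can be pointed at a wrapper; it defaults to the real CLI.
KUBECTL="${KUBECTL:-kubectl}"

# Print recent metrics-server log lines that mention the given pod name.
# Assumption: metrics-server runs as deployment "metrics-server" in kube-system.
metrics_server_log_lines() {
  pod="$1"
  "$KUBECTL" -n kube-system logs deployment/metrics-server --tail=1000 \
    | grep -F -- "$pod"
}
```

For example, `metrics_server_log_lines buildkit-deployment-575959cf77-rb94w`. No output only means the last 1000 lines did not mention the pod; it may be necessary to raise the metrics-server log verbosity before anything useful shows up.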

jedevc commented 10 months ago

I'm going to close this issue then I think, since it's confirmed not to be a buildkit-specific issue (thanks @nicks!).

@bcha if you find any more details that make it clear that it is actually a buildkit issue, then we can re-open :tada:

bcha commented 10 months ago

@jedevc Yeah so I spent some more time debugging this.

On Bottlerocket nodes, when I downgraded to buildkit 0.11.6 the metrics started working fine. This should be easily reproducible. The image tag is the only difference between these two examples:

buildkit v0.11.6 on bottlerocket:

➜ k top pod
NAME                                                  CPU(cores)   MEMORY(bytes)
buildkit-helmeded-buildkit-service-7b6cdcddb5-mg5dm   3m           10Mi

buildkit v0.12.0 on bottlerocket:

➜ k top pod
error: Metrics not available for pod buildkit-helmeded/buildkit-helmeded-buildkit-service-5bdf4d9664-cpwxg, age: 3m48.164019s

On regular Amazon Linux nodes buildkit >=0.12.0 works fine, so this seems to be some combination of issues between buildkit, k8s >=1.26 and Bottlerocket's security hardening.

I can't seem to find anything relevant in the buildkit logs.

jedevc commented 10 months ago

Weiiird. Any chance you have a pod spec you could share?

I think it's worth re-opening then, since in your example it looks like you're just changing the buildkit version, and nothing else and then seeing the issue.

I wonder if this could be related to the cgroupsv2 related things we worked on for v0.12, specifically https://github.com/moby/buildkit/pull/4003 or https://github.com/moby/buildkit/pull/3860 (cc @tonistiigi @AkihiroSuda).
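
One way to probe the cgroup hunch might be to compare which cgroup buildkitd ends up in under each version. A minimal sketch, assuming a cgroup v2 node (where `/proc/<pid>/cgroup` contains a single `0::<path>` entry) and assuming buildkitd is PID 1 in the container. The helper name `cgroup2_path` is hypothetical.

```shell
# Hypothetical helper: extract the cgroup v2 path from /proc/<pid>/cgroup content
# read on stdin. On cgroup v2 the entry has the form "0::<path>";
# cgroup v1 lines ("<id>:<controllers>:<path>") are ignored.
cgroup2_path() {
  sed -n 's/^0:://p'
}
```

For example, run `kubectl exec <pod> -- cat /proc/1/cgroup | cgroup2_path` against a v0.11.6 pod and a v0.12.x pod. If the two print different paths, that would at least be consistent with the cgroup changes being involved; it would not prove it.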

bcha commented 10 months ago

Yeah, tell me about it 😁 I suspected cgroupsv2 a bit myself earlier, but it was just a hunch and I didn't look into it.

Of course, here are pod specs:

v0.12.3:

apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    creationTimestamp: "2023-11-13T10:21:55Z"
    generateName: buildkit-helmeded-buildkit-service-886cb8656-
    labels:
      app.kubernetes.io/instance: buildkit-helmeded
      app.kubernetes.io/name: buildkit-service
      pod-template-hash: 886cb8656
    name: buildkit-helmeded-buildkit-service-886cb8656-dlfr4
    namespace: buildkit-helmeded
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicaSet
      name: buildkit-helmeded-buildkit-service-886cb8656
      uid: 5c31c42d-ad33-40d9-b0d8-678896b7e113
    resourceVersion: "1227031494"
    uid: 17d558a9-faab-4cc0-b5a6-5cf7ac55cd5f
  spec:
    containers:
    - args:
      - --addr
      - unix:///run//buildkit/buildkitd.sock
      - --addr
      - tcp://0.0.0.0:1234
      - --debug
      image: moby/buildkit:v0.12.3
      imagePullPolicy: IfNotPresent
      livenessProbe:
        exec:
          command:
          - buildctl
          - debug
          - workers
        failureThreshold: 3
        initialDelaySeconds: 5
        periodSeconds: 30
        successThreshold: 1
        timeoutSeconds: 1
      name: buildkit-service
      ports:
      - containerPort: 1234
        name: tcp
        protocol: TCP
      readinessProbe:
        exec:
          command:
          - buildctl
          - debug
          - workers
        failureThreshold: 3
        initialDelaySeconds: 5
        periodSeconds: 30
        successThreshold: 1
        timeoutSeconds: 1
      resources: {}
      securityContext:
        privileged: true
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-58mm8
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeName: ip-10-0-4-163.eu-north-1.compute.internal
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: kube-api-access-58mm8
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2023-11-13T10:21:55Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2023-11-13T10:22:25Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2023-11-13T10:22:25Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2023-11-13T10:21:55Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: containerd://be12e385be22615ce91b12565a34a8e1a663404c4c8efb35fb9de8421883758c
      image: docker.io/moby/buildkit:v0.12.3
      imageID: docker.io/moby/buildkit@sha256:d4187a7326f20d04fafd075f80ccc5d3f8cfd4f665c6e03d158a78e4f64bf3db
      lastState: {}
      name: buildkit-service
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2023-11-13T10:22:00Z"
    hostIP: 10.0.4.163
    phase: Running
    podIP: 10.0.1.250
    podIPs:
    - ip: 10.0.1.250
    qosClass: BestEffort
    startTime: "2023-11-13T10:21:55Z"
kind: List
metadata:
  resourceVersion: ""

v0.11.6:

apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    creationTimestamp: "2023-11-13T11:15:45Z"
    generateName: buildkit-helmeded-buildkit-service-5cdf6b4d78-
    labels:
      app.kubernetes.io/instance: buildkit-helmeded
      app.kubernetes.io/name: buildkit-service
      pod-template-hash: 5cdf6b4d78
    name: buildkit-helmeded-buildkit-service-5cdf6b4d78-bfp5c
    namespace: buildkit-helmeded
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicaSet
      name: buildkit-helmeded-buildkit-service-5cdf6b4d78
      uid: 6e72ec99-03d9-49bc-9c7b-aef59b1f8696
    resourceVersion: "1227075325"
    uid: caff026d-94d7-405a-8003-21d7192c39c5
  spec:
    containers:
    - args:
      - --addr
      - unix:///run//buildkit/buildkitd.sock
      - --addr
      - tcp://0.0.0.0:1234
      - --debug
      image: moby/buildkit:v0.11.6
      imagePullPolicy: IfNotPresent
      livenessProbe:
        exec:
          command:
          - buildctl
          - debug
          - workers
        failureThreshold: 3
        initialDelaySeconds: 5
        periodSeconds: 30
        successThreshold: 1
        timeoutSeconds: 1
      name: buildkit-service
      ports:
      - containerPort: 1234
        name: tcp
        protocol: TCP
      readinessProbe:
        exec:
          command:
          - buildctl
          - debug
          - workers
        failureThreshold: 3
        initialDelaySeconds: 5
        periodSeconds: 30
        successThreshold: 1
        timeoutSeconds: 1
      resources: {}
      securityContext:
        privileged: true
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: kube-api-access-cdd5v
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    nodeName: ip-10-0-16-4.eu-north-1.compute.internal
    preemptionPolicy: PreemptLowerPriority
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: kube-api-access-cdd5v
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            expirationSeconds: 3607
            path: token
        - configMap:
            items:
            - key: ca.crt
              path: ca.crt
            name: kube-root-ca.crt
        - downwardAPI:
            items:
            - fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
              path: namespace
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2023-11-13T11:15:45Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2023-11-13T11:16:16Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2023-11-13T11:16:16Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2023-11-13T11:15:45Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: containerd://89409881904dbc75ebaaa7c03519f30ec3f214c2620c4f7e3aaadfc072b602af
      image: docker.io/moby/buildkit:v0.11.6
      imageID: docker.io/moby/buildkit@sha256:d6fa89830c26919acba23c5cafa09df0c3ec1fbde20bb2a15ff349e0795241f4
      lastState: {}
      name: buildkit-service
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2023-11-13T11:15:51Z"
    hostIP: 10.0.16.4
    phase: Running
    podIP: 10.0.28.251
    podIPs:
    - ip: 10.0.28.251
    qosClass: BestEffort
    startTime: "2023-11-13T11:15:45Z"
kind: List
metadata:
  resourceVersion: ""

benedikt-bartscher commented 9 months ago

I have exactly the same problem with buildkit v0.12.3 and k8s v1.27.8. All other namespaces work fine but buildkit has no pod metrics.

apiVersion: v1
kind: Pod
metadata:
  generateName: buildkit-amd64-57fcbc8c94-
  labels:
    app: buildkitd
    pod-template-hash: 57fcbc8c94
  name: buildkit-amd64-57fcbc8c94-2m9ch
  namespace: buildkit
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: buildkit-amd64-57fcbc8c94
    uid: b6172e11-590c-4839-a7fb-eca0d708064b
  resourceVersion: "56285397"
  uid: 2c42dad3-31fa-45c8-a196-18bf552d604b
spec:
  containers:
  - args:
    - --addr
    - unix:///run/buildkit/buildkitd.sock
    - --addr
    - tcp://0.0.0.0:1234
    image: docker.io/moby/buildkit:buildx-stable-1@sha256:d4187a7326f20d04fafd075f80ccc5d3f8cfd4f665c6e03d158a78e4f64bf3db
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - buildctl
        - debug
        - workers
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 1
    name: buildkitd
    ports:
    - containerPort: 1234
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - buildctl
        - debug
        - workers
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 30
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      requests:
        cpu: "6"
        memory: 14Gi
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/buildkit
      name: buildkit
    - mountPath: /etc/buildkit
      name: config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-58vlt
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: nodes-fsn1-6a1f21b83b0e8a35
  preemptionPolicy: Never
  priority: 1
  priorityClassName: normal
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - configMap:
      defaultMode: 420
      name: buildkit-amd64
    name: config
  - emptyDir: {}
    name: buildkit
  - name: kube-api-access-58vlt
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
profnandaa commented 9 months ago

update: applying the status/needs-investigation tag until the exact bug is identified.

falmar commented 4 months ago

Hi, hopefully this gets some attention; it's still happening with buildkit v0.13.2 and k8s v1.29.3.

bcha commented 2 months ago

We ended up locking the version to v0.11.6. I've now re-checked this, as that pinned version is getting pretty old and has a bunch of vulns. Upgraded buildkit to the latest v0.14.1 and we're still getting the same metrics issue. Nowadays we're running k8s v1.30.

Any update on this?

andresrsanchez commented 1 month ago

Same issue here

Solved with v0.11.2 😢

dcherniv commented 4 days ago

Same issue. All pods show metrics just fine, but buildkit pods show no metrics. Not sure if useful, but here are the relevant metrics from a pod on the same node as buildkit. Notice there's a metric with an "image:" label present here:

[image]

Same query for the buildkit pod. Notice the glaring absence of a metric with the moby/buildkit:0.15.2 image label. Only the pause image is present:

[image]
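
For anyone who wants to repeat this comparison without a dashboard, here is a rough sketch that lists the distinct image labels the kubelet's cAdvisor endpoint reports for one pod. It assumes the series carry `pod="..."` and `image="..."` labels, as in the query described above; the function name `cadvisor_images_for_pod` is made up.

```shell
# Hypothetical helper: read cAdvisor metrics text on stdin and print the distinct
# non-empty image labels attached to series for the given pod.
# Assumption: series carry pod="..." and image="..." labels.
cadvisor_images_for_pod() {
  pod="$1"
  grep -F "pod=\"${pod}\"" \
    | sed -n 's/.*image="\([^"][^"]*\)".*/\1/p' \
    | sort -u
}
```

Usage would be something like `kubectl get --raw "/api/v1/nodes/<node>/proxy/metrics/cadvisor" | cadvisor_images_for_pod buildkitd`. If the observation above holds, an affected pod should list only the pause image, while a working pod should also list its moby/buildkit image.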

dcherniv commented 4 days ago

Did some more investigation.

k8s version: EKS 1.29
OS version: AL2023 (latest)
arch: AMD64

Using this basic example: https://github.com/moby/buildkit/blob/master/examples/kubernetes/pod.privileged.yaml

Original example:

apiVersion: v1
kind: Pod
metadata:
  name: buildkitd
spec:
  containers:
    - name: buildkitd
      image: moby/buildkit:master
      readinessProbe:
        exec:
          command:
            - buildctl
            - debug
            - workers
        initialDelaySeconds: 5
        periodSeconds: 30
      livenessProbe:
        exec:
          command:
            - buildctl
            - debug
            - workers
        initialDelaySeconds: 5
        periodSeconds: 30
      securityContext:
        privileged: true

dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ kubectl get pods
NAME            READY   STATUS    RESTARTS   AGE
buildkitd       1/1     Running   0          8m11s
buildkitd-011   1/1     Running   0          2m37s
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ kubectl top pods buildkitd
Error from server (NotFound): podmetrics.metrics.k8s.io "default/buildkitd" not found
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ 

Slightly adjusted example (the only thing changed is the buildkit version):

apiVersion: v1
kind: Pod
metadata:
  name: buildkitd-011
spec:
  containers:
    - name: buildkitd
      image: moby/buildkit:v0.11.6
      readinessProbe:
        exec:
          command:
            - buildctl
            - debug
            - workers
        initialDelaySeconds: 5
        periodSeconds: 30
      livenessProbe:
        exec:
          command:
            - buildctl
            - debug
            - workers
        initialDelaySeconds: 5
        periodSeconds: 30
      securityContext:
        privileged: true

What do you know, it works?!

dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ kubectl top pods buildkitd-011
NAME            CPU(cores)   MEMORY(bytes)   
buildkitd-011   3m           8Mi             
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ 

But that's not all. Get this. Rootless works just fine https://github.com/moby/buildkit/blob/master/examples/kubernetes/pod.rootless.yaml

dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ kubectl top pods
NAME                 CPU(cores)   MEMORY(bytes)   
buildkitd-011        3m           8Mi             
buildkitd-rootless   5m           10Mi            
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ 

But that's not all. Get this. As soon as securityContext is removed from the pod.privileged.yaml it works as well:

dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ diff pod.privileged.yaml pod-test.yaml 
4c4,5
<   name: buildkitd
---
>   name: buildkitd-test
> 
7a9
> 
8a11
> 
25,26d27
<       securityContext:
<         privileged: true
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ 
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ kubectl top pods
NAME                 CPU(cores)   MEMORY(bytes)   
buildkitd-rootless   2m           13Mi            
buildkitd-test       0m           8Mi             
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ 

Why removing `securityContext.privileged` makes it work, I have no idea. Other pods with the same securityContext setting seem to work just fine:

dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ kubectl get pods -n kube-system -o yaml ebs-csi-node-45hwq | grep privil -B2
        memory: 40Mi
    securityContext:
      privileged: true
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ kubectl top pods -n kube-system ebs-csi-node-45hwq
NAME                 CPU(cores)   MEMORY(bytes)   
ebs-csi-node-45hwq   1m           24Mi            
dcherniv@lildebbie:~/Documents/personal/git/buildkit/examples/kubernetes$ 

Last working version was indeed v0.11.6. On v0.12.0-rc1 the metrics are not showing.
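
The version comparison above could be scripted roughly like this, so that other tags can be checked the same way. This is only a sketch: the function name `render_buildkit_pod` is hypothetical, and the manifest fields mirror the pod.privileged.yaml example quoted earlier with nothing but the name and image tag varied.

```shell
# Hypothetical helper: print the privileged example pod with a given name and tag.
# Fields mirror examples/kubernetes/pod.privileged.yaml as quoted above.
render_buildkit_pod() {
  name="$1"
  tag="$2"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  containers:
    - name: buildkitd
      image: moby/buildkit:${tag}
      readinessProbe:
        exec:
          command: ["buildctl", "debug", "workers"]
        initialDelaySeconds: 5
        periodSeconds: 30
      livenessProbe:
        exec:
          command: ["buildctl", "debug", "workers"]
        initialDelaySeconds: 5
        periodSeconds: 30
      securityContext:
        privileged: true
EOF
}
```

For example, `render_buildkit_pod buildkitd-011 v0.11.6 | kubectl apply -f -` and `render_buildkit_pod buildkitd-012rc1 v0.12.0-rc1 | kubectl apply -f -`, then `kubectl top pods` once the pods have been running for a minute or two.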

dcherniv commented 3 days ago

@jedevc bump on this. I hope the supplied information is enough to get this issue going? It's less than ideal because we cannot see CPU and memory usage for our builder pods to fine-tune our spend. We ended up provisioning giant buildkit pods so that the builders have enough CPU/RAM.