kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0

standalone installation not usable via Python SDK (unable to load root certificates) #2451

Closed: garymm closed this issue 2 weeks ago

garymm commented 2 weeks ago

What happened?

I installed Katib standalone following the instructions in the docs:

kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.17.0"

Then I used the Katib Python SDK as per the example in the docs. Creating an experiment fails with:

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"defaulter.experiment.katib.kubeflow.org\": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"defaulter.experiment.katib.kubeflow.org\": could not get REST client: unable to load root certificates: unable to parse bytes as PEM block"}]},"code":500}

From a related thread on Slack, I gather that the empty caBundle in the MutatingWebhookConfiguration may be the cause:

kubectl get MutatingWebhookConfiguration katib.kubeflow.org -o yaml

Outputs:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  annotations:
    cert-manager.io/inject-ca-from: kubeflow/katib-webhook-cert
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"admissionregistration.k8s.io/v1","kind":"MutatingWebhookConfiguration","metadata":{"annotations":{"cert-manager.io/inject-ca-from":"kubeflow/katib-webhook-cert"},"name":"katib.kubeflow.org"},"webhooks":[{"admissionReviewVersions":["v1"],"clientConfig":{"caBundle":"Cg==","service":{"name":"katib-controller","namespace":"kubeflow","path":"/mutate-experiment"}},"name":"defaulter.experiment.katib.kubeflow.org","rules":[{"apiGroups":["kubeflow.org"],"apiVersions":["v1beta1"],"operations":["CREATE","UPDATE"],"resources":["experiments"]}],"sideEffects":"None"},{"admissionReviewVersions":["v1"],"clientConfig":{"caBundle":"Cg==","service":{"name":"katib-controller","namespace":"kubeflow","path":"/mutate-pod"}},"name":"mutator.pod.katib.kubeflow.org","namespaceSelector":{"matchLabels":{"katib.kubeflow.org/metrics-collector-injection":"enabled"}},"objectSelector":{"matchExpressions":[{"key":"katib.kubeflow.org/metrics-collector-injection","operator":"NotIn","values":["disabled"]}]},"rules":[{"apiGroups":[""],"apiVersions":["v1"],"operations":["CREATE"],"resources":["pods"]}],"sideEffects":"None"}]}
  creationTimestamp: "2024-11-07T00:00:59Z"
  generation: 1
  name: katib.kubeflow.org
  resourceVersion: "4380064"
  uid: 2a1ab32a-a96e-4154-a58d-3271ad4bd21d
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    caBundle: Cg==
    service:
      name: katib-controller
      namespace: kubeflow
      path: /mutate-experiment
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: defaulter.experiment.katib.kubeflow.org
  namespaceSelector: {}
  objectSelector: {}
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - kubeflow.org
    apiVersions:
    - v1beta1
    operations:
    - CREATE
    - UPDATE
    resources:
    - experiments
    scope: '*'
  sideEffects: None
  timeoutSeconds: 10
- admissionReviewVersions:
  - v1
  clientConfig:
    caBundle: Cg==
    service:
      name: katib-controller
      namespace: kubeflow
      path: /mutate-pod
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: mutator.pod.katib.kubeflow.org
  namespaceSelector:
    matchLabels:
      katib.kubeflow.org/metrics-collector-injection: enabled
  objectSelector:
    matchExpressions:
    - key: katib.kubeflow.org/metrics-collector-injection
      operator: NotIn
      values:
      - disabled
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
    scope: '*'
  sideEffects: None
  timeoutSeconds: 10
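
The caBundle value Cg== shown above can be checked by decoding it. The following is a minimal sketch, not part of Katib or its SDK: the helper name ca_bundle_has_certificate is hypothetical, and looking for a PEM certificate header is only a rough heuristic for whether a CA has been injected.

```python
import base64

# Header that starts a PEM-encoded certificate block.
PEM_MARKER = b"-----BEGIN CERTIFICATE-----"


def ca_bundle_has_certificate(ca_bundle_b64: str) -> bool:
    """Hypothetical helper: True if the base64-encoded caBundle
    decodes to bytes that contain a PEM certificate header."""
    decoded = base64.b64decode(ca_bundle_b64)
    return PEM_MARKER in decoded


# The value from the webhook configuration in this report.
decoded = base64.b64decode("Cg==")
print(repr(decoded))                      # b'\n'
print(ca_bundle_has_certificate("Cg=="))  # False
```

Cg== decodes to a single newline character, so the bundle contains no certificate at all. That is consistent with the "unable to parse bytes as PEM block" message in the error.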

What did you expect to happen?

I expect to be able to use the Python SDK after installing Katib standalone.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.6

Katib controller version: 0.17.0

Katib Python SDK version: 0.17.0


tenzen-y commented 2 weeks ago

It seems that the certificate was not set in the webhook configurations correctly. Could you check the controller state with kubectl get pods -n kubeflow?

garymm commented 2 weeks ago

Ah yeah, the controller pod can't run because:

Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  3m11s (x527 over 17h)  kubelet  MountVolume.SetUp failed for volume "cert" : secret "katib-webhook-cert" not found

So it seems a secret needs to be created. Is it possible for the katib-standalone kube configs to handle this? If not, then I guess instructions need to be added explaining how users can do this on their own before applying the kube configs.
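
One way to confirm whether the secret named in the FailedMount event exists is to query it with kubectl. The following is a rough sketch, assuming kubectl is on the PATH and pointed at the affected cluster. The function names are hypothetical. The defaults are the secret name and namespace from the event and manifests above.

```python
import subprocess


def secret_check_command(name="katib-webhook-cert", namespace="kubeflow"):
    """Build the kubectl argument list that looks up a secret."""
    return ["kubectl", "get", "secret", name, "-n", namespace]


def secret_exists(name="katib-webhook-cert", namespace="kubeflow"):
    """Return True if kubectl finds the secret.

    A non-zero exit code is treated as "not found" here, although it
    can also mean the cluster is unreachable or access was denied.
    """
    result = subprocess.run(
        secret_check_command(name, namespace),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Calling secret_exists() before creating an experiment gives a quick signal that the controller's "cert" volume can be mounted.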

garymm commented 2 weeks ago

Hmm, I re-applied and it seems to work now. Not sure what happened the first time. I will close this and re-open it if I can reproduce.