kserve / kserve

Standardized Serverless ML Inference Platform on Kubernetes
https://kserve.github.io/website/
Apache License 2.0

Kserve Model Deployment issue #3473

Open amacharya opened 6 months ago

amacharya commented 6 months ago

/kind bug

What steps did you take and what happened:

Ref - https://kserve.github.io/website/master/admin/kubernetes_deployment/

Versions on my kind k8s cluster:

kind - Kubernetes v1.25
Istio - v1.16.0
Cert Manager - v1.14.2
KServe - v0.11.0 (deployed KServe without Knative)

kubectl get pods -A
NAMESPACE            NAME                                              READY   STATUS    RESTARTS         AGE
cert-manager         cert-manager-cainjector-76bbdd77f7-w4rpk          1/1     Running   4 (106m ago)     22h
cert-manager         cert-manager-cdbc489b6-mnq7d                      1/1     Running   5 (107m ago)     22h
cert-manager         cert-manager-webhook-7ffbff4575-5tlrb             1/1     Running   4 (106m ago)     22h
istio-system         istio-ingressgateway-748fb66b49-2w2wf             1/1     Running   2 (107m ago)     22h
istio-system         istiod-5d74c58fdd-x4br5                           1/1     Running   2 (107m ago)     22h
kserve               kserve-controller-manager-55d7c5685f-7s75b        2/2     Running   6 (107m ago)     22h
kserve               sklearn-iris-example-predictor-776df85f86-n6xvd   1/1     Running   118 (6m5s ago)   21h
kube-system          coredns-565d847f94-7f7nn                          1/1     Running   2 (107m ago)     22h
kube-system          coredns-565d847f94-vqvvk                          1/1     Running   2 (107m ago)     22h
kube-system          etcd-ethan-control-plane                          1/1     Running   2 (107m ago)     22h
kube-system          kindnet-jj7v7                                     1/1     Running   3 (107m ago)     22h
kube-system          kube-apiserver-ethan-control-plane                1/1     Running   2 (107m ago)     22h
kube-system          kube-controller-manager-ethan-control-plane       1/1     Running   2 (107m ago)     22h
kube-system          kube-proxy-vzbkb                                  1/1     Running   2 (107m ago)     22h
kube-system          kube-scheduler-ethan-control-plane                1/1     Running   2 (107m ago)     22h
local-path-storage   local-path-provisioner-684f458cdd-8d4qf           1/1     Running   4 (106m ago)     22h
kubectl -n kserve logs -f sklearn-iris-example-predictor-776df85f86-n6xvd     
INFO:root:Copying contents of /mnt/models to local
ERROR:root:fail to locate model file for model sklearn-iris-example under dir /mnt/models,trying loading from model repository.
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/sklearnserver/sklearnserver/__main__.py", line 42, in <module>
    kserve.ModelServer(registered_models=SKLearnModelRepository(args.model_dir)).start(
  File "/sklearnserver/sklearnserver/sklearn_model_repository.py", line 24, in __init__
    self.load_models()
  File "/kserve/kserve/model_repository.py", line 37, in load_models
    for name in os.listdir(self.models_dir):
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/models'

ref - https://github.com/kserve/kserve/tree/master/docs/samples/multimodelserving/sklearn

What did you expect to happen:

Any help would be appreciated.

What's the InferenceService yaml: [To help us debug please run kubectl get isvc $name -n $namespace -oyaml and paste the output]

Anything else you would like to add:

Environment: kind - kubernetes v1.25

Similar issue ref - https://github.com/kserve/kserve/issues/3082

cc @terrytangyuan @sivanantha321

sivanantha321 commented 6 months ago

Those docs are not up to date. Try this YAML:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris-example"
spec:
  predictor:
    minReplicas: 1
    sklearn:
      protocolVersion: v1
      resources:
        limits:
          cpu: 100m
          memory: 512Mi
        requests:
          cpu: 100m
          memory: 512Mi

We are planning to deprecate trained models, so I don't recommend using them.

DerTiedemann commented 6 months ago

I ran into the same issue today, though in my case the cause was that the MutatingWebhookConfiguration for the /mutate-pod endpoint in the v0.12.0 manifests ignores all namespaces that have the control-plane label set. That seems sensible, but I wanted everything KServe needs to live in a single namespace, including the deployed inference services. So either patch the webhook config to explicitly allow the kserve namespace (I can't recommend that for prod; I'm only doing it on a test cluster where I have access to a single namespace) or move the inference service to another namespace.

In general, it would be good to mention this somewhere, because in tightly controlled environments everything may end up in the same namespace.
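
To illustrate why pods in the kserve namespace are skipped, here is a minimal Python sketch of how a Kubernetes-style namespace selector is evaluated. It assumes the webhook's namespaceSelector is a single matchExpressions entry with operator DoesNotExist on the control-plane key, which is how the comment above describes it; the label values and the helper name `selector_matches` are hypothetical and this is not KServe code.

```python
from typing import Dict, List


def selector_matches(labels: Dict[str, str], expressions: List[dict]) -> bool:
    """Evaluate Kubernetes-style matchExpressions; every expression must hold."""
    for expr in expressions:
        key, op = expr["key"], expr["operator"]
        values = expr.get("values", [])
        if op == "Exists":
            ok = key in labels
        elif op == "DoesNotExist":
            ok = key not in labels
        elif op == "In":
            ok = key in labels and labels[key] in values
        elif op == "NotIn":
            ok = key not in labels or labels[key] not in values
        else:
            raise ValueError(f"unsupported operator: {op}")
        if not ok:
            return False
    return True


# Assumed selector: skip every namespace that carries a control-plane label.
NAMESPACE_SELECTOR = [{"key": "control-plane", "operator": "DoesNotExist"}]

# Hypothetical label sets for two namespaces.
kserve_ns_labels = {"control-plane": "kserve-controller-manager"}
user_ns_labels = {"kubernetes.io/metadata.name": "models"}

# The webhook is only called for pods in namespaces the selector matches.
print(selector_matches(kserve_ns_labels, NAMESPACE_SELECTOR))  # kserve namespace
print(selector_matches(user_ns_labels, NAMESPACE_SELECTOR))    # separate namespace
```

Under these assumptions the selector does not match the kserve namespace, so the pod mutation never runs there and nothing populates /mnt/models, which would be consistent with the FileNotFoundError in the original report.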

amacharya commented 6 months ago

@DerTiedemann - You can follow this guide: https://kserve.github.io/website/0.11/get_started/

I was able to unblock myself, and it worked for me.

rhuss commented 5 months ago

I've also struggled with the webhook completely ignoring, without any error, all pods that run in the system namespace (`kserve`).

I recommend raising an explicit error instead of silently skipping the webhook while still creating the deployment for the isvc runtime. IMO, the error should already be raised by the validation webhook of the isvc itself, which should not allow an isvc to be deployed in the system namespace.
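
The check suggested above could look roughly like the following Python sketch. It is only an illustration of the idea, not KServe's validation webhook: the system namespace name `kserve`, the `ValidationError` class and the function `validate_isvc_namespace` are all assumptions made for this example.

```python
SYSTEM_NAMESPACE = "kserve"  # assumption: the namespace where the controller runs


class ValidationError(Exception):
    """Raised when an InferenceService manifest must be rejected."""


def validate_isvc_namespace(isvc: dict, system_namespace: str = SYSTEM_NAMESPACE) -> None:
    """Reject an InferenceService that targets the system namespace.

    `isvc` is the manifest as a plain dict; a missing namespace is
    treated as "default", as kubectl does when none is given.
    """
    metadata = isvc.get("metadata", {})
    namespace = metadata.get("namespace", "default")
    if namespace == system_namespace:
        name = metadata.get("name", "<unnamed>")
        raise ValidationError(
            f"InferenceService {name!r} cannot be deployed in the system "
            f"namespace {system_namespace!r}; the pod webhook skips this "
            f"namespace, so the model would never be downloaded"
        )


# Usage: the first manifest is rejected, the second passes.
bad = {"metadata": {"name": "sklearn-iris-example", "namespace": "kserve"}}
good = {"metadata": {"name": "sklearn-iris-example", "namespace": "models"}}

try:
    validate_isvc_namespace(bad)
except ValidationError as err:
    print(f"rejected: {err}")

validate_isvc_namespace(good)  # no exception
```

Failing at admission time like this would surface the problem when the isvc is created, rather than as a crash-looping predictor pod.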