kserve / kserve

Standardized Serverless ML Inference Platform on Kubernetes
https://kserve.github.io/website/
Apache License 2.0

Facing KeyError: 0 when trying to load model from S3 URI #2818

Open sagarnildass opened 1 year ago

sagarnildass commented 1 year ago

/kind bug

What steps did you take and what happened: Hi, I have a Kubeflow deployment running on EKS on AWS, and I am trying to replicate the Iris example for KServe. I followed this tutorial step by step: https://awslabs.github.io/kubeflow-manifests/docs/component-guides/kserve/tutorial/. I am using a Cognito + RDS + S3 setup.

This is my model training code:

import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
import joblib
#import sklearn.externals.joblib as joblib

# load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# fit a logistic regression model
reg = linear_model.LogisticRegression()
reg.fit(X_train, y_train)

# save the model to a file
joblib.dump(reg, 'model.joblib')

Now I have uploaded this model to an S3 bucket: s3://kubeflow-artelus/iris.
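For completeness, this is roughly how the upload can be done with boto3 (a minimal sketch; the credentials are assumed to come from the standard AWS credential chain, and the key name is what the sklearn runtime looks for under the storageUri prefix):

import boto3

s3 = boto3.client("s3")
# The sklearnserver expects a model.joblib file under the storageUri prefix.
s3.upload_file("model.joblib", "kubeflow-artelus", "iris/model.joblib")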

This is my s3_secret.yaml file where I am creating the secret and the service account:

apiVersion: v1
kind: Secret
metadata:
  name: kfs-serving-secret
  namespace: sagarnildass
  annotations:
     serving.kubeflow.org/s3-endpoint: s3.amazonaws.com
     serving.kubeflow.org/s3-usehttps: "1"
     serving.kubeflow.org/s3-region: us-east-2

type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "my-access-key-id"
  AWS_SECRET_ACCESS_KEY: "my-secret-access-key"

---

apiVersion: v1
kind: ServiceAccount
metadata:
  name: kfs-serving-sa
  namespace: sagarnildass
secrets:
  - name: kfs-serving-secret

And here's my inferenceService.yaml file

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
  namespace: sagarnildass
spec:
  predictor:
    serviceAccountName: kfs-serving-sa
    model:
      modelFormat:
        name: sklearn
      storageUri: "s3://kubeflow-artelus/iris"

When I apply these files with kubectl, I am getting an error in the pod logs:

message: |
          [I 230412 13:37:26 storage:54] Copying contents of /mnt/models to local
          Traceback (most recent call last):
            File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
              "__main__", mod_spec)
            File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
              exec(code, run_globals)
            File "/sklearnserver/sklearnserver/__main__.py", line 35, in <module>
              model.load()
            File "/sklearnserver/sklearnserver/model.py", line 43, in load
              self._model = joblib.load(existing_paths[0])
            File "/usr/local/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 587, in load
              obj = _unpickle(fobj, filename, mmap_mode)
            File "/usr/local/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 506, in _unpickle
              obj = unpickler.load()
            File "/usr/local/lib/python3.7/pickle.py", line 1088, in load
              dispatch[key[0]](self)
          KeyError: 0

The moment I change the S3 URI to a GCS ("gs://") URI, the whole thing starts working.

Some of the other things I tried:

  1. Including
annotations:
    sidecar.istio.io/inject: "false"

in the InferenceService.yaml

  2. Setting proxy.holdApplicationUntilProxyStarts: true in the istio-sidecar-injector ConfigMap

What did you expect to happen:

I expected the S3 URI to also work so that I could successfully deploy the model.
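For context, this is a minimal sketch of how the endpoint can be tested once the predictor is Ready (the ingress address, the external hostname, and the auth handling are placeholders for a Cognito-fronted cluster, not values from this deployment):

import requests

INGRESS = "http://<alb-or-istio-ingress-address>"   # placeholder
HOST = "sklearn-iris.sagarnildass.example.com"      # placeholder external hostname

resp = requests.post(
    f"{INGRESS}/v1/models/sklearn-iris:predict",
    headers={"Host": HOST},                          # route via the gateway
    json={"instances": [[6.8, 2.8, 4.8, 1.4]]},      # one iris sample
)
print(resp.status_code, resp.json())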

What's the InferenceService yaml: [To help us debug please run kubectl get isvc $name -n $namespace -oyaml and paste the output]


apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"serving.kserve.io/v1beta1","kind":"InferenceService","metadata":{"annotations":{},"name":"sklearn-iris","namespace":"sagarnildass"},"spec":{"predictor":{"model":{"modelFormat":{"name":"sklearn"},"storageUri":"s3://kubeflow-artelus/iris"},"serviceAccountName":"kfs-serving-sa"}}}
  creationTimestamp: "2023-04-12T13:35:49Z"
  finalizers:
  - inferenceservice.finalizers
  generation: 1
  managedFields:
  - apiVersion: serving.kserve.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        .: {}
        f:predictor:
          .: {}
          f:model:
            .: {}
            f:modelFormat:
              .: {}
              f:name: {}
            f:storageUri: {}
          f:serviceAccountName: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2023-04-12T13:35:46Z"
  - apiVersion: serving.kserve.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .: {}
          v:"inferenceservice.finalizers": {}
    manager: manager
    operation: Update
    time: "2023-04-12T13:35:49Z"
  - apiVersion: serving.kserve.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        .: {}
        f:components:
          .: {}
          f:predictor:
            .: {}
            f:latestCreatedRevision: {}
        f:conditions: {}
    manager: manager
    operation: Update
    subresource: status
    time: "2023-04-12T13:35:52Z"
  name: sklearn-iris
  namespace: sagarnildass
  resourceVersion: "1530371"
  uid: d188af1a-deb1-4baa-b626-ab3fef4102d1
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      name: ""
      resources: {}
      storageUri: s3://kubeflow-artelus/iris
    serviceAccountName: kfs-serving-sa
status:
  components:
    predictor:
      latestCreatedRevision: sklearn-iris-predictor-default-00001
  conditions:
  - lastTransitionTime: "2023-04-12T13:45:54Z"
    message: |-
      Revision "sklearn-iris-predictor-default-00001" failed with message: Container failed with: [I 230412 13:41:46 storage:54] Copying contents of /mnt/models to local
      Traceback (most recent call last):
        File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
          "__main__", mod_spec)
        File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
          exec(code, run_globals)
        File "/sklearnserver/sklearnserver/__main__.py", line 35, in <module>
          model.load()
        File "/sklearnserver/sklearnserver/model.py", line 43, in load
          self._model = joblib.load(existing_paths[0])
        File "/usr/local/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 587, in load
          obj = _unpickle(fobj, filename, mmap_mode)
        File "/usr/local/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 506, in _unpickle
          obj = unpickler.load()
        File "/usr/local/lib/python3.7/pickle.py", line 1088, in load
          dispatch[key[0]](self)
      KeyError: 0
      .
    reason: RevisionFailed
    severity: Info
    status: "False"
    type: PredictorConfigurationReady
  - lastTransitionTime: "2023-04-12T13:45:54Z"
    message: Configuration "sklearn-iris-predictor-default" does not have any ready
      Revision.
    reason: RevisionMissing
    status: "False"
    type: PredictorReady
  - lastTransitionTime: "2023-04-12T13:45:54Z"
    message: Configuration "sklearn-iris-predictor-default" does not have any ready
      Revision.
    reason: RevisionMissing
    severity: Info
    status: "False"
    type: PredictorRouteReady
  - lastTransitionTime: "2023-04-12T13:45:54Z"
    message: Configuration "sklearn-iris-predictor-default" does not have any ready
      Revision.
    reason: RevisionMissing
    status: "False"
    type: Ready

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

- Istio Version:
- Knative Version:
- KServe Version:
- Kubeflow version:
- Cloud Environment:[k8s_istio/istio_dex/gcp_basic_auth/gcp_iap/aws/aws_cognito/ibm]
- Minikube/Kind version:
- Kubernetes version: (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

papagala commented 1 year ago

I have the exact same issue. Let me know if you found a workaround.

Wercurial commented 1 year ago

Hi, you can try the latest version of KServe. This issue should have been resolved by PR #2252. @sagarnildass @papagala

sagarnildass commented 1 year ago

@Wercurial Thank you! I will surely check it out and give my feedback.

Wercurial commented 1 year ago

> @Wercurial Thank you! I will surely check it out and give my feedback.

You're welcome. I hope it helps.

112358fn commented 9 months ago

TL;DR: Your model was serialized with a newer version of joblib than the one used by kserve/sklearnserver. To fix it, serialize (dump) your model with joblib<1.0.


I was getting the exact same error with KServe version 0.8.0:

Copying contents of /mnt/models to local
          Traceback (most recent call last):
            File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
              "__main__", mod_spec)
            File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
              exec(code, run_globals)
            File "/sklearnserver/sklearnserver/__main__.py", line 35, in <module>
              model.load()
            File "/sklearnserver/sklearnserver/model.py", line 43, in load
              self._model = joblib.load(existing_paths[0])
            File "/usr/local/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 587, in load
              obj = _unpickle(fobj, filename, mmap_mode)
            File "/usr/local/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 506, in _unpickle
              obj = unpickler.load()
            File "/usr/local/lib/python3.7/pickle.py", line 1088, in load
              dispatch[key[0]](self)
          KeyError: 0

The reason was that sklearnserver was using an older version of joblib to load the model.

The solution was to use joblib~=0.14 to serialize (dump) the model.
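A minimal sketch of that workaround, assuming a training environment created with pip install "joblib~=0.14" scikit-learn (the max_iter value is only there to avoid the convergence warning):

import joblib
from sklearn import datasets, linear_model

# Guard against dumping with a joblib that the old sklearnserver image cannot read.
assert joblib.__version__.startswith("0."), (
    "install joblib~=0.14 before dumping; found " + joblib.__version__)

iris = datasets.load_iris()
reg = linear_model.LogisticRegression(max_iter=200).fit(iris.data, iris.target)
joblib.dump(reg, "model.joblib")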

You can reproduce the error from the joblib.load call in sklearnserver/model.py (the same traceback as above) just by using joblib>1.0 to dump and joblib<1.0 to load.