kserve / kserve

Standardized Serverless ML Inference Platform on Kubernetes
https://kserve.github.io/website/
Apache License 2.0

Oracle/S3: ConnectionRefusedError #3986

Open Lejboelle opened 2 days ago

Lejboelle commented 2 days ago

/kind bug

What steps did you take and what happened: I'm trying to deploy an InferenceService using a model stored in an S3 bucket in Oracle Cloud. I followed the documentation and set up the credentials and service account as follows:

apiVersion: v1
kind: Secret
metadata:
  name: s3-secret
  annotations:
    serving.kserve.io/s3-endpoint: mynamespace.compat.objectstorage.myregion.oraclecloud.com
    serving.kserve.io/s3-usehttps: "1"
    serving.kserve.io/s3-region: "myregion"
    serving.kserve.io/s3-useanoncredential: "false"
type: Opaque
stringData: # use `stringData` for raw credential string or `data` for base64 encoded string
  AWS_ACCESS_KEY_ID: XXXXXXX
  AWS_SECRET_ACCESS_KEY: XXXXXX
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kserve-sa
secrets:
  - name: s3-secret

My inferenceservice (custom predictor):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  labels:
    app: my-inferenceservice
    component: inferenceservice
  name: my-inferenceservice
spec:
  predictor:
    serviceAccountName: kserve-sa
    containers:
    - args:
      - --model_path
      - /mnt/models/model
      - --model_name
      - my-inferenceservice
      command:
      - python
      - -m
      - model
      env:
      - name: STORAGE_URI
        value: s3://my-bucket/path-to-model
      image: my-registry/my-image:latest
      name: kserve-container

When trying to deploy this service, I get an error:

Traceback (most recent call last):
  File "/prod_venv/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/prod_venv/lib/python3.9/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/prod_venv/lib/python3.9/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/prod_venv/lib/python3.9/site-packages/botocore/httpsession.py", line 465, in send
    urllib_response = conn.urlopen(
  File "/prod_venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 798, in urlopen
    retries = retries.increment(
  File "/prod_venv/lib/python3.9/site-packages/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/prod_venv/lib/python3.9/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/prod_venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "/prod_venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 403, in _make_request
    self._validate_conn(conn)
  File "/prod_venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 1053, in _validate_conn
    conn.connect()
  File "/prod_venv/lib/python3.9/site-packages/urllib3/connection.py", line 363, in connect
    self.sock = conn = self._new_conn()
  File "/prod_venv/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPSConnection object at 0x7efd8d2d8850>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/storage-initializer/scripts/initializer-entrypoint", line 15, in <module>
    Storage.download(src_uri, dest_path)
  File "/kserve/kserve/storage/storage.py", line 83, in download
    Storage._download_s3(uri, out_dir)
  File "/kserve/kserve/storage/storage.py", line 177, in _download_s3
    for obj in bucket.objects.filter(Prefix=bucket_path):
  File "/prod_venv/lib/python3.9/site-packages/boto3/resources/collection.py", line 81, in __iter__
    for page in self.pages():
  File "/prod_venv/lib/python3.9/site-packages/boto3/resources/collection.py", line 171, in pages
    for page in pages:
  File "/prod_venv/lib/python3.9/site-packages/botocore/paginate.py", line 269, in __iter__
    response = self._make_request(current_kwargs)
  File "/prod_venv/lib/python3.9/site-packages/botocore/paginate.py", line 357, in _make_request
    return self._method(**current_kwargs)
  File "/prod_venv/lib/python3.9/site-packages/botocore/client.py", line 534, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/prod_venv/lib/python3.9/site-packages/botocore/client.py", line 959, in _make_api_call
    http, parsed_response = self._make_request(
  File "/prod_venv/lib/python3.9/site-packages/botocore/client.py", line 982, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/prod_venv/lib/python3.9/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/prod_venv/lib/python3.9/site-packages/botocore/endpoint.py", line 202, in _send_request
    while self._needs_retry(
  File "/prod_venv/lib/python3.9/site-packages/botocore/endpoint.py", line 354, in _needs_retry
    responses = self._event_emitter.emit(
  File "/prod_venv/lib/python3.9/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/prod_venv/lib/python3.9/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/prod_venv/lib/python3.9/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/prod_venv/lib/python3.9/site-packages/botocore/retryhandler.py", line 207, in __call__
    if self._checker(**checker_kwargs):
  File "/prod_venv/lib/python3.9/site-packages/botocore/retryhandler.py", line 284, in __call__
    should_retry = self._should_retry(
  File "/prod_venv/lib/python3.9/site-packages/botocore/retryhandler.py", line 320, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/prod_venv/lib/python3.9/site-packages/botocore/retryhandler.py", line 363, in __call__
    checker_response = checker(
  File "/prod_venv/lib/python3.9/site-packages/botocore/retryhandler.py", line 247, in __call__
    return self._check_caught_exception(
  File "/prod_venv/lib/python3.9/site-packages/botocore/retryhandler.py", line 416, in _check_caught_exception
    raise caught_exception
  File "/prod_venv/lib/python3.9/site-packages/botocore/endpoint.py", line 281, in _do_get_response
    http_response = self._send(request)
  File "/prod_venv/lib/python3.9/site-packages/botocore/endpoint.py", line 377, in _send
    return self.http_session.send(request)
  File "/prod_venv/lib/python3.9/site-packages/botocore/httpsession.py", line 494, in send
    raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://mynamespace.compat.objectstorage.myregion.oraclecloud.com/my-bucket?prefix=path-to-model&encoding-type=url"

Describing the pod shows that credentials have been injected into the pod:

Init Containers:
  storage-initializer:
    Container ID:  cri-o://286c88d0af601158b01a627622c3b30001a21cfbd38d86d2938659c9978029ad
    Image:         kserve/storage-initializer:v0.11.2
    Image ID:      docker.io/kserve/storage-initializer@sha256:6e0730cae62d23dcdee80a01f2b69ffba9ff5acb3ef98129a0833cbab88a756e
    Port:          <none>
    Host Port:     <none>
    Args:
      s3://my-bucket/path-to-model
      /mnt/models
    State:       Running
      Started:   Thu, 10 Oct 2024 20:44:24 +0200
    Last State:  Terminated
      Reason:    Error
      Message:   _needs_retry(
  File "/prod_venv/lib/python3.9/site-packages/botocore/endpoint.py", line 354, in _needs_retry
    responses = self._event_emitter.emit(
  File "/prod_venv/lib/python3.9/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/prod_venv/lib/python3.9/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/prod_venv/lib/python3.9/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/prod_venv/lib/python3.9/site-packages/botocore/retryhandler.py", line 207, in __call__
    if self._checker(**checker_kwargs):
  File "/prod_venv/lib/python3.9/site-packages/botocore/retryhandler.py", line 284, in __call__
    should_retry = self._should_retry(
  File "/prod_venv/lib/python3.9/site-packages/botocore/retryhandler.py", line 320, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/prod_venv/lib/python3.9/site-packages/botocore/retryhandler.py", line 363, in __call__
    checker_response = checker(
  File "/prod_venv/lib/python3.9/site-packages/botocore/retryhandler.py", line 247, in __call__
    return self._check_caught_exception(
  File "/prod_venv/lib/python3.9/site-packages/botocore/retryhandler.py", line 416, in _check_caught_exception
    raise caught_exception
  File "/prod_venv/lib/python3.9/site-packages/botocore/endpoint.py", line 281, in _do_get_response
    http_response = self._send(request)
  File "/prod_venv/lib/python3.9/site-packages/botocore/endpoint.py", line 377, in _send
    return self.http_session.send(request)
  File "/prod_venv/lib/python3.9/site-packages/botocore/httpsession.py", line 494, in send
    raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://mynamespace.compat.objectstorage.myregion.oraclecloud.com/my-bucket?prefix=path-to-model&encoding-type=url"

      Exit Code:    1
      Started:      Thu, 10 Oct 2024 20:44:03 +0200
      Finished:     Thu, 10 Oct 2024 20:44:23 +0200
    Ready:          False
    Restart Count:  1
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     100m
      memory:  100Mi
    Environment:
      AWS_ACCESS_KEY_ID:       <set to the key 'AWS_ACCESS_KEY_ID' in secret 's3-secret'>      Optional: false
      AWS_SECRET_ACCESS_KEY:   <set to the key 'AWS_SECRET_ACCESS_KEY' in secret 's3-secret'>  Optional: false
      S3_USE_HTTPS:            1
      S3_ENDPOINT:             mynamespace.compat.objectstorage.myregion.oraclecloud.com
      AWS_ENDPOINT_URL:        https://mynamespace.compat.objectstorage.myregion.oraclecloud.com
      awsAnonymousCredential:  false
      AWS_DEFAULT_REGION:      myregion
    Mounts:
      /mnt/models from kserve-provision-location (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xxxx (ro)

Following the documentation, I also tried setting proxy.holdApplicationUntilProxyStarts: true in istio-sidecar-injector but this didn't help.

What did you expect to happen: The storage initializer would download the model and the InferenceService would start.

Anything else you would like to add: If I'm running an s3 client (using boto3) in a simple Python pod, I'm able to connect and download objects from the S3 bucket.

Environment:

spolti commented 2 days ago

Is the URL https://mynamespace.compat.objectstorage.myregion.oraclecloud.com/ accessible from your environment?
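
One way to check this from inside the cluster is a plain TCP connect to the endpoint's host and port, since `[Errno 111] Connection refused` happens before any S3 or TLS traffic. The snippet below is a minimal sketch using only the Python standard library. The helper names `endpoint_host_port` and `can_connect` are made up for illustration and are not part of KServe or boto3.

```python
import socket
from urllib.parse import urlsplit


def endpoint_host_port(endpoint_url):
    """Split an endpoint URL into (host, port), defaulting the port by scheme.

    A bare host name (as in the S3_ENDPOINT variable) is treated as https here;
    that default is an assumption made for this sketch.
    """
    if "://" not in endpoint_url:
        endpoint_url = "https://" + endpoint_url
    parts = urlsplit(endpoint_url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, DNS failures, and timeouts.
        return False
```

Running `can_connect(*endpoint_host_port(url))` with the value of `AWS_ENDPOINT_URL`, once from the working Python pod and once from a pod that has the same sidecar and network policy as the InferenceService, would show whether the refusal is specific to the InferenceService pod.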

Lejboelle commented 2 days ago

Yes, I can spin up a pod using a Python image in the same k8s namespace (note: not to be confused with the namespace for the bucket), set up an S3 client using boto, and list (and download) my objects from there. It just doesn't connect from the InferenceService.
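
For comparison, a test like the one described could look roughly like the sketch below. It is not the storage initializer's actual code and not the exact script used in the test pod. It assumes the same environment variables that appear in the pod description (`AWS_ENDPOINT_URL`, `AWS_DEFAULT_REGION`, plus the access key variables that boto3 reads on its own), and the helper names `s3_client_kwargs` and `list_keys` are hypothetical.

```python
import os


def s3_client_kwargs(env=None):
    """Build boto3 keyword arguments from the variables shown in the pod description."""
    env = os.environ if env is None else env
    kwargs = {}
    endpoint = env.get("AWS_ENDPOINT_URL")
    if endpoint:
        kwargs["endpoint_url"] = endpoint
    region = env.get("AWS_DEFAULT_REGION")
    if region:
        kwargs["region_name"] = region
    return kwargs


def list_keys(bucket, prefix, env=None):
    """List object keys under a prefix, using the same filter call seen in the traceback."""
    import boto3  # imported lazily so s3_client_kwargs works without boto3 installed

    # Credentials come from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in the environment.
    s3 = boto3.resource("s3", **s3_client_kwargs(env))
    return [obj.key for obj in s3.Bucket(bucket).objects.filter(Prefix=prefix)]
```

Calling `list_keys("my-bucket", "path-to-model")` in both pods with identical environment variables keeps the endpoint, region, and credentials the same, so any remaining difference should come from the pod's network path rather than from the client configuration.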

spolti commented 1 day ago

Okay, how did you connect with your Python image? Did you use exactly the same URL and port?