kubeflow / manifests

A repository for Kustomize manifests
Apache License 2.0

Unable to run InferenceService on a local cluster #2715

Closed (yurkoff-mv closed this issue 2 months ago)

yurkoff-mv commented 4 months ago

Version

1.8

Describe your issue

I have a local cluster without internet access, running manifests version 1.8. I deployed this version using images imported as tar files, and I also imported the image for the InferenceService as a tar file. However, the service does not start. Running

  microk8s kubectl describe inferenceservices -n kubeflow-namespace llm

shows the following error message:

  Revision "llm-predictor-00001" failed with message: Unable to fetch image "yurkoff/torchserve-kfs:0.9.0-gpu": failed to resolve image to digest: Get "https://index.docker.io/v2/": read tcp 10.1.22.219:48238->54.198.86.24:443: read: connection reset by peer.

Yet the image is present in microk8s ctr:

  microk8s ctr images list | grep yurkoff
  docker.io/yurkoff/torchserve-kfs:0.9.0-gpu application/vnd.docker.distribution.manifest.v2+json sha256:1b771d7c0c2d26f78e892997cb00e6051c77cf3654827c4715aa5a502267ee76 5.7 GiB linux/amd64 io.cri-containerd.image=managed

My YAML file for the InferenceService (please note that I deliberately set imagePullPolicy: "Never"):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "llm"
  namespace: "kubeflow-namespace"
spec:
  predictor:
    pytorch:
      protocolVersion: v1
      runtimeVersion: "0.9.0-gpu"
      image: "yurkoff/torchserve-kfs:0.9.0-gpu"
      imagePullPolicy: "Never"
      storageUri: pvc://torchserve-claim/llm
      resources:
        requests:
          cpu: "2"
          memory: 16Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"
          memory: 30Gi
          nvidia.com/gpu: "1"
    minReplicas: 1
    maxReplicas: 1
    timeout: 180
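A possible explanation, offered as an assumption and not confirmed anywhere in this thread: the "failed to resolve image to digest" message is typically produced by the Knative Serving controller, which resolves image tags to digests when a revision is created. That lookup goes to the registry and is independent of imagePullPolicy, which only affects the kubelet. Knative Serving reads a `registries-skipping-tag-resolving` key from its `config-deployment` ConfigMap. The sketch below assumes Knative Serving is installed in the `knative-serving` namespace; `build_patch` and the `APPLY` switch are hypothetical helpers for illustration.

```shell
#!/bin/sh
# Sketch: ask Knative Serving to skip tag-to-digest resolution for Docker Hub.
# Assumptions (not from the report): Knative Serving runs in the
# "knative-serving" namespace and reads the "config-deployment" ConfigMap.

# Build a JSON merge patch for a comma-separated list of registries.
build_patch() {
  printf '{"data":{"registries-skipping-tag-resolving":"%s"}}' "$1"
}

PATCH=$(build_patch "index.docker.io")

# By default only print the command; set APPLY=1 to run it on the cluster.
if [ "${APPLY:-0}" = "1" ]; then
  microk8s kubectl patch configmap config-deployment \
    -n knative-serving --type merge -p "$PATCH"
else
  echo "microk8s kubectl patch configmap config-deployment -n knative-serving --type merge -p '$PATCH'"
fi
```

The merge patch overwrites any value already stored under that key, so if the ConfigMap lists other registries, they would need to be included in the argument to `build_patch`. After patching, the InferenceService may need to be deleted and re-applied so that a new revision is created.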

Steps to reproduce the issue

On another machine with internet access:

  1. microk8s ctr images pull docker.io/yurkoff/torchserve-kfs:0.9.0-gpu
  2. microk8s ctr images export yurkoff_torchserve-kfs_0.9.0-gpu.tar docker.io/yurkoff/torchserve-kfs:0.9.0-gpu

On the local machine without internet access:

  1. microk8s ctr images import yurkoff_torchserve-kfs_0.9.0-gpu.tar
  2. microk8s kubectl apply -f llm_isvc.yaml
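The image transfer in the steps above can be sketched as a single script. It uses only the `microk8s ctr images` subcommands already shown in this report; the helper names `tar_name` and `run`, the `RUN` switch, and the `export`/`import` mode argument are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch of the air-gapped image transfer described in the steps above.
IMAGE="docker.io/yurkoff/torchserve-kfs:0.9.0-gpu"

# Derive a tar file name from an image reference: drop the "docker.io/"
# prefix, then replace "/" and ":" with "_".
tar_name() {
  ref=${1#docker.io/}
  printf '%s.tar' "$(printf '%s' "$ref" | tr '/:' '__')"
}

# Echo the command unless RUN=1 is set, so the script is safe to try out.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

TAR=$(tar_name "$IMAGE")

case "${1:-export}" in
  export)  # on the machine with internet access
    run microk8s ctr images pull "$IMAGE"
    run microk8s ctr images export "$TAR" "$IMAGE"
    ;;
  import)  # on the air-gapped machine
    run microk8s ctr images import "$TAR"
    run microk8s ctr images list
    ;;
esac
```

Running it with `RUN=1` and the `export` argument on the connected machine, copying the tar file over, and then running it with `import` on the local machine mirrors the manual steps. The final listing is there to confirm that the image reference matches the one used in the InferenceService.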

juliusvonkohout commented 4 months ago

Hello, I do not see how that is Kubeflow related; I only see microk8s issues.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

juliusvonkohout commented 2 months ago

Probably a duplicate of https://github.com/kubeflow/manifests/issues/2575

juliusvonkohout commented 2 months ago

Let's continue with secure stuff in https://github.com/kubeflow/manifests/issues/2811