Unable to run InferenceService on a local cluster

yurkoff-mv commented 4 months ago

Validation Checklist

[X] Is this a Kubeflow issue?
[X] Are you posting in the right repository ?
[X] Did you follow the installation guide https://github.com/kubeflow/manifests?tab=readme-ov-file ?
[X] Is the issue report properly structured and detailed with version numbers?
[ ] Is this for Kubeflow development ?
[ ] Would you like to work on this issue?
[ ] Join our slack channel using wg-manifests.

Version

1.8

Describe your issue

I have a local cluster without internet access, with Kubeflow manifests version 1.8 deployed on it. I deployed this version using images imported as tar files, and I also imported the image for the InferenceService as a tar file. However, the service does not start. Running the command

microk8s kubectl describe inferenceservices -n kubeflow-namespace llm

shows the following error message:

Revision "llm-predictor-00001" failed with message: Unable to fetch image "yurkoff/torchserve-kfs:0.9.0-gpu": failed to resolve image to digest: Get "https://index.docker.io/v2/": read tcp 10.1.22.219:48238->54.198.86.24:443: read: connection reset by peer.

Moreover, the image is present in microk8s ctr:

microk8s ctr images list | grep yurkoff
docker.io/yurkoff/torchserve-kfs:0.9.0-gpu application/vnd.docker.distribution.manifest.v2+json sha256:1b771d7c0c2d26f78e892997cb00e6051c77cf3654827c4715aa5a502267ee76 5.7 GiB linux/amd64 io.cri-containerd.image=managed

My YAML file for the InferenceService (please note that I specifically set imagePullPolicy: "Never"):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "llm"
  namespace: "kubeflow-namespace"
spec:
  predictor:
    pytorch:
      protocolVersion: v1
      runtimeVersion: "0.9.0-gpu"
      image: "yurkoff/torchserve-kfs:0.9.0-gpu"
      imagePullPolicy: "Never"
      storageUri: pvc://torchserve-claim/llm
      resources:
        requests:
          cpu: "2"
          memory: 16Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"
          memory: 30Gi
          nvidia.com/gpu: "1"
    minReplicas: 1
    maxReplicas: 1
    timeout: 180
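
A note on the error above: the "failed to resolve image to digest" message is produced by Knative Serving, which resolves image tags to digests in its own controller before any pod is created, so imagePullPolicy: "Never" (a kubelet setting) does not prevent the lookup against index.docker.io. A possible workaround, sketched here under assumptions, is to tell Knative to skip tag resolution for that registry. The ConfigMap name, namespace, and key below follow the Knative Serving defaults; whether they match this particular air-gapped installation, and whether the key spelling matches the installed Knative version, has not been verified here.

```yaml
# Sketch only: merge this key into the existing config-deployment ConfigMap
# rather than replacing the whole object.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment      # assumed default Knative Serving ConfigMap name
  namespace: knative-serving   # assumed default namespace
data:
  # The first three entries are the Knative defaults; index.docker.io is added
  # because the error shows the lookup going to https://index.docker.io/v2/.
  registries-skipping-tag-resolving: "kind.local,ko.local,dev.local,index.docker.io"
```

With tag resolution skipped, the revision should reference the image by tag, and the kubelet can then use the locally imported copy as requested by imagePullPolicy: "Never".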

Steps to reproduce the issue

On another machine with internet access:

microk8s ctr images pull docker.io/yurkoff/torchserve-kfs:0.9.0-gpu
microk8s ctr images export yurkoff_torchserve-kfs_0.9.0-gpu.tar docker.io/yurkoff/torchserve-kfs:0.9.0-gpu

On the local machine without internet access:

microk8s ctr images import yurkoff_torchserve-kfs_0.9.0-gpu.tar
microk8s kubectl apply -f llm_isvc.yaml

Put here any screenshots or videos (optional)

No response

juliusvonkohout commented 4 months ago

Hello, I do not see how this is Kubeflow related; I only see microk8s issues.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

juliusvonkohout commented 2 months ago

Probably a duplicate of https://github.com/kubeflow/manifests/issues/2575

juliusvonkohout commented 2 months ago

Let's continue with secure stuff in https://github.com/kubeflow/manifests/issues/2811
