swanchain / go-computing-provider

A golang implementation of computing provider
MIT License

ErrImagePull - stable-diffusion-bse-lora not deploying: HTTP response to HTTPS client #26

Closed ThomasBlock closed 1 month ago

ThomasBlock commented 4 months ago

Thank you for the ongoing updates. My tasks and UBI tasks work fine, and GPU types are now active as well. Now I finally want to deploy GPU spaces via Lagrange.

This specific one no longer works. Do you have an idea why? It seems HTTPS is expected, although all my configs say to use HTTP.

At first I see about 5 minutes of progress logs, which end like this in the compute-provider logs:

{"status":"Pushing","progressDetail":{"current":2627464641,"total":2628912606},"progress":"[=================================================\u003e ]  2.627GB/2.629GB","id":"837c9a2fc6cf"}
{"status":"Pushing","progressDetail":{"current":2629180416,"total":2628912606},"progress":"[==================================================\u003e]  2.629GB","id":"837c9a2fc6cf"}
{"status":"Pushed","progressDetail":{},"id":"837c9a2fc6cf"}
{"status":"1708720079: digest: sha256:4c3f87cd45a0a99c58d2e91ed24647f6248caada44b31daaf1108b4184f5373e size: 3493"}
{"progressDetail":{},"aux":{"Tag":"1708720079","Digest":"sha256:4c3f87cd45a0a99c58d2e91ed24647f6248caada44b31daaf1108b4184f5373e","Size":3493}}
time="2024-02-23 21:33:10.542" level=info msg="Deleted ingress ing-671cd2fb-4e80-4107-b089-df49358c96ee finished" func=deleteJob file="cp_service.go:1067"
time="2024-02-23 21:33:10.543" level=info msg="Deleted service svc-671cd2fb-4e80-4107-b089-df49358c96ee finished" func=deleteJob file="cp_service.go:1073"
time="2024-02-23 21:33:16.545" level=info msg="Deleted deployment deploy-671cd2fb-4e80-4107-b089-df49358c96ee finished" func=deleteJob file="cp_service.go:1090"
time="2024-02-23 21:33:19.552" level=info msg="Deleted all resource finised. spaceUuid: 671cd2fb-4e80-4107-b089-df49358c96ee" func=deleteJob file="cp_service.go:1117"
time="2024-02-23 21:33:19.601" level=info msg="Created deployment: deploy-671cd2fb-4e80-4107-b089-df49358c96ee" func=DockerfileToK8s file="deploy.go:150"
time="2024-02-23 21:33:19.684" level=info msg="Created service successfully: svc-671cd2fb-4e80-4107-b089-df49358c96ee" func=deployK8sResource file="deploy.go:532"
time="2024-02-23 21:33:20.348" level=info msg="Created Ingress successfully: ing-671cd2fb-4e80-4107-b089-df49358c96ee" func=deployK8sResource file="deploy.go:540"

For other pods this works fine now, but this one stays in the error state ErrImagePull:

kubectl describe pod -n ns-XXX   deploy-671cd2fb-4e80-4107-b089-df49358c96ee-7cb78c9975-w2hb5 
Name:             deploy-671cd2fb-4e80-4107-b089-df49358c96ee-7cb78c9975-w2hb5
Namespace:        ns-XXX
Priority:         0
Service Account:  default
Node:             swan3/192.168.128.73
Start Time:       Fri, 23 Feb 2024 21:33:20 +0100
Labels:           lad_app=671cd2fb-4e80-4107-b089-df49358c96ee
                  pod-template-hash=7cb78c9975
Annotations:      cni.projectcalico.org/containerID: 2044a4080bb120c25bdd3b5432ea16d61a2df8161e590aba55fe44aca5378deb
                  cni.projectcalico.org/podIP: 172.16.59.105/32
                  cni.projectcalico.org/podIPs: 172.16.59.105/32
Status:           Pending
IP:               172.16.59.105
IPs:
  IP:           172.16.59.105
Controlled By:  ReplicaSet/deploy-671cd2fb-4e80-4107-b089-df49358c96ee-7cb78c9975
Containers:
  pod-671cd2fb-4e80-4107-b089-df49358c96ee:
    Container ID:   
    Image:          192.168.128.71:5000/stable-diffusion-bse-lora-df49358c96ee:1708720079
    Image ID:       
    Port:           9999/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                8
      ephemeral-storage:  20Gi
      memory:             64Gi
      nvidia.com/gpu:     1
    Requests:
      cpu:                8
      ephemeral-storage:  20Gi
      memory:             64Gi
      nvidia.com/gpu:     1
    Environment:
      space_uuid:  671cd2fb-4e80-4107-b089-df49358c96ee
      space_name:  Stable-Diffusion-Bse-LoRA
      result_url:  g6kj77kqio.bitstakehaven.com
      job_uuid:    a1a14259-477c-4426-8981-7ce7c224b5db
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-6p2lb (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-6p2lb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              NVIDIA-4090=true
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m40s                  default-scheduler  Successfully assigned ns-XX/deploy-671cd2fb-4e80-4107-b089-df49358c96ee-7cb78c9975-w2hb5 to swan3
  Warning  Failed     3m18s (x6 over 4m39s)  kubelet            Error: ImagePullBackOff
  Normal   Pulling    3m3s (x4 over 4m40s)   kubelet            Pulling image "192.168.128.71:5000/stable-diffusion-bse-lora-df49358c96ee:1708720079"
  Warning  Failed     3m3s (x4 over 4m40s)   kubelet            Failed to pull image "192.168.128.71:5000/stable-diffusion-bse-lora-df49358c96ee:1708720079": failed to pull and unpack image "192.168.128.71:5000/stable-diffusion-bse-lora-df49358c96ee:1708720079": failed to resolve reference "192.168.128.71:5000/stable-diffusion-bse-lora-df49358c96ee:1708720079": failed to do request: Head "https://192.168.128.71:5000/v2/stable-diffusion-bse-lora-df49358c96ee/manifests/1708720079": http: server gave HTTP response to HTTPS client
  Warning  Failed     3m3s (x4 over 4m40s)   kubelet            Error: ErrImagePull
  Normal   BackOff    2m52s (x7 over 4m39s)  kubelet            Back-off pulling image "192.168.128.71:5000/stable-diffusion-bse-lora-df49358c96ee:1708720079"
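
The "Failed to pull image" event above names the exact URL that was requested, which is where the HTTPS attempt shows up. As a rough sketch (the helper name `diagnose_pull_event` is hypothetical and not part of go-computing-provider or kubectl), the scheme, registry address, and the HTTP-vs-HTTPS mismatch can be pulled out of such a message like this:

```python
import re
from urllib.parse import urlsplit

# Error text seen in the event above when a TLS request is answered in plain HTTP.
HTTP_TO_HTTPS = "server gave HTTP response to HTTPS client"


def diagnose_pull_event(message):
    """Return (scheme, registry, scheme_mismatch) for a kubelet pull-failure message.

    scheme and registry are None when the message contains no quoted request URL.
    """
    mismatch = HTTP_TO_HTTPS in message
    match = re.search(r'(?:Head|Get) "([^"]+)"', message)
    if match is None:
        return None, None, mismatch
    url = urlsplit(match.group(1))
    return url.scheme, url.netloc, mismatch


# Tail of the event message copied from the pod description above.
event = (
    'failed to do request: Head '
    '"https://192.168.128.71:5000/v2/stable-diffusion-bse-lora-df49358c96ee/manifests/1708720079": '
    'http: server gave HTTP response to HTTPS client'
)

print(diagnose_pull_event(event))
```

For the event above this reports the `https` scheme against `192.168.128.71:5000` together with the mismatch flag, which suggests the runtime on the node that pulled the image is not using an `http://` endpoint for that registry.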

The IP is reachable. I can run curl 192.168.128.71:5000 without error.

sudo nano /etc/containerd/config.toml

[plugins."io.containerd.grpc.v1.cri".registry]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."192.168.128.71:5000"]
      endpoint = ["http://192.168.128.71:5000"]

[plugins."io.containerd.grpc.v1.cri".registry.configs]
  [plugins."io.containerd.grpc.v1.cri".registry.configs."192.168.128.71:5000".tls]
    insecure_skip_verify = true

nano /etc/docker/daemon.json

{
  "insecure-registries": ["192.168.128.71:5000"],
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}

ThomasBlock commented 4 months ago

Other tasks work, so I am playing Pacman on a 4090...

  Normal   Pulled            21s    kubelet            Container image "sxk1633/game-pacman:latest" already present on machine
  Normal   Created           21s    kubelet            Created container 41d84b94-3063-46a9-b96c-d0241ae62b22-super
  Normal   Started           21s    kubelet            Started container 41d84b94-3063-46a9-b96c-d0241ae62b22-super