swanchain / go-computing-provider

A Golang implementation of a computing provider
MIT License

gpu tasks #18

Closed. ThomasBlock closed this issue 3 months ago.

ThomasBlock commented 5 months ago

It's all good that we see a lot of CPU tasks on the network, but it would also be nice to utilize the GPUs. Last week I could lease my system, but this week it can no longer be found in the stats. What could be the problem here?

I have enough capacity, and it is published to the hub. [screenshot]

But I cannot deploy the category on lagrangedao.org/space.

[Screenshot from 2024-02-01 20:13]

ThomasBlock commented 5 months ago

By the way, it's strange that some models cannot be rented at all. For example, I have a 4060, a 4080, and an A5000. If you do not create new categories, it would be great to include these cards in the smaller/older categories: include the 4080 in the 3080 category and the A5000 in the A4000 category, for example.

Normalnoise commented 5 months ago

@ThomasBlock UBI tasks will use all kinds of GPUs. If you have a GPU type that is not in the hardware list, please open a new issue; we are glad to add it to the list.

ThomasBlock commented 4 months ago

@Normalnoise so my problem still exists: my provider appears on lagrangedao.org only in CPU mode and not in GPU mode, although I have enough free resources, as can be seen on orchestrator.swanchain.io.

As a consequence, my provider is full of CPU tasks, which is nice but a little bit useless for a GPU testnet.

[Screenshot from 2024-02-11 09:16]
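
(A minimal sanity check for this situation, assuming the CP schedules jobs onto Kubernetes, as the pod description later in this thread shows: GPU tasks are only schedulable if the NVIDIA device plugin advertises nvidia.com/gpu on the node. The node name swan2 is taken from that pod description.)

# Check whether the node actually advertises GPU capacity to Kubernetes
kubectl describe node swan2 | grep -i 'nvidia.com/gpu'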

Normalnoise commented 4 months ago

Sorry @ThomasBlock, the team is on holiday, so your GPU type will be added to the list in about 10 days.

Normalnoise commented 4 months ago

If your A4000 cannot be used, you can try to manually assign a GPU task to your own CP to see what happens when it receives a GPU task. This can be done by selecting the region of your CP.

ThomasBlock commented 4 months ago

> If your A4000 cannot be used, you can try to manually assign a GPU task to your own CP to see what happens when it receives a GPU task. This can be done by selecting the region of your CP.

Thank you for the feedback, but what you describe does not work. I can assign CPU tasks to my region:

[Screenshot from 2024-02-11 09:49]

But I cannot assign a GPU task to my region; the option is greyed out. [screenshot]

I can manually assign GPU tasks to myself, and then they run. But this way I don't collect SWAN rewards, right?

curl --location --request POST 'https://domain.com:8085/api/v1/computing/lagrange/jobs' \
--header 'Content-Type: application/json' \
--data-raw '{
"uuid": "347ef245-74aa-4191-9b85-671a1016f3d4",
"name": "Job-347ef245-74aa-4191-9b85-671a1016f3d4",
"status": "Submitted",
"duration": 900,
"job_source_uri": "https://api.lagrangedao.org/spaces/671cd2fb-4e80-4107-b089-df49358c96ee",
"storage_source": "lagrange",
"task_uuid": "4146a29c-5387-4324-b896-d9939ebd1728"
}'

Last time this worked, but now there are also errors, if that helps you: the deployment failed, and the type is still CPU. [Screenshot from 2024-02-11 09:53]
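
(For context, the pod description below was presumably captured with something like the following; the label value is the space UUID, as shown under "Labels" below, and <cp-namespace> stands in for the real namespace, which is truncated in the events further down.)

kubectl -n <cp-namespace> describe pod \
  -l lad_app=671cd2fb-4e80-4107-b089-df49358c96ee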

Priority:         0
Service Account:  default
Node:             swan2/192.168.128.72
Start Time:       Sun, 11 Feb 2024 09:49:34 +0100
Labels:           lad_app=671cd2fb-4e80-4107-b089-df49358c96ee
                  pod-template-hash=5bff97bfc6
Annotations:      cni.projectcalico.org/containerID: 955b21812bee66c0e6ad6ff0ac1f67ab493a21f62763913dcceeb41a207a1c7f
                  cni.projectcalico.org/podIP: 
                  cni.projectcalico.org/podIPs: 
Status:           Failed
Reason:           Evicted
Message:          Pod ephemeral local storage usage exceeds the total limit of containers 5Gi. 
IP:               172.16.177.127
IPs:
  IP:           172.16.177.127
Controlled By:  ReplicaSet/deploy-671cd2fb-4e80-4107-b089-df49358c96ee-5bff97bfc6
Containers:
  pod-671cd2fb-4e80-4107-b089-df49358c96ee:
    Container ID:   containerd://95157ee932380a7af00813b0f4c15fc26191df6b4c2dda7877376c9d5c2f2355
    Image:          192.168.128.71:5000/stable-diffusion-bse-lora-df49358c96ee:1707641059
    Image ID:       192.168.128.71:5000/stable-diffusion-bse-lora-df49358c96ee@sha256:85c52687166e2ba2b4b4e5aa4ca59c5157f75f10aad06fb008320bef8566ae0b
    Port:           9999/TCP
    Host Port:      0/TCP
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Sun, 11 Feb 2024 09:50:31 +0100
      Finished:     Sun, 11 Feb 2024 09:51:53 +0100
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                4
      ephemeral-storage:  5Gi
      memory:             4Gi
      nvidia.com/gpu:     0
    Requests:
      cpu:                4
      ephemeral-storage:  5Gi
      memory:             4Gi
      nvidia.com/gpu:     0
    Environment:
      space_uuid:  671cd2fb-4e80-4107-b089-df49358c96ee
      space_name:  Stable-Diffusion-Bse-LoRA
      result_url:  5jchih183r.domain.com
      job_uuid:    347ef245-74aa-4191-9b85-671a1016f3d4
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m6j5f (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-m6j5f:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason               Age    From               Message
  ----     ------               ----   ----               -------
  Normal   Scheduled            4m38s  default-scheduler  Successfully assigned ns-0x.../deploy-671cd2fb-4e80-4107-b089-df49358c96ee-5bff97bfc6-6mxww to swan2
  Normal   Pulling              4m37s  kubelet            Pulling image "192.168.128.71:5000/stable-diffusion-bse-lora-df49358c96ee:1707641059"
  Normal   Pulled               3m41s  kubelet            Successfully pulled image "192.168.128.71:5000/stable-diffusion-bse-lora-df49358c96ee:1707641059" in 56.412s (56.412s including waiting)
  Normal   Created              3m41s  kubelet            Created container pod-671cd2fb-4e80-4107-b089-df49358c96ee
  Normal   Started              3m41s  kubelet            Started container pod-671cd2fb-4e80-4107-b089-df49358c96ee
  Warning  Evicted              2m35s  kubelet            Pod ephemeral local storage usage exceeds the total limit of containers 5Gi.
  Normal   Killing              2m35s  kubelet            Stopping container pod-671cd2fb-4e80-4107-b089-df49358c96ee
  Warning  ExceededGracePeriod  2m25s  kubelet            Container runtime did not kill the pod within specified grace period.
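
(Two separate failures are visible in this output, plus one telling detail. The pod was Evicted because its ephemeral local storage exceeded the 5Gi limit, and the container itself was OOMKilled with exit code 137 under the 4Gi memory limit. Also note nvidia.com/gpu: 0 in both requests and limits: the job really was deployed as a CPU task, matching what the dashboard shows. As a purely local experiment, the generated deployment's limits could be raised with something like the sketch below. The deployment name is derived from the "Controlled By" line above with the pod-template hash stripped, <cp-namespace> is a placeholder, and the CP will regenerate these resources on the next job, so this does not fix the provider configuration itself.)

# Raise memory and ephemeral-storage limits on the generated deployment;
# requests are raised to match, to keep the Guaranteed QoS class
kubectl -n <cp-namespace> set resources deployment \
  deploy-671cd2fb-4e80-4107-b089-df49358c96ee \
  --requests=memory=8Gi,ephemeral-storage=20Gi \
  --limits=memory=8Gi,ephemeral-storage=20Gi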