ThomasBlock closed this issue 3 months ago.
By the way: it's strange that some models cannot be rented at all. For example, I have a 4060, a 4080, and an A5000. If you do not create new categories, it would be great to include them in the smaller/older categories: include the 4080 under 3080 and the A5000 under A4000, for example.
@ThomasBlock UBI tasks will use all kinds of GPUs. If you have a GPU type that is not in the hardware list, please open a new issue; we are glad to add it to the list.
@Normalnoise so my problem still exists: my provider appears on lagrangedao.org only in CPU mode and not in GPU mode, although I have enough free resources, as can be seen on orchestrator.swanchain.io.
As a consequence, my provider is full of CPU tasks, which is nice but a little bit useless for a GPU testnet.
Sorry @ThomasBlock, the team is on holiday, so your GPU type will be added to the list in about 10 days.
If your A4000 cannot be used, you can try manually assigning a GPU task to your own CP to see what happens when it receives one. You can do this by selecting the region of your CP.
Thank you for the feedback, but what you describe does not work. I can assign CPU tasks to my region,
but I cannot assign GPU tasks to my region; the option is greyed out, as seen here:
I can manually assign GPU tasks to myself, and they then run. But this way I don't collect SWAN rewards, right?
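For reference, here is the request used to submit the job to my CP's computing API (domain.com stands in for my real host):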
curl --location --request POST 'https://domain.com:8085/api/v1/computing/lagrange/jobs' \
--header 'Content-Type: application/json' \
--data-raw '{
    "uuid": "347ef245-74aa-4191-9b85-671a1016f3d4",
    "name": "Job-347ef245-74aa-4191-9b85-671a1016f3d4",
    "status": "Submitted",
    "duration": 900,
    "job_source_uri": "https://api.lagrangedao.org/spaces/671cd2fb-4e80-4107-b089-df49358c96ee",
    "storage_source": "lagrange",
    "task_uuid": "4146a29c-5387-4324-b896-d9939ebd1728"
}'
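To dig into what happens to such a job on the Kubernetes side, the pod it creates can be found via the lad_app label (its value is the space UUID, as visible in the pod description further down); this is plain kubectl, nothing Swan-specific:

# locate the pod for this space by its lad_app label (label value = space UUID)
kubectl get pods --all-namespaces -l lad_app=671cd2fb-4e80-4107-b089-df49358c96ee
# then inspect it; a description like the one below comes from this command
kubectl describe pod <pod-name> -n <namespace>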
Last time this worked, but now there are also errors, in case that helps you: the job failed, and the type is still CPU:
Priority:             0
Service Account:      default
Node:                 swan2/192.168.128.72
Start Time:           Sun, 11 Feb 2024 09:49:34 +0100
Labels:               lad_app=671cd2fb-4e80-4107-b089-df49358c96ee
                      pod-template-hash=5bff97bfc6
Annotations:          cni.projectcalico.org/containerID: 955b21812bee66c0e6ad6ff0ac1f67ab493a21f62763913dcceeb41a207a1c7f
                      cni.projectcalico.org/podIP:
                      cni.projectcalico.org/podIPs:
Status:               Failed
Reason:               Evicted
Message:              Pod ephemeral local storage usage exceeds the total limit of containers 5Gi.
IP:                   172.16.177.127
IPs:
  IP:  172.16.177.127
Controlled By:        ReplicaSet/deploy-671cd2fb-4e80-4107-b089-df49358c96ee-5bff97bfc6
Containers:
  pod-671cd2fb-4e80-4107-b089-df49358c96ee:
    Container ID:   containerd://95157ee932380a7af00813b0f4c15fc26191df6b4c2dda7877376c9d5c2f2355
    Image:          192.168.128.71:5000/stable-diffusion-bse-lora-df49358c96ee:1707641059
    Image ID:       192.168.128.71:5000/stable-diffusion-bse-lora-df49358c96ee@sha256:85c52687166e2ba2b4b4e5aa4ca59c5157f75f10aad06fb008320bef8566ae0b
    Port:           9999/TCP
    Host Port:      0/TCP
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Sun, 11 Feb 2024 09:50:31 +0100
      Finished:     Sun, 11 Feb 2024 09:51:53 +0100
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                4
      ephemeral-storage:  5Gi
      memory:             4Gi
      nvidia.com/gpu:     0
    Requests:
      cpu:                4
      ephemeral-storage:  5Gi
      memory:             4Gi
      nvidia.com/gpu:     0
    Environment:
      space_uuid:  671cd2fb-4e80-4107-b089-df49358c96ee
      space_name:  Stable-Diffusion-Bse-LoRA
      result_url:  5jchih183r.domain.com
      job_uuid:    347ef245-74aa-4191-9b85-671a1016f3d4
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m6j5f (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-m6j5f:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason               Age    From               Message
  ----     ------               ----   ----               -------
  Normal   Scheduled            4m38s  default-scheduler  Successfully assigned ns-0x.../deploy-671cd2fb-4e80-4107-b089-df49358c96ee-5bff97bfc6-6mxww to swan2
  Normal   Pulling              4m37s  kubelet            Pulling image "192.168.128.71:5000/stable-diffusion-bse-lora-df49358c96ee:1707641059"
  Normal   Pulled               3m41s  kubelet            Successfully pulled image "192.168.128.71:5000/stable-diffusion-bse-lora-df49358c96ee:1707641059" in 56.412s (56.412s including waiting)
  Normal   Created              3m41s  kubelet            Created container pod-671cd2fb-4e80-4107-b089-df49358c96ee
  Normal   Started              3m41s  kubelet            Started container pod-671cd2fb-4e80-4107-b089-df49358c96ee
  Warning  Evicted              2m35s  kubelet            Pod ephemeral local storage usage exceeds the total limit of containers 5Gi.
  Normal   Killing              2m35s  kubelet            Stopping container pod-671cd2fb-4e80-4107-b089-df49358c96ee
  Warning  ExceededGracePeriod  2m25s  kubelet            Container runtime did not kill the pod within specified grace period.
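Reading the events above: the pod was evicted because it wrote more than its 5Gi ephemeral-storage limit, and the container state also shows OOMKilled (exit code 137) against the 4Gi memory limit. These limits come from the CP software, so the clean fix belongs there. As a purely hypothetical manual workaround, assuming the generated Deployment can be edited (the CP may well overwrite it again), a strategic-merge patch could raise both limits; the requests have to be raised to the same values to keep the Guaranteed QoS class:

# hypothetical workaround: raise storage/memory limits on the generated Deployment
# (<namespace> is the ns-0x... namespace from the describe output; 20Gi/8Gi are example values)
kubectl -n <namespace> patch deployment deploy-671cd2fb-4e80-4107-b089-df49358c96ee \
  -p '{"spec":{"template":{"spec":{"containers":[{
        "name":"pod-671cd2fb-4e80-4107-b089-df49358c96ee",
        "resources":{
          "limits":   {"ephemeral-storage":"20Gi","memory":"8Gi"},
          "requests": {"ephemeral-storage":"20Gi","memory":"8Gi"}
        }
      }]}}}}'

The 20Gi/8Gi values are arbitrary examples; a Stable Diffusion image plus model weights can easily exceed 5Gi of scratch space.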
So it's all good that we see a lot of CPU tasks on the network, but it would also be nice to utilize the GPUs. Last week I could lease my system, but this week it can no longer be found in the stats. What could be the problem here?
I have enough capacity, and it is published to the hub,
but I cannot deploy the category on lagrangedao.org/space.