Closed ThomasBlock closed 1 month ago
@sonic-chain please track this issue
Solution:

ubi task not working:
- Update resource-exporter. This component detects the CPU model of the machine (AMD or Intel) and pulls the matching ubi-task image:
kubectl delete ds -n kube-system resource-exporter-ds
docker rmi -f filswan/resource-exporter:v11.2.5
- Reinstall resource-exporter. Refer to: [install-the-hardware-resource-exporter]

error 400 and Error building Docker image:
- Check whether the space contains a Dockerfile, and whether the Dockerfile format is correct
- Check whether the user running the CP can run docker commands normally
init still broken:
- There is a delay between deploying the contract and the data being synchronized to the chain. The server has a retry mechanism for fetching the CP information from the contract, so if you still receive ubi-task jobs you can ignore this error. @ThomasBlock
Thank you for the reply. I don't want to be rude, but it always looks like you reply to me with "standard phrases" and don't really look into the problem. It is also very slow. It looks like this whole team is really uninterested, or overworked!
I have been running this for several weeks. Of course I know that docker runs.
I updated resource-exporter on all nodes. The ubi task is still broken:
time="2024-03-05 09:55:31.728" level=info msg="receive ubi task received: {ID:29710 Name:1000-0-7-19330 Type:0 ZkType:fil-c2-512M InputParam:https://download.fogmetalabs.com/ipfs/QmT7Q8AWBziSbWY8XMZt964tqP48QZncMEvADUhkXRbXPP Signature:0x130502f6fb2adf3d88ab8c3751274b44388b6b959085e86608c4594ff50abcfd13d0b52ccd0f6f212a208f095bd148960399673f34435137b9851b782cc96dd701 Resource:0xc000b078c0}" func=DoUbiTask file="cp_service.go:563"
time="2024-03-05 09:55:31.729" level=info msg="ubi task sign verifing, task_id: 29710, type: fil-c2-512M, verify: true" func=DoUbiTask file="cp_service.go:603"
time="2024-03-05 09:55:31.762" level=info msg="checkResourceAvailableForUbi: needCpu: 1, needMemory: 5.00, needStorage: 1.00" func=checkResourceAvailableForUbi file="cp_service.go:1302"
time="2024-03-05 09:55:31.762" level=info msg="checkResourceAvailableForUbi: remainingCpu: 4, remainingMemory: 13.00, remainingStorage: 353.00" func=checkResourceAvailableForUbi file="cp_service.go:1303"
[GIN] 2024/03/05 - 09:55:31 | 200 | 33.692995ms | 38.104.153.43 | POST "/api/v1/computing/cp/ubi"
time="2024-03-05 09:55:31.770" level=error msg="Failed creating ubi task job: Job.batch \"fil-c2-512m-29710\" is invalid: spec.template.spec.containers[0].image: Required value" func=func1 file="cp_service.go:845"
invalid reference format - I posted all the logs. Please tell me if the space itself is broken; I think it is okay, as I ran it in the past. Can you recommend a "good" space so that we can really test the GPU?
@sonic-chain @Normalnoise ?
time="2024-03-05 09:55:31.770" level=error msg="Failed creating ubi task job: Job.batch \"fil-c2-512m-29710\" is invalid: spec.template.spec.containers[0].image: Required value" func=func1 file="cp_service.go:845"
@ThomasBlock this issue means your CP is still using the old version of resource-exporter, so you need to re-install it following these steps:
kubectl delete ds -n kube-system resource-exporter-ds
docker rmi -f filswan/resource-exporter:v11.2.5
i updated resource-exporter on all nodes. ubi task is still broken
please give me your K8s cluster information, including how many nodes are in your K8s cluster, and the master and every node's information (like GPU type, CPU type)
Please execute the following command in your k8s cluster and post the log:
kubectl logs --tail=1 -fn kube-system resource-exporter-ds-xxx
i updated resource-exporter on all nodes. ubi task is still broken
please give me your K8s cluster information, including how many nodes are in your K8s cluster, and the master and every node's information (like GPU type, CPU type)
kubectl get nodes
NAME    STATUS   ROLES           AGE   VERSION
swan1   Ready    control-plane   47d   v1.28.2
swan3   Ready
swan1 = Ryzen 7 (master)
swan3 = Ryzen 7 + 4090
swan5, swan6 = Xeon
swan7 = Ryzen 7 + 3090
time="2024-03-05 09:55:31.770" level=error msg="Failed creating ubi task job: Job.batch "fil-c2-512m-29710" is invalid: spec.template.spec.containers[0].image: Required value" func=func1 file="cp_service.go:845"
@ThomasBlock this issue means your CP is still using the old version of resource-exporter, so you need to re-install it following these steps:
kubectl delete ds -n kube-system resource-exporter-ds
docker rmi -f filswan/resource-exporter:v11.2.5
- Reinstall resource-exporter. Refer to: install-the-hardware-resource-exporter
By the way, it is really odd that you don't increment the version number; that is confusing. But I will do it again. The docker command is not working for me; is the following also okay?
kubectl delete ds -n kube-system resource-exporter-ds
daemonset.apps "resource-exporter-ds" deleted
docker rmi -f filswan/resource-exporter:v11.2.5
Error response from daemon: No such image: filswan/resource-exporter:v11.2.5
ctr -n k8s.io images remove docker.io/filswan/resource-exporter:v11.2.5
docker.io/filswan/resource-exporter:v11.2.5
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: kube-system
  name: resource-exporter-ds
  labels:
    app: resource-exporter
spec:
  selector:
    matchLabels:
      app: resource-exporter
  template:
    metadata:
      labels:
        app: resource-exporter
    spec:
      containers:
      - name: resource-exporter
        image: filswan/resource-exporter:v11.2.5
        imagePullPolicy: IfNotPresent
EOF
daemonset.apps/resource-exporter-ds created
Please execute the following command in your k8s cluster and post the log:
kubectl logs --tail=1 -fn kube-system resource-exporter-ds-xxx
kubectl logs --tail=1 -fn kube-system resource-exporter-ds-6lssn
{"gpu":{"driver_version":"535.161.07","cuda_version":"12020","attached_gpus":1,"details":[{"product_name":"NVIDIA 4090","fb_memory_usage":{"total":"24564 MiB","used":"347 MiB","free":"24216 MiB"},"bar1_memory_usage":{"total":"32768 MiB","used":"2 MiB","free":"32765 MiB"}}]},"cpu_name":"AMD"}
kubectl logs --tail=1 -fn kube-system resource-exporter-ds-9lcxf
If the node has a GPU, this error can be ignored. ERROR:: unable to initialize NVML: 12
kubectl logs --tail=1 -fn kube-system resource-exporter-ds-c48zq
{"gpu":{"driver_version":"535.161.07","cuda_version":"12020","attached_gpus":1,"details":[{"product_name":"NVIDIA 3090","fb_memory_usage":{"total":"24576 MiB","used":"317 MiB","free":"24258 MiB"},"bar1_memory_usage":{"total":"256 MiB","used":"2 MiB","free":"253 MiB"}}]},"cpu_name":"AMD"}
kubectl logs --tail=1 -fn kube-system resource-exporter-ds-k8xfq
If the node has a GPU, this error can be ignored. ERROR:: unable to initialize NVML: 12
kubectl logs --tail=1 -fn kube-system resource-exporter-ds-rnwxz
If the node has a GPU, this error can be ignored. ERROR:: unable to initialize NVML: 12
The resource-exporter pod on a node of your cluster cannot obtain resource information and reports an error. You can set RUST_GPU_TOOLS_CUSTOM_GPU to 3090 or 4090 in the fil-c2.env file so that k8s schedules the task to the healthy nodes.
I fixed the case where resource-exporter could not retrieve information when the nvidia-plugin is not installed. Could you please upgrade to filswan/resource-exporter:v11.2.6 and try again?
I fixed the case where resource-exporter could not retrieve information when the nvidia-plugin is not installed. Could you please upgrade to filswan/resource-exporter:v11.2.6 and try again?
kubectl delete ds -n kube-system resource-exporter-ds
daemonset.apps "resource-exporter-ds" deleted
5x (once per node):
ctr -n k8s.io images remove docker.io/filswan/resource-exporter:v11.2.5
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: kube-system
  name: resource-exporter-ds
  labels:
    app: resource-exporter
spec:
  selector:
    matchLabels:
      app: resource-exporter
  template:
    metadata:
      labels:
        app: resource-exporter
    spec:
      containers:
      - name: resource-exporter
        image: filswan/resource-exporter:v11.2.6
        imagePullPolicy: IfNotPresent
EOF
kubectl logs --tail=1 -fn kube-system resource-exporter-ds-5r2rv
{"gpu":{"driver_version":"535.161.07","cuda_version":"12020","attached_gpus":1,"details":[{"product_name":"NVIDIA 3090","fb_memory_usage":{"total":"24576 MiB","used":"320 MiB","free":"24255 MiB"},"bar1_memory_usage":{"total":"256 MiB","used":"3 MiB","free":"252 MiB"}}]},"cpu_name":"AMD"}
kubectl logs --tail=1 -fn kube-system resource-exporter-ds-jdpn4
{"gpu":{"driver_version":"535.161.07","cuda_version":"12020","attached_gpus":1,"details":[{"product_name":"NVIDIA 4090","fb_memory_usage":{"total":"24564 MiB","used":"347 MiB","free":"24216 MiB"},"bar1_memory_usage":{"total":"32768 MiB","used":"2 MiB","free":"32765 MiB"}}]},"cpu_name":"AMD"}
kubectl logs --tail=1 -fn kube-system resource-exporter-ds-9dg2z
{"gpu":{"driver_version":"","cuda_version":"","attached_gpus":0,"details":null},"cpu_name":""}
The node not found nvm libnvidia, if the node does not have a GPU, this error can be ignored.
kubectl logs --tail=1 -fn kube-system resource-exporter-ds-ljzbs
{"gpu":{"driver_version":"","cuda_version":"","attached_gpus":0,"details":null},"cpu_name":"INTEL"}
The node not found nvm libnvidia, if the node does not have a GPU, this error can be ignored.
kubectl logs --tail=1 -fn kube-system resource-exporter-ds-x2jqq
{"gpu":{"driver_version":"","cuda_version":"","attached_gpus":0,"details":null},"cpu_name":""}
Some don't report a CPU, but this might make sense: I checked, and some of them are virtual machines with the "generic" Proxmox CPU type. Some are type "host"; for those we can see "cpu_name":"INTEL" and "cpu_name":"AMD".
The resource-exporter pod on a node of your cluster cannot obtain resource information and reports an error. You can set RUST_GPU_TOOLS_CUSTOM_GPU to 3090 or 4090 in the fil-c2.env file so that k8s schedules the task to the healthy nodes.
OK, thank you, I did this. Will this help me to get 32G ubi tasks? Until now I only receive 512M tasks:
RUST_GPU_TOOLS_CUSTOM_GPU="NVIDIA GeForce RTX 3090:10496,NVIDIA GeForce RTX 4090:16384"
There are only 512M tasks at the moment; the 32G tasks are gone for now, but there will be more in the future.
ubi-task needs to pull different docker images according to different CPU architectures. Currently, it only supports intel and amd. Other architectures are not supported for the time being.
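This architecture-dependent image choice also explains the earlier "containers[0].image: Required value" Job error: when the reported cpu_name is empty or unsupported, no image can be selected. A minimal sketch (all names here are made up, not the CP server's actual code or image names):

```go
package main

import (
	"fmt"
	"strings"
)

// ubiTaskImage picks a ubi-task image from the cpu_name reported by
// resource-exporter. Only "AMD" and "INTEL" are supported; an empty or
// unknown cpu_name yields no image, which is what would leave the k8s
// Job spec with containers[0].image unset ("Required value").
func ubiTaskImage(cpuName string) (string, error) {
	switch strings.ToUpper(cpuName) {
	case "AMD":
		return "example/ubi-task:amd", nil
	case "INTEL":
		return "example/ubi-task:intel", nil
	default:
		return "", fmt.Errorf("unsupported cpu_name %q", cpuName)
	}
}

func main() {
	for _, cpu := range []string{"AMD", "INTEL", ""} {
		img, err := ubiTaskImage(cpu)
		fmt.Printf("cpu=%q image=%q err=%v\n", cpu, img, err)
	}
}
```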
Okay, fine. We have now clarified that VMs should use the "host" processor type. Any further comments on my setup?
My ubi tasks are working now and proofs are submitted on-chain. But in the last week there have been no payouts, for me and other users.
If you have not received the rewards, there may be a few reasons:
Thank you for the update v0.4.4 @Normalnoise. You did not really mention whether Multichain.storage is still down, but here is my feedback on the current state:
- ubi task not working
- lagrange GPU renting OK ("pacman"), although with a 400 error
- lagrange GPU renting with stablediffusion not possible: error 400 and Error building Docker image: Error response from daemon: invalid reference format
- init still broken
- collateral still inaccurate:
HUB info for beneficiary: 8.65 sETH
HUB info for owner: 1.49 sETH
computing-provider info: Collateral(SWAN-ETH): 0.95000