swanchain / go-computing-provider

A golang implementation of computing provider
MIT License
11 stars 15 forks

v0.4.4. feedback: Error building Docker image #35

Closed: ThomasBlock closed this issue 1 month ago

ThomasBlock commented 4 months ago

Thank you for the v0.4.4 update @Normalnoise. You did not really say whether Multichain.storage is still down, but here is my feedback on the current state:

ubi task not working:

[GIN] 2024/03/04 - 14:55:31 | 200 |  385.587333ms |   38.104.153.43 | GET      "/api/v1/computing/cp"
time="2024-03-04 14:55:31.604" level=info msg="receive ubi task received: {ID:28241 Name:1000-0-7-17861 Type:0 ZkType:fil-c2-512M InputParam:https://download.fogmetalabs.com/ipfs/QmNr558fVVuw2G8vMZ74fvfR2itzLdH47QYwvULciYXBBH Signature:0x90d15e5ba418c9af929587d6f58e22d493deacf0ba1af27bb9a5e4503e2ffcb942c6af8760391679a6edf47ee3eedbfa2fac341100ce32dc7b843b618adb666501 Resource:0xc000b9ad40}" func=DoUbiTask file="cp_service.go:563"
time="2024-03-04 14:55:31.604" level=info msg="ubi task sign verifing, task_id: 28241, type: fil-c2-512M, verify: true" func=DoUbiTask file="cp_service.go:603"
time="2024-03-04 14:55:31.650" level=info msg="checkResourceAvailableForUbi: needCpu: 1, needMemory: 5.00, needStorage: 1.00" func=checkResourceAvailableForUbi file="cp_service.go:1302"
time="2024-03-04 14:55:31.650" level=info msg="checkResourceAvailableForUbi: remainingCpu: 4, remainingMemory: 13.00, remainingStorage: 353.00" func=checkResourceAvailableForUbi file="cp_service.go:1303"
[GIN] 2024/03/04 - 14:55:31 | 200 |   46.700585ms |   38.104.153.43 | POST     "/api/v1/computing/cp/ubi"
time="2024-03-04 14:55:31.659" level=error msg="Failed creating ubi task job: Job.batch \"fil-c2-512m-28241\" is invalid: spec.template.spec.containers[0].image: Required value" func=func1 file="cp_service.go:845"

Lagrange GPU renting OK ("pac-man"), although with a 400 error:

time="2024-03-04 14:59:14.893" level=info msg="submitting job..." func=submitJob file="cp_service.go:127"
time="2024-03-04 14:59:15.119" level=info msg="uuid: 41d84b94-3063-46a9-b96c-d0241ae62b22, spaceName: pac-man, hardwareName: Nvidia 3090 · 4 vCPU · 8 GiB" func=DeploySpaceTask file="cp_service.go:1035"
time="2024-03-04 14:59:15.186" level=info msg="uploading file to bucket, objectName: jobs/4f91b4f7-c31e-49a6-921a-125a6a00a56a.json, filePath: /tmp/jobs/4f91b4f7-c31e-49a6-921a-125a6a00a56a.json" func=UploadFileToBucket file="storage_service.go:52"
time="2024-03-04 14:59:15.432" level=info msg="Download 41d84b94-3063-46a9-b96c-d0241ae62b22 successfully." func=BuildSpaceTaskImage file="buildspace.go:33"
time="2024-03-04 14:59:15.437" level=info msg="Deleted ingress ing-41d84b94-3063-46a9-b96c-d0241ae62b22 finished" func=deleteJob file="cp_service.go:1084"
time="2024-03-04 14:59:15.462" level=info msg="Deleted service svc-41d84b94-3063-46a9-b96c-d0241ae62b22 finished" func=deleteJob file="cp_service.go:1090"
time="2024-03-04 14:59:16.185" level=error msg="http status: 400 Bad Request, code:400, url:https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=878494a8-6ab7-4694-96ac-fc89a2afcbe1&object_name=jobs/4f91b4f7-c31e-49a6-921a-125a6a00a56a.json" func=HttpRequest file="restful.go:127"
time="2024-03-04 14:59:16.185" level=error msg="https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=878494a8-6ab7-4694-96ac-fc89a2afcbe1&object_name=jobs/4f91b4f7-c31e-49a6-921a-125a6a00a56a.json failed, status:error, message:invalid param value:record not found" func=HttpRequest file="restful.go:154"
time="2024-03-04 14:59:16.185" level=error msg="https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=878494a8-6ab7-4694-96ac-fc89a2afcbe1&object_name=jobs/4f91b4f7-c31e-49a6-921a-125a6a00a56a.json failed, status:error, message:invalid param value:record not found" func=HttpGet file="restful.go:64"
[GIN] 2024/03/04 - 14:59:16 | 200 |       21.53µs |   38.104.153.43 | GET      "/api/v1/computing/host/info"
time="2024-03-04 14:59:17.173" level=error msg="http status: 400 Bad Request, code:400, url:https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=878494a8-6ab7-4694-96ac-fc89a2afcbe1&object_name=jobs" func=HttpRequest file="restful.go:127"
time="2024-03-04 14:59:17.173" level=error msg="https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=878494a8-6ab7-4694-96ac-fc89a2afcbe1&object_name=jobs failed, status:error, message:invalid param value:record not found" func=HttpRequest file="restful.go:154"
time="2024-03-04 14:59:17.173" level=error msg="https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=878494a8-6ab7-4694-96ac-fc89a2afcbe1&object_name=jobs failed, status:error, message:invalid param value:record not found" func=HttpGet file="restful.go:64"
time="2024-03-04 14:59:19.419" level=info msg="file name:1_4f91b4f7-c31e-49a6-921a-125a6a00a56a.json, chunk size:712" func=func1 file="file.go:248"
time="2024-03-04 14:59:21.472" level=info msg="Deleted deployment deploy-41d84b94-3063-46a9-b96c-d0241ae62b22 finished" func=deleteJob file="cp_service.go:1107"
time="2024-03-04 14:59:21.606" level=info msg="jobuuid: 2b133c35-5bfa-4a80-980b-3e0c5d4f2478 successfully submitted to IPFS" func=submitJob file="cp_service.go:155"
time="2024-03-04 14:59:21.869" level=info msg="submit job detail: {UUID:2b133c35-5bfa-4a80-980b-3e0c5d4f2478 Name:Job-2b133c35-5bfa-4a80-980b-3e0c5d4f2478 Status:submitted Duration:3600 JobSourceURI:https://api.lagrangedao.org/spaces/41d84b94-3063-46a9-b96c-d0241ae62b22 JobResultURI:https://7d67303d2964.acl.swanipfs.com/ipfs/QmcqDAA2d5QrE5FYPipZVqnUNTf8qVeWFVueeU5ToDeVsQ StorageSource:lagrange TaskUUID:a4fb3004-1dab-460d-8662-f58bc221241e CreatedAt:1709560754 UpdatedAt:1709560754 BuildLog:wss://log.bitstakehaven.com:8085/api/v1/computing/lagrange/spaces/log?space_id=41d84b94-3063-46a9-b96c-d0241ae62b22&type=build ContainerLog:wss://log.bitstakehaven.com:8085/api/v1/computing/lagrange/spaces/log?space_id=41d84b94-3063-46a9-b96c-d0241ae62b22&type=container}" func=ReceiveJob file="cp_service.go:122"
[GIN] 2024/03/04 - 14:59:21 | 200 |  7.500569152s |   38.104.153.43 | POST     "/api/v1/computing/lagrange/jobs"

Lagrange GPU renting with Stable Diffusion not possible: error 400 and "Error building Docker image: Error response from daemon: invalid reference format":

time="2024-03-04 15:04:42.396" level=info msg="Job received Data: {UUID:88911b11-5f3a-46a8-bb70-5efa55afdbf4 Name:Job-88911b11-5f3a-46a8-bb70-5efa55afdbf4 Status:Submitted Duration:3600 JobSourceURI:https://api.lagrangedao.org/spaces/201d7532-8284-4ebd-b348-ff5ce862beda JobResultURI: StorageSource:lagrange TaskUUID:7db9916e-397f-493d-b20f-0e5688042897 CreatedAt:1709561082 UpdatedAt:1709561082 BuildLog: ContainerLog:}" func=ReceiveJob file="cp_service.go:79"
time="2024-03-04 15:04:42.899" level=info msg="checkResourceAvailableForSpace: needCpu: 8, needMemory: 16.00, needStorage: 20.00" func=checkResourceAvailableForSpace file="cp_service.go:1227"
time="2024-03-04 15:04:42.899" level=info msg="checkResourceAvailableForSpace: remainingCpu: 4, remainingMemory: 13.00, remainingStorage: 353.00" func=checkResourceAvailableForSpace file="cp_service.go:1228"
time="2024-03-04 15:04:42.899" level=info msg="checkResourceAvailableForSpace: needCpu: 8, needMemory: 16.00, needStorage: 20.00" func=checkResourceAvailableForSpace file="cp_service.go:1227"
time="2024-03-04 15:04:42.899" level=info msg="checkResourceAvailableForSpace: remainingCpu: 23, remainingMemory: 65.00, remainingStorage: 1593.00" func=checkResourceAvailableForSpace file="cp_service.go:1228"
time="2024-03-04 15:04:42.899" level=info msg="gpuName: NVIDIA-4090, nodeGpu: map[:0 kubernetes.io/os:0], nodeGpuSummary: map[swan3:map[NVIDIA-4090:1] swan7:map[NVIDIA-3090:1]]" func=checkResourceAvailableForSpace file="cp_service.go:1235"
time="2024-03-04 15:04:42.900" level=info msg="submitting job..." func=submitJob file="cp_service.go:127"
time="2024-03-04 15:04:42.900" level=info msg="uploading file to bucket, objectName: jobs/bfbd05a0-eff5-4275-a0db-7adc9d9bab66.json, filePath: /tmp/jobs/bfbd05a0-eff5-4275-a0db-7adc9d9bab66.json" func=UploadFileToBucket file="storage_service.go:52"
time="2024-03-04 15:04:43.126" level=info msg="uuid: 201d7532-8284-4ebd-b348-ff5ce862beda, spaceName: myDiffusion, hardwareName: Nvidia 4090 · 8 vCPU · 16 GiB" func=DeploySpaceTask file="cp_service.go:1035"
time="2024-03-04 15:04:43.425" level=info msg="Download 201d7532-8284-4ebd-b348-ff5ce862beda successfully." func=BuildSpaceTaskImage file="buildspace.go:33"
time="2024-03-04 15:04:43.540" level=info msg="Download 201d7532-8284-4ebd-b348-ff5ce862beda successfully." func=BuildSpaceTaskImage file="buildspace.go:33"
time="2024-03-04 15:04:43.654" level=info msg="Download 201d7532-8284-4ebd-b348-ff5ce862beda successfully." func=BuildSpaceTaskImage file="buildspace.go:33"
2024/03/04 15:04:43 Image path: build/0x7B0CEe1939a4AdA062EC79f4862a42C1F47B1806/spaces/myDiffusion
time="2024-03-04 15:04:43.655" level=error msg="Error building Docker image: Error response from daemon: invalid reference format" func=BuildImagesByDockerfile file="buildspace.go:80"
time="2024-03-04 15:04:43.655" level=info msg="Failed to extract exposed port: unable to open Dockerfile: open : no such file or directory" func=DockerfileToK8s file="deploy.go:97"
time="2024-03-04 15:04:43.921" level=error msg="http status: 400 Bad Request, code:400, url:https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=878494a8-6ab7-4694-96ac-fc89a2afcbe1&object_name=jobs/bfbd05a0-eff5-4275-a0db-7adc9d9bab66.json" func=HttpRequest file="restful.go:127"
time="2024-03-04 15:04:43.921" level=error msg="https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=878494a8-6ab7-4694-96ac-fc89a2afcbe1&object_name=jobs/bfbd05a0-eff5-4275-a0db-7adc9d9bab66.json failed, status:error, message:invalid param value:record not found" func=HttpRequest file="restful.go:154"
time="2024-03-04 15:04:43.921" level=error msg="https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=878494a8-6ab7-4694-96ac-fc89a2afcbe1&object_name=jobs/bfbd05a0-eff5-4275-a0db-7adc9d9bab66.json failed, status:error, message:invalid param value:record not found" func=HttpGet file="restful.go:64"
time="2024-03-04 15:04:46.064" level=info msg="file name:1_bfbd05a0-eff5-4275-a0db-7adc9d9bab66.json, chunk size:712" func=func1 file="file.go:248"
time="2024-03-04 15:04:48.209" level=info msg="jobuuid: 88911b11-5f3a-46a8-bb70-5efa55afdbf4 successfully submitted to IPFS" func=submitJob file="cp_service.go:155"
time="2024-03-04 15:04:48.471" level=info msg="submit job detail: {UUID:88911b11-5f3a-46a8-bb70-5efa55afdbf4 Name:Job-88911b11-5f3a-46a8-bb70-5efa55afdbf4 Status:submitted Duration:3600 JobSourceURI:https://api.lagrangedao.org/spaces/201d7532-8284-4ebd-b348-ff5ce862beda JobResultURI:https://7d67303d2964.acl.swanipfs.com/ipfs/QmWA4aFGaH7FrBA9qMNgcX2ZA9rvW7ywH6uAVuDTPZmYeB StorageSource:lagrange TaskUUID:7db9916e-397f-493d-b20f-0e5688042897 CreatedAt:1709561082 UpdatedAt:1709561082 BuildLog:wss://log.bitstakehaven.com:8085/api/v1/computing/lagrange/spaces/log?space_id=201d7532-8284-4ebd-b348-ff5ce862beda&type=build ContainerLog:wss://log.bitstakehaven.com:8085/api/v1/computing/lagrange/spaces/log?space_id=201d7532-8284-4ebd-b348-ff5ce862beda&type=container}" func=ReceiveJob file="cp_service.go:122"
[GIN] 2024/03/04 - 15:04:48 | 200 |  6.075421328s |   38.104.153.43 | POST     "/api/v1/computing/lagrange/jobs"

init still broken

rm private_key
computing-provider init --ownerAddress 0xfe017Ff8F0C7349845Ab52E58FcA96143f2c4981 --beneficiaryAddress 0x269EBeee083CE6f70486a67dC8036A889bF322A9
Contract deployed! Address: 0xa1468F15CF82f939ed0A96b5A89EFCf92021f1da
Transaction hash: 0x0d0ab9afe5caab575eb02c84bfd8698bc45c466ac940f5915c9a5bb22370b20c
Error: register cp to ubi hub failed, error: cpAccount client create GetCpAccountInfo tx error: no contract code at given address

collateral still inaccurate

HUB info for beneficiary: 8.65 sETH
HUB info for owner: 1.49 sETH
computing-provider info: Collateral(SWAN-ETH): 0.95000

Normalnoise commented 4 months ago

@sonic-chain please track this issue

sonic-chain commented 4 months ago

Solution:

ThomasBlock commented 4 months ago

Solution:

  • ubi task not working:

    • Update resource-exporter. This component detects the machine's CPU vendor (AMD or Intel) and pulls the matching ubi-task image.

    • kubectl delete ds -n kube-system resource-exporter-ds

    • docker rmi -f filswan/resource-exporter:v11.2.5

    • Reinstall resource-exporter. Refer to: [install-the-hardware-resource-exporter]

  • error 400 and Error building Docker image:

    • Check whether the space contains a Dockerfile, and whether the Dockerfile format is correct
    • Check whether the user running the CP can run docker commands normally
  • init still broken:

    • There is a delay between the contract being deployed and it being synchronized to the chain. The server has a retry mechanism for fetching the CP information from the contract. If you can receive ubi tasks, you can ignore this error. @ThomasBlock
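
As an aside, that "no contract code at given address" condition can be checked by hand with an eth_getCode JSON-RPC call: it returns "0x" until the deployment has synchronized, and the contract bytecode afterwards. A sketch only; the RPC endpoint below is a placeholder, and the address is the one printed by the init command in the report above.

```shell
# Placeholder RPC endpoint; substitute your chain's actual RPC URL.
RPC_URL="https://example-swan-rpc.invalid"
# CpAccount contract address printed by `computing-provider init` above.
ADDR="0xa1468F15CF82f939ed0A96b5A89EFCf92021f1da"

# eth_getCode returns "0x" while the code is not yet visible on chain;
# any longer hex string means the CP's retry should eventually succeed.
PAYLOAD=$(printf '{"jsonrpc":"2.0","method":"eth_getCode","params":["%s","latest"],"id":1}' "$ADDR")
echo "$PAYLOAD"
# Uncomment to actually query (needs network access):
# curl -s -X POST -H 'Content-Type: application/json' -d "$PAYLOAD" "$RPC_URL"
```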

Thank you for the reply. I don't want to be rude, but it always looks like you reply to me with "standard phrases" and don't really look into the problem. It is also very slow. It looks like this whole team is really uninterested, or overworked!

I have been running this for several weeks. Of course I know that Docker runs.

I updated resource-exporter on all nodes. The ubi task is still broken:

time="2024-03-05 09:55:31.728" level=info msg="receive ubi task received: {ID:29710 Name:1000-0-7-19330 Type:0 ZkType:fil-c2-512M InputParam:https://download.fogmetalabs.com/ipfs/QmT7Q8AWBziSbWY8XMZt964tqP48QZncMEvADUhkXRbXPP Signature:0x130502f6fb2adf3d88ab8c3751274b44388b6b959085e86608c4594ff50abcfd13d0b52ccd0f6f212a208f095bd148960399673f34435137b9851b782cc96dd701 Resource:0xc000b078c0}" func=DoUbiTask file="cp_service.go:563"
time="2024-03-05 09:55:31.729" level=info msg="ubi task sign verifing, task_id: 29710, type: fil-c2-512M, verify: true" func=DoUbiTask file="cp_service.go:603"
time="2024-03-05 09:55:31.762" level=info msg="checkResourceAvailableForUbi: needCpu: 1, needMemory: 5.00, needStorage: 1.00" func=checkResourceAvailableForUbi file="cp_service.go:1302"
time="2024-03-05 09:55:31.762" level=info msg="checkResourceAvailableForUbi: remainingCpu: 4, remainingMemory: 13.00, remainingStorage: 353.00" func=checkResourceAvailableForUbi file="cp_service.go:1303"
[GIN] 2024/03/05 - 09:55:31 | 200 |   33.692995ms |   38.104.153.43 | POST     "/api/v1/computing/cp/ubi"
time="2024-03-05 09:55:31.770" level=error msg="Failed creating ubi task job: Job.batch \"fil-c2-512m-29710\" is invalid: spec.template.spec.containers[0].image: Required value" func=func1 file="cp_service.go:845"

Invalid reference format: I posted all the logs. Please tell me if the space itself is broken; I think it is okay, as I ran it in the past. Can you recommend a "good" space so that we can really test the GPU?
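
Editor's note: the "invalid reference format" error can usually be reproduced locally before involving the CP at all. Notably, the image path in the log above (build/0x7B0CEe…/spaces/myDiffusion) contains uppercase characters, and Docker repository names must be lowercase, which by itself is enough to trigger this daemon error. A minimal sketch of local checks, assuming the space is unpacked into a local directory (the path and Dockerfile here are made up for illustration):

```shell
# Hypothetical build directory standing in for the CP's per-space path.
SPACE_DIR=./demo-space
mkdir -p "$SPACE_DIR"
printf 'FROM alpine:3.19\nEXPOSE 7860\n' > "$SPACE_DIR/Dockerfile"

# 1) The space must contain a Dockerfile at all; a missing file is what
#    produces "unable to open Dockerfile" later in the deploy step.
test -f "$SPACE_DIR/Dockerfile" && echo "Dockerfile present"

# 2) The tag the CP builds must be a valid Docker reference (lowercase,
#    no spaces); an uppercase or empty tag gives "invalid reference format".
TAG="build/0x7b0cee/spaces/mydiffusion:latest"
echo "$TAG" | grep -Eq '^[a-z0-9][a-z0-9./_-]*(:[a-zA-Z0-9._-]+)?$' \
  && echo "tag looks valid"

# 3) The CP user must be able to reach the Docker daemon without sudo.
docker info >/dev/null 2>&1 && echo "docker reachable" || echo "docker NOT reachable"
```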

@sonic-chain @Normalnoise ?

Normalnoise commented 3 months ago

time="2024-03-05 09:55:31.770" level=error msg="Failed creating ubi task job: Job.batch \"fil-c2-512m-29710\" is invalid: spec.template.spec.containers[0].image: Required value" func=func1 file="cp_service.go:845"

@ThomasBlock this issue is likely because your CP is still using the old version of resource-exporter, so you need to re-install it following these steps:

Normalnoise commented 3 months ago

I updated resource-exporter on all nodes. The ubi task is still broken.

Please give me your K8s cluster information: how many nodes it has, and details for the master and each node (GPU type, CPU type).

sonic-chain commented 3 months ago

Please execute the following command in your k8s cluster and post the log:

kubectl logs --tail=1 -fn kube-system resource-exporter-ds-xxx 
ThomasBlock commented 3 months ago

I updated resource-exporter on all nodes. The ubi task is still broken.

Please give me your K8s cluster information: how many nodes it has, and details for the master and each node (GPU type, CPU type).

kubectl get nodes
NAME    STATUS   ROLES           AGE   VERSION
swan1   Ready    control-plane   47d   v1.28.2
swan3   Ready    <none>          42d   v1.28.2
swan5   Ready    <none>          40d   v1.28.2
swan6   Ready    <none>          16d   v1.28.2
swan7   Ready    <none>          15d   v1.28.2

swan1 = CPU Ryzen 7, master
swan3 = Ryzen 7 + 4090
swan5, swan6 = CPU Xeon
swan7 = Ryzen 7 + 3090

ThomasBlock commented 3 months ago

time="2024-03-05 09:55:31.770" level=error msg="Failed creating ubi task job: Job.batch "fil-c2-512m-29710" is invalid: spec.template.spec.containers[0].image: Required value" func=func1 file="cp_service.go:845"

@ThomasBlock this issue is likely because your CP is still using the old version of resource-exporter, so you need to re-install it following these steps:

By the way, it is really odd that you don't increment the version number; that is confusing. But I will do it again. The docker command is not working for me; is the following also okay?

kubectl delete ds -n kube-system resource-exporter-ds
daemonset.apps "resource-exporter-ds" deleted

docker rmi -f filswan/resource-exporter:v11.2.5
Error response from daemon: No such image: filswan/resource-exporter:v11.2.5

ctr -n k8s.io images remove docker.io/filswan/resource-exporter:v11.2.5
docker.io/filswan/resource-exporter:v11.2.5

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: kube-system
  name: resource-exporter-ds
  labels:
    app: resource-exporter
spec:
  selector:
    matchLabels:
      app: resource-exporter
  template:
    metadata:
      labels:
        app: resource-exporter
    spec:
      containers:
      - name: resource-exporter
        image: filswan/resource-exporter:v11.2.5
        imagePullPolicy: IfNotPresent
EOF
daemonset.apps/resource-exporter-ds created
ThomasBlock commented 3 months ago

Please execute the following command in your k8s cluster and post the log:

kubectl logs --tail=1 -fn kube-system resource-exporter-ds-xxx 
kubectl logs --tail=1 -fn kube-system resource-exporter-ds-6lssn
{"gpu":{"driver_version":"535.161.07","cuda_version":"12020","attached_gpus":1,"details":[{"product_name":"NVIDIA 4090","fb_memory_usage":{"total":"24564 MiB","used":"347 MiB","free":"24216 MiB"},"bar1_memory_usage":{"total":"32768 MiB","used":"2 MiB","free":"32765 MiB"}}]},"cpu_name":"AMD"}

kubectl logs --tail=1 -fn kube-system resource-exporter-ds-9lcxf 
If the node has a GPU, this error can be ignored. ERROR:: unable to initialize NVML: 12 

 kubectl logs --tail=1 -fn kube-system resource-exporter-ds-c48zq
{"gpu":{"driver_version":"535.161.07","cuda_version":"12020","attached_gpus":1,"details":[{"product_name":"NVIDIA 3090","fb_memory_usage":{"total":"24576 MiB","used":"317 MiB","free":"24258 MiB"},"bar1_memory_usage":{"total":"256 MiB","used":"2 MiB","free":"253 MiB"}}]},"cpu_name":"AMD"}

kubectl logs --tail=1 -fn kube-system resource-exporter-ds-k8xfq
If the node has a GPU, this error can be ignored. ERROR:: unable to initialize NVML: 12

kubectl logs --tail=1 -fn kube-system resource-exporter-ds-rnwxz
If the node has a GPU, this error can be ignored. ERROR:: unable to initialize NVML: 12 
sonic-chain commented 3 months ago

The resource-exporter pod on one of the nodes in your cluster cannot obtain resource information and reports an error. You can set the RUST_GPU_TOOLS_CUSTOM_GPU content to the 3090 or 4090 in the fil-c2.env file to let k8s schedule the task onto the healthy nodes.
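
For concreteness, a sketch of what that entry can look like (the file location depends on your CP install; the GPU names must match what the driver reports, and the numbers are the standard CUDA core counts for these cards):

```shell
# Sketch of a fil-c2.env entry pinning the C2 proof to specific GPUs.
# Format is "<driver product name>:<CUDA core count>".
cat > fil-c2.env <<'EOF'
RUST_GPU_TOOLS_CUSTOM_GPU="NVIDIA GeForce RTX 3090:10496,NVIDIA GeForce RTX 4090:16384"
EOF
grep RUST_GPU_TOOLS_CUSTOM_GPU fil-c2.env
```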

sonic-chain commented 3 months ago

I fixed the situation where resource-exporter could not be retrieved without nvidia-plugin installed. I would like to trouble you to upgrade the version to filswan/resource-exporter:v11.2.6 again.

ThomasBlock commented 3 months ago

I fixed the situation where resource-exporter could not be retrieved without nvidia-plugin installed. I would like to trouble you to upgrade the version to filswan/resource-exporter:v11.2.6 again.


kubectl delete ds -n kube-system resource-exporter-ds
daemonset.apps "resource-exporter-ds" deleted

5x (run once on each of the 5 nodes):
ctr -n k8s.io images remove docker.io/filswan/resource-exporter:v11.2.5

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: kube-system
  name: resource-exporter-ds
  labels:
    app: resource-exporter
spec:
  selector:
    matchLabels:
      app: resource-exporter
  template:
    metadata:
      labels:
        app: resource-exporter
    spec:
      containers:
      - name: resource-exporter
        image: filswan/resource-exporter:v11.2.6
        imagePullPolicy: IfNotPresent
EOF
kubectl logs --tail=1 -fn kube-system resource-exporter-ds-5r2rv 
{"gpu":{"driver_version":"535.161.07","cuda_version":"12020","attached_gpus":1,"details":[{"product_name":"NVIDIA 3090","fb_memory_usage":{"total":"24576 MiB","used":"320 MiB","free":"24255 MiB"},"bar1_memory_usage":{"total":"256 MiB","used":"3 MiB","free":"252 MiB"}}]},"cpu_name":"AMD"}

kubectl logs --tail=1 -fn kube-system resource-exporter-ds-jdpn4
{"gpu":{"driver_version":"535.161.07","cuda_version":"12020","attached_gpus":1,"details":[{"product_name":"NVIDIA 4090","fb_memory_usage":{"total":"24564 MiB","used":"347 MiB","free":"24216 MiB"},"bar1_memory_usage":{"total":"32768 MiB","used":"2 MiB","free":"32765 MiB"}}]},"cpu_name":"AMD"}

kubectl logs --tail=1 -fn kube-system resource-exporter-ds-9dg2z
{"gpu":{"driver_version":"","cuda_version":"","attached_gpus":0,"details":null},"cpu_name":""}
The node not found nvm libnvidia, if the node does not have a GPU, this error can be ignored.

kubectl logs --tail=1 -fn kube-system resource-exporter-ds-ljzbs
{"gpu":{"driver_version":"","cuda_version":"","attached_gpus":0,"details":null},"cpu_name":"INTEL"}
The node not found nvm libnvidia, if the node does not have a GPU, this error can be ignored.

kubectl logs --tail=1 -fn kube-system resource-exporter-ds-x2jqq
{"gpu":{"driver_version":"","cuda_version":"","attached_gpus":0,"details":null},"cpu_name":""}

Some don't report a CPU, but this might make sense: I checked, and some of them are virtual machines with the "generic" Proxmox CPU type. Some are type "host"; there we can see "cpu_name":"INTEL" and "cpu_name":"AMD".
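
This can be verified from inside each guest (assuming an x86 Linux VM; plain /proc/cpuinfo inspection, nothing CP-specific): with a generic QEMU CPU model the vendor string can be masked, while Proxmox CPU type "host" passes the real one through.

```shell
# The exporter's cpu_name is derived from the CPU vendor string; an
# empty or masked vendor matches the empty "cpu_name" entries above.
grep -m1 'vendor_id' /proc/cpuinfo   # expect GenuineIntel or AuthenticAMD
grep -m1 'model name' /proc/cpuinfo
```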

ThomasBlock commented 3 months ago

The resource-exporter pod on one of the nodes in your cluster cannot obtain resource information and reports an error. You can set the RUST_GPU_TOOLS_CUSTOM_GPU content to the 3090 or 4090 in the fil-c2.env file to let k8s schedule the task onto the healthy nodes.

OK, thank you, I did this. Will this help me get 32G ubi tasks? Until now I only receive 512M tasks.

RUST_GPU_TOOLS_CUSTOM_GPU="NVIDIA GeForce RTX 3090:10496,NVIDIA GeForce RTX 4090:16384"

sonic-chain commented 3 months ago

There are only 512M tasks now; the 32G tasks are gone for the moment, but there will be more in the future.

sonic-chain commented 3 months ago

Some don't report a CPU, but this might make sense: I checked, and some of them are virtual machines with the "generic" Proxmox CPU type. Some are type "host"; there we can see "cpu_name":"INTEL" and "cpu_name":"AMD".

ubi-task needs to pull different docker images according to different CPU architectures. Currently, it only supports intel and amd. Other architectures are not supported for the time being.
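
A minimal sketch of that selection logic (not the CP's actual code; the image tags are hypothetical), which also shows why an empty cpu_name ends in the "image: Required value" Job error seen earlier:

```shell
# Hypothetical mapping from the exporter's cpu_name to a ubi-task image.
choose_image() {
  case "$1" in
    INTEL) echo "filswan/ubi-worker:intel" ;;  # hypothetical tag
    AMD)   echo "filswan/ubi-worker:amd" ;;    # hypothetical tag
    *)     echo "" ;;  # unknown/empty vendor -> no image -> invalid Job spec
  esac
}

choose_image AMD   # -> filswan/ubi-worker:amd
choose_image ""    # -> empty: the Job would be created with no container image
```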

ThomasBlock commented 3 months ago


ubi-task needs to pull different docker images according to different CPU architectures. Currently, it only supports intel and amd. Other architectures are not supported for the time being.

Okay, fine. We have now clarified that VMs should use the "host" processor type. Any further comments on my setup?

My ubi tasks are working now and proofs are submitted on-chain. But in the last week there have been no payouts, for me and for other users.

(screenshots attached)

Normalnoise commented 1 month ago

If you have not received the rewards, there may be several reasons: