swanchain / go-computing-provider

A golang implementation of computing provider
MIT License

no new tasks / NodeId mismatch #27

Closed: ThomasBlock closed this issue 3 months ago

ThomasBlock commented 4 months ago

Today the network looks very idle.

Until yesterday I could assign jobs to my compute provider via Lagrange, but today it no longer works. Is there a general problem with Lagrange?

I changed the variable as requested (now the Collateral(SWAN-ETH) is quite static and no longer changes): SWAN_COLLATERAL_CONTRACT="0xdc200f89258e72aC3602dD23BD3642C4bd4eE34e"

UBI tasks and connectivity are good:

13932   GPU         fil-c2-512M 0xd116ba19aff44d03bb506094fd685b34906ef327f72948ff04f72eb18938fbc3  success 10.00   2024-02-24 09:00:11 
13942   GPU         fil-c2-512M 0x1e34babe08e1efec8a25282cab926bcf8022ecb0e8ba9d6284871cbcfb08deb3  success 10.00   2024-02-24 11:00:11 
13952   GPU         fil-c2-512M 0x74eafb3e7e8d79e826536149e877e63135603c603ce2785c4ff6fb923ca79b48  success 10.00   2024-02-24 13:00:11 
14063   GPU         fil-c2-512M 0xb99b653de0c88fcfcf44efd92e82ce60bdaea64c221fd3a62a4fbe8333492690  success 10.00   2024-02-24 15:00:11 
14073   GPU         fil-c2-512M 0x23be663d4797ae8720cbad75e486dd57e89ff2c01d7441c1b90df282d7062102  success 10.00   2024-02-24 17:00:11 
14083   GPU         fil-c2-512M 0x7a2ba814bd225d3ed4d1b41da066b4332f97472b3d5d6d8c4b959da658dcec75  success 0.0     2024-02-24 19:00:11 
[GIN] 2024/02/24 - 20:27:48 | 200 |       18.71µs |   38.104.153.43 | GET      "/api/v1/computing/host/info"
[GIN] 2024/02/24 - 20:29:48 | 200 |      15.329µs |   38.104.153.43 | GET      "/api/v1/computing/host/info"
[GIN] 2024/02/24 - 20:31:48 | 200 |       11.01µs |   38.104.153.43 | GET      "/api/v1/computing/host/info"
[GIN] 2024/02/24 - 20:33:48 | 200 |       11.95µs |   38.104.153.43 | GET      "/api/v1/computing/host/info"
[GIN] 2024/02/24 - 20:35:48 | 200 |       15.06µs |   38.104.153.43 | GET      "/api/v1/computing/host/info"
[GIN] 2024/02/24 - 20:37:49 | 200 |        20.1µs |   38.104.153.43 | GET      "/api/v1/computing/host/info"
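
The task rows above are easier to audit mechanically than by eye. In particular, the last row reports `success` but a value of 0.0 where the earlier rows show 10.00. The following is a minimal sketch of such a check, assuming the whitespace-separated columns are task id, resource type, ZK type, transaction hash, status, reward, date, and time. Those column names are guesses from the values shown here, not taken from the tool's documentation, and `parse_task_row` and `unpaid_successes` are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class UbiTask:
    task_id: int
    resource_type: str  # assumed meaning of the "GPU" column
    zk_type: str        # assumed meaning of the "fil-c2-512M" column
    tx_hash: str
    status: str
    reward: float       # assumed meaning of the "10.00" / "0.0" column
    created: str


def parse_task_row(row: str) -> UbiTask:
    """Split one whitespace-separated task row into its eight fields."""
    parts = row.split()
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}: {row!r}")
    task_id, resource_type, zk_type, tx_hash, status, reward, date, time = parts
    return UbiTask(int(task_id), resource_type, zk_type, tx_hash,
                   status, float(reward), f"{date} {time}")


def unpaid_successes(rows: Iterable[str]) -> List[UbiTask]:
    """Return tasks whose status is 'success' but whose reward is zero."""
    tasks = [parse_task_row(r) for r in rows if r.strip()]
    return [t for t in tasks if t.status == "success" and t.reward == 0.0]
```

Feeding the six rows above to `unpaid_successes` would single out task 14083, the only successful row with a zero in the assumed reward column.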

I have seen this error in the info output: NodeId mismatch, local node id: 044dd69a713349917f376a77ba832accaa60422f5d70484f2c82d62a87b42c81d1eab1113cfa188e350df03e8cf734a67d14611953f9b92353911b68e45bc27d3d, chain node id: .

So I redeployed, but I got an error message there as well:

computing-provider init --ownerAddress 0xfe017Ff8F0C7349845Ab52E58FcA96143f2c4981 --beneficiaryAddress **
Contract deployed! Address: 0x1f0a504782b952fC1aEbC50cE9eD61E6e2cC7EE1
Transaction hash: 0xa54102ff8bd36d60c9c6979e67cbe22538d32852ea52750aa449f34f403d75e4
Error: register cp to ubi hub failed

Now the info output looks good, but there are still no new jobs:

computing-provider info
Owner:                  0xfe017Ff8F0C7349845Ab52E58FcA96143f2c4981                                                                                         
Contract Address:       0x1f0a504782b952fC1aEbC50cE9eD61E6e2cC7EE1                                                                                          
Multi-Address:          /ip4/**/tcp/8085                                                                                                        
Name:                   ThomasBlock.io-Test                                                                                                                 
Node ID:                044dd69a713349917f376a77ba832accaa60422f5d70484f2c82d62a87b42c81d1eab1113cfa188e350df03e8cf734a67d14611953f9b92353911b68e45bc27d3d  
Domain:                 **.com                                                                                                                  
Running deployments:    14                                                                                                                                  
Available(SWAN-ETH):    0.98691                                                                                                                             
Collateral(SWAN-ETH):   0.95000                                                                                                                             
UBI FLAG:               Accept                                                                                                                              
Beneficiary Address:    **   
ThomasBlock commented 4 months ago

It's also noteworthy that the tasks are no longer refundable. They are "completed" after just a few minutes.

ThomasBlock commented 4 months ago

@Normalnoise

Still no new jobs...

Normalnoise commented 4 months ago
Error: register cp to ubi hub failed

This error means that you have not reported your CP info to the ubi-engine, so you cannot get UBI tasks. You must re-init the CP account and make sure no error occurs.

Normalnoise commented 4 months ago

It's also noteworthy that the tasks are no longer refundable. They are "completed" after just a few minutes.

@ThomasBlock I have noticed this issue. It is a bug, and we need more time to fix it. If you need more SWAN tokens, please let me know. The refund function will be fixed as soon as possible.

ThomasBlock commented 4 months ago
Error: register cp to ubi hub failed

This error means that you have not reported your CP info to the ubi-engine, so you cannot get UBI tasks. You must re-init the CP account and make sure no error occurs.

So what should I do? I ran it again and got the same error:

computing-provider init --ownerAddress 0xfe017Ff8F0C7349845Ab52E58FcA96143f2c4981 --beneficiaryAddress 0x269EBeee083CE6f70486a67dC8036A889bF322A9
Contract deployed! Address: 0x8A878316d185a05edF4A63E92B81737d807E8762
Transaction hash: 0x33d1550790fe4f6c2f4aae2f61c77ef4148690d173a477f63675a8e02957cd8a
Error: register cp to ubi hub failed

It really seems like there is a general problem with the hub, as other Discord members report. Can you check that?

Normalnoise commented 4 months ago

you can try it again:

  • delete the $CP_PATH/privateKey
  • re-init computing-provider account to ensure there is no error.

ThomasBlock commented 4 months ago

you can try it again:

  • delete the $CP_PATH/privateKey
  • re-init computing-provider account to ensure there is no error.

I have now tried that 4 times. Every time I see Error: register cp to ubi hub failed

Is there anything else that needs to be done after the collateral change? In my config file I wrote SWAN_COLLATERAL_CONTRACT="0xdc200f89258e72aC3602dD23BD3642C4bd4eE34e"

But the collateral differs between the hub and my software, so where could the problem be?

Screenshot from 2024-02-27 10-37-18

Screenshot from 2024-02-27 10-37-03

Screenshot from 2024-02-27 10-35-57

ThomasBlock commented 4 months ago

Ha, 2 hours ago something changed in the network: my tasks jumped from 10 to 110. Collateral is still 0.95000, but I guess something good happened on the network!

This seems to be the case across the whole network.

On the other hand, this is bad: all the CPU cores are now blocked, so we still cannot lease GPUs.

Normalnoise commented 4 months ago

We have fixed some issues. Can you try again to lease your GPU?

ThomasBlock commented 4 months ago

We have fixed some issues. Can you try again to lease your GPU?

Thank you for the update. Yes, something changed: the spam tasks disappeared recently, and I could start the task on Lagrange. Now there are new errors. I have seen this for two different deployments, both of which worked earlier.

Error building Docker image: Error response from daemon: invalid reference format
Failed to extract exposed port: unable to open Dockerfile: open : no such file or directory

time="2024-02-28 18:35:04.312" level=info msg="Job received Data: {UUID:cba67825-22b6-4b98-9aee-c70a50662303 Name:Job-cba67825-22b6-4b98-9aee-c70a50662303 Status:Submitted Duration:3600 JobSourceURI:https://api.lagrangedao.org/spaces/201d7532-8284-4ebd-b348-ff5ce862beda JobResultURI: StorageSource:lagrange TaskUUID:12e1f0a6-aece-4f4a-a3bc-da2e66ffc782 CreatedAt:1709141704 UpdatedAt:1709141704 BuildLog: ContainerLog:}" func=ReceiveJob file="cp_service.go:79"
time="2024-02-28 18:35:05.894" level=info msg="checkResourceAvailableForSpace: needCpu: 8, needMemory: 16.00, needStorage: 20.00" func=checkResourceAvailableForSpace file="cp_service.go:1210"
time="2024-02-28 18:35:05.894" level=info msg="checkResourceAvailableForSpace: remainingCpu: 2, remainingMemory: 16.00, remainingStorage: 348.00" func=checkResourceAvailableForSpace file="cp_service.go:1211"
time="2024-02-28 18:35:05.894" level=info msg="checkResourceAvailableForSpace: needCpu: 8, needMemory: 16.00, needStorage: 20.00" func=checkResourceAvailableForSpace file="cp_service.go:1210"
time="2024-02-28 18:35:05.894" level=info msg="checkResourceAvailableForSpace: remainingCpu: 10, remainingMemory: 53.00, remainingStorage: 1568.00" func=checkResourceAvailableForSpace file="cp_service.go:1211"
time="2024-02-28 18:35:05.894" level=info msg="gpuName: NVIDIA-4090, nodeGpu: map[:0 kubernetes.io/os:0], nodeGpuSummary: map[swan2:map[NVIDIA-A4000:1] swan3:map[NVIDIA-4090:1] swan7:map[NVIDIA-3090:1] swan8:map[NVIDIA-A6000:1]]" func=checkResourceAvailableForSpace file="cp_service.go:1217"
time="2024-02-28 18:35:05.894" level=info msg="checkResourceAvailableForSpace: needCpu: 8, needMemory: 16.00, needStorage: 20.00" func=checkResourceAvailableForSpace file="cp_service.go:1210"
time="2024-02-28 18:35:05.894" level=info msg="checkResourceAvailableForSpace: remainingCpu: 23, remainingMemory: 65.00, remainingStorage: 1593.00" func=checkResourceAvailableForSpace file="cp_service.go:1211"
time="2024-02-28 18:35:05.894" level=info msg="gpuName: NVIDIA-4090, nodeGpu: map[:0 kubernetes.io/os:0], nodeGpuSummary: map[swan2:map[NVIDIA-A4000:1] swan3:map[NVIDIA-4090:1] swan7:map[NVIDIA-3090:1] swan8:map[NVIDIA-A6000:1]]" func=checkResourceAvailableForSpace file="cp_service.go:1217"
time="2024-02-28 18:35:05.895" level=info msg="submitting job..." func=submitJob file="cp_service.go:124"
time="2024-02-28 18:35:05.895" level=info msg="uploading file to bucket, objectName: jobs/c094542f-3962-4633-94be-2963439c8165.json, filePath: /tmp/jobs/c094542f-3962-4633-94be-2963439c8165.json" func=UploadFileToBucket file="storage_service.go:52"
time="2024-02-28 18:35:06.808" level=info msg="uuid: 201d7532-8284-4ebd-b348-ff5ce862beda, spaceName: myDiffusion, hardwareName: Nvidia 4090 · 8 vCPU · 16 GiB" func=DeploySpaceTask file="cp_service.go:1019"
time="2024-02-28 18:35:07.013" level=error msg="http status: 400 Bad Request, code:400, url:https://api.multichain.storage/api/v2/oss_file/get_file_by_object_name?bucket_uid=878494a8-6ab7-4694-96ac-fc89a2afcbe1&object_name=jobs/c094542f-3962-4633-94be-2963439c8165.json" func=HttpRequest file="restful.go:127"
time="2024-02-28 18:35:07.013" level=error msg="https://api.multichain.storage/api/v2/oss_file/get_file_by_object_name?bucket_uid=878494a8-6ab7-4694-96ac-fc89a2afcbe1&object_name=jobs/c094542f-3962-4633-94be-2963439c8165.json failed, status:error, message:invalid param value:record not found" func=HttpRequest file="restful.go:154"
time="2024-02-28 18:35:07.013" level=error msg="https://api.multichain.storage/api/v2/oss_file/get_file_by_object_name?bucket_uid=878494a8-6ab7-4694-96ac-fc89a2afcbe1&object_name=jobs/c094542f-3962-4633-94be-2963439c8165.json failed, status:error, message:invalid param value:record not found" func=HttpGet file="restful.go:64"
time="2024-02-28 18:35:07.013" level=error msg="https://api.multichain.storage/api/v2/oss_file/get_file_by_object_name?bucket_uid=878494a8-6ab7-4694-96ac-fc89a2afcbe1&object_name=jobs/c094542f-3962-4633-94be-2963439c8165.json failed, status:error, message:invalid param value:record not found" func=GetFile file="file.go:56"
time="2024-02-28 18:35:07.167" level=info msg="Download 201d7532-8284-4ebd-b348-ff5ce862beda successfully." func=BuildSpaceTaskImage file="buildspace.go:33"
time="2024-02-28 18:35:07.278" level=info msg="Download 201d7532-8284-4ebd-b348-ff5ce862beda successfully." func=BuildSpaceTaskImage file="buildspace.go:33"
time="2024-02-28 18:35:07.391" level=info msg="Download 201d7532-8284-4ebd-b348-ff5ce862beda successfully." func=BuildSpaceTaskImage file="buildspace.go:33"
2024/02/28 18:35:07 Image path: build/0x7B0CEe1939a4AdA062EC79f4862a42C1F47B1806/spaces/myDiffusion
time="2024-02-28 18:35:07.392" level=error msg="Error building Docker image: Error response from daemon: invalid reference format" func=BuildImagesByDockerfile file="buildspace.go:80"
time="2024-02-28 18:35:07.392" level=info msg="Failed to extract exposed port: unable to open Dockerfile: open : no such file or directory" func=DockerfileToK8s file="deploy.go:91"
time="2024-02-28 18:35:08.303" level=info msg="file name:1_c094542f-3962-4633-94be-2963439c8165.json, chunk size:712" func=func1 file="file.go:217"
time="2024-02-28 18:35:09.313" level=info msg="Delete redis keys finished, keys: [FULL:201d7532-8284-4ebd-b348-ff5ce862beda]" func=1 file="task_service.go:286"
time="2024-02-28 18:35:10.480" level=info msg="jobuuid: cba67825-22b6-4b98-9aee-c70a50662303 successfully submitted to IPFS" func=submitJob file="cp_service.go:152"
time="2024-02-28 18:35:10.743" level=info msg="submit job detail: {UUID:cba67825-22b6-4b98-9aee-c70a50662303 Name:Job-cba67825-22b6-4b98-9aee-c70a50662303 Status:submitted Duration:3600 JobSourceURI:https://api.lagrangedao.org/spaces/201d7532-8284-4ebd-b348-ff5ce862beda JobResultURI:https://7d67303d2964.acl.multichain.storage/ipfs/QmWuN7LhTa2Bw6Fg3jwatUZVjd3D1a42haJAuSzMu5FVeK StorageSource:lagrange TaskUUID:12e1f0a6-aece-4f4a-a3bc-da2e66ffc782 CreatedAt:1709141704 UpdatedAt:1709141705 BuildLog:wss://log.bitstakehaven.com:8085/api/v1/computing/lagrange/spaces/log?space_id=201d7532-8284-4ebd-b348-ff5ce862beda&type=build ContainerLog:wss://log.bitstakehaven.com:8085/api/v1/computing/lagrange/spaces/log?space_id=201d7532-8284-4ebd-b348-ff5ce862beda&type=container}" func=ReceiveJob file="cp_service.go:119"
[GIN] 2024/02/28 - 18:35:10 | 200 |  6.431254455s |   38.104.153.43 | POST     "/api/v1/computing/lagrange/jobs"
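
The `checkResourceAvailableForSpace` lines in the log above come in pairs: a `needCpu/needMemory/needStorage` line followed by a `remainingCpu/remainingMemory/remainingStorage` line, apparently one pair per node. The following minimal sketch pairs them up and reports which nodes would have enough headroom. It assumes a simple "remaining >= needed" comparison on all three values, which is a guess and not necessarily what `cp_service.go` does. `node_fits` is a hypothetical helper name.

```python
import re
from typing import Iterable, List, Optional, Tuple

NEED_RE = re.compile(
    r"needCpu: (\d+), needMemory: ([\d.]+), needStorage: ([\d.]+)")
REMAINING_RE = re.compile(
    r"remainingCpu: (\d+), remainingMemory: ([\d.]+), remainingStorage: ([\d.]+)")


def node_fits(log_lines: Iterable[str]) -> List[bool]:
    """Return one bool per need/remaining pair found in the log lines.

    True means remaining CPU, memory and storage all cover the request
    (assumed rule: remaining >= needed for every resource).
    """
    results: List[bool] = []
    need: Optional[Tuple[int, float, float]] = None
    for line in log_lines:
        m = NEED_RE.search(line)
        if m:
            need = (int(m.group(1)), float(m.group(2)), float(m.group(3)))
            continue
        m = REMAINING_RE.search(line)
        if m and need is not None:
            remaining = (int(m.group(1)), float(m.group(2)), float(m.group(3)))
            results.append(all(r >= n for r, n in zip(remaining, need)))
            need = None  # each need line is consumed by one remaining line
    return results
```

Under that assumed rule, the three pairs in this log (8 CPUs needed against 2, 10, and 23 remaining) would give one node without enough CPU and two with enough, which is consistent with the job being submitted rather than rejected for lack of resources.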

Minesweeper, on the other hand, deploys but then crashes in Kubernetes:

Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  2m5s                default-scheduler  Successfully assigned ns-0x7b0cee1939a4ada062ec79f4862a42c1f47b1806/deploy-5d9dc1e2-877b-4e74-8504-03bf195e1af0-6b4dd877cc-kfdmp to swan3
  Normal   Pulled     29s (x5 over 2m6s)  kubelet            Container image "creepto/minesweeper" already present on machine
  Normal   Created    29s (x5 over 2m6s)  kubelet            Created container 5d9dc1e2-877b-4e74-8504-03bf195e1af0-minesweeper
  Warning  Failed     28s (x5 over 2m6s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown
  Warning  BackOff  15s (x10 over 2m4s)  kubelet  Back-off restarting failed container 5d9dc1e2-877b-4e74-8504-03bf195e1af0-minesweeper in pod deploy-5d9dc1e2-877b-4e74-8504-03bf195e1af0-6b4dd877cc-kfdmp_ns-0x7b0cee1939a4ada062ec79f4862a42c1f47b1806(0896aa90-c73a-4ba9-bc31-cd825a117ebc)

Is there anything I need to do on my side, such as updating components?
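
The `nvml error: driver/library version mismatch` in the events above usually means that the NVIDIA kernel module loaded on the node and the user-space NVIDIA libraries come from different driver versions, for example after a driver package upgrade without a reboot or module reload. The following minimal sketch compares the two. It assumes the kernel module version can be read from the text of `/proc/driver/nvidia/version` and the library version from the file name of a versioned `libnvidia-ml.so.<version>`. The helper names are hypothetical, and the library directory varies by distribution.

```python
import re
from typing import Iterable, Optional

# Matches e.g. "Kernel Module  535.154.05" and
# "Open Kernel Module for x86_64  535.154.05".
_KERNEL_RE = re.compile(r"Kernel Module(?: for \S+)?\s+(\d+(?:\.\d+)+)")
# Matches versioned names like "libnvidia-ml.so.535.154.05",
# but not the "libnvidia-ml.so" or "libnvidia-ml.so.1" symlinks.
_LIB_RE = re.compile(r"libnvidia-ml\.so\.(\d+(?:\.\d+)+)$")


def kernel_module_version(proc_text: str) -> Optional[str]:
    """Extract the driver version from /proc/driver/nvidia/version text."""
    m = _KERNEL_RE.search(proc_text)
    return m.group(1) if m else None


def library_version(filenames: Iterable[str]) -> Optional[str]:
    """Extract the version from the first versioned libnvidia-ml file name."""
    for name in filenames:
        m = _LIB_RE.search(name)
        if m:
            return m.group(1)
    return None


def versions_match(proc_text: str, filenames: Iterable[str]) -> bool:
    """True only if both versions were found and are identical."""
    kernel = kernel_module_version(proc_text)
    library = library_version(filenames)
    return kernel is not None and kernel == library
```

On the affected node (swan3 in the events above), one could pass in the contents of `/proc/driver/nvidia/version` and a listing of the directory holding `libnvidia-ml.so.*`. If the versions differ, rebooting the node or reloading the NVIDIA kernel modules is the usual remedy.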

Normalnoise commented 4 months ago

Because our storage service is under maintenance, Lagrange and space deployment are not available, but UBI tasks can be done normally.

Normalnoise commented 4 months ago

Once the maintenance is completed, there will be an announcement in the community Discord.

Normalnoise commented 3 months ago

Please upgrade to https://github.com/swanchain/go-computing-provider/releases/tag/v0.4.5