Closed. ThomasBlock closed this issue 6 months ago.
Is the resource-exporter image version filswan/resource-exporter:v11.2.6? Can you post the log of the resource-exporter pod on that machine?
Yes, I have v11.2.6.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
node1 Ready control-plane 3d1h v1.27.7
node2 Ready <none> 3d1h v1.27.7
node3 Ready <none> 3d1h v1.27.7
kubectl describe nodes | grep Taints
Taints: role=blocked:NoSchedule
Taints: <none>
Taints: <none>
I see this log in the CP, although I am root and we are using Kubernetes. What could be missing?
time="2024-05-02 20:46:29.818" level=error msg="Failed get image list, error: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" func=CleanResource file="docker_service.go:319"
node1: only ingress, tainted
node2: Intel, no GPU
kubectl logs resource-exporter-ds-4kxrk -n kube-system
The node not found nvm libnvidia, if the node does not have a GPU, this error can be ignored.
{"gpu":{"driver_version":"","cuda_version":"","attached_gpus":0,"details":null},"cpu_name":"INTEL"}
node 3: AMD + GPU
nvidia-smi -L
GPU 0: NVIDIA RTX A4000 (UUID: GPU-976b5f8c-fbec-09a2-5aae-1c0b6dffe3ce)
kubectl logs -n kube-system resource-exporter-ds-rjbhq
The node not found nvm libnvidia, if the node does not have a GPU, this error can be ignored.
{"gpu":{"driver_version":"","cuda_version":"","attached_gpus":0,"details":null},"cpu_name":"AMD"}
kubectl logs -n kube-system nvidia-device-plugin-daemonset-s852c
I0429 18:04:40.162448 1 main.go:178] Starting FS watcher.
I0429 18:04:40.162508 1 main.go:185] Starting OS watcher.
I0429 18:04:40.162742 1 main.go:200] Starting Plugins.
I0429 18:04:40.162749 1 main.go:257] Loading configuration.
I0429 18:04:40.162962 1 main.go:265] Updating config with default resource matching patterns.
I0429 18:04:40.163438 1 main.go:276]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"mpsRoot": "",
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"useNodeFeatureAPI": null,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0429 18:04:40.163450 1 main.go:279] Retrieving plugins.
W0429 18:04:40.163840 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0429 18:04:40.163876 1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0429 18:04:40.163905 1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0429 18:04:40.163913 1 factory.go:112] Incompatible platform detected
E0429 18:04:40.163916 1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0429 18:04:40.163920 1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0429 18:04:40.163923 1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0429 18:04:40.163930 1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0429 18:04:40.163935 1 main.go:308] No devices found. Waiting indefinitely.
The nvidia-device-plugin log above gives a hint: on this machine the NVIDIA environment has not been installed properly. Please reinstall it as the prompt suggests and use the official test method to verify. Refer to https://github.com/NVIDIA/k8s-device-plugin#quick-start
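For example, once the NVIDIA runtime is configured, the device plugin should advertise the GPU to Kubernetes. A quick check (a sketch only; node3 is the GPU node from the output above):
# the GPU should appear under the node's Capacity/Allocatable once the plugin registers it
kubectl describe node node3 | grep -i "nvidia.com/gpu"
# expected once everything works: nvidia.com/gpu: 1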
Yes, thank you. I could fix the GPU detection with:
nano /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd]
#default_runtime_name = "runc"
default_runtime_name = "nvidia"
{"gpu":{"driver_version":"535.171.04","cuda_version":"12020","attached_gpus":1,"details":[{"product_name":"NVIDIA A4000","fb_memory_usage":{"total":"16376 MiB","used":"268 MiB","free":"16107 MiB"},"bar1_memory_usage":{"total":"256 MiB","used":"2 MiB","free":"253 MiB"}}]},"cpu_name":"AMD"}
But something is still odd: the CPU name is still not recognized, there are still no jobs, and only one machine node is shown on the orchestrator:
{
"node_id": "040b84e0b707b7860f688837933fc79c21b403e8b4a99981daff08a86a52dd14e23a7c37befab3858633678a325b10a90467fb83d73424cbd022bff17ddc5863cd",
"region": "North Rhine-Westphalia-DE",
"cluster_info": [
{
"machine_id": "6bf3e53ddf306fdb450cd2336d41e844",
"cpu_name": "",
"cpu": {
"total": "10",
"used": "7",
"free": "3"
},
"vcpu": {
"total": "10",
"used": "7",
"free": "3"
},
"memory": {
"total": "11.00 GiB",
"used": "0.00 GiB",
"free": "11.00 GiB"
},
"gpu": {
"driver_version": "",
"cuda_version": "",
"attached_gpus": 0,
"details": null
},
"storage": {
"total": "176.00 GiB",
"used": "0.00 GiB",
"free": "176.00 GiB"
}
},
{
"machine_id": "6bf3e53ddf306fdb450cd2336d41e844",
"cpu_name": "",
"cpu": {
"total": "64",
"used": "6",
"free": "58"
},
"vcpu": {
"total": "64",
"used": "6",
"free": "58"
},
"memory": {
"total": "126.00 GiB",
"used": "0.00 GiB",
"free": "126.00 GiB"
},
"gpu": {
"driver_version": "",
"cuda_version": "",
"attached_gpus": 0,
"details": []
},
"storage": {
"total": "437.00 GiB",
"used": "0.00 GiB",
"free": "437.00 GiB"
}
},
{
"machine_id": "6bf3e53ddf306fdb450cd2336d41e844",
"cpu_name": "",
"cpu": {
"total": "24",
"used": "3",
"free": "21"
},
"vcpu": {
"total": "24",
"used": "3",
"free": "21"
},
"memory": {
"total": "31.00 GiB",
"used": "0.00 GiB",
"free": "31.00 GiB"
},
"gpu": {
"driver_version": "535.171.04",
"cuda_version": "12020",
"attached_gpus": 1,
"details": [
{
"product_name": "NVIDIA A4000",
"status": "available",
"fb_memory_usage": {
"total": "16376 MiB",
"used": "268 MiB",
"free": "16107 MiB"
},
"bar1_memory_usage": {
"total": "256 MiB",
"used": "2 MiB",
"free": "253 MiB"
}
}
]
},
"storage": {
"total": "437.00 GiB",
"used": "0.00 GiB",
"free": "437.00 GiB"
}
}
],
"multi_address": "/ip4/XXX/tcp/8085",
"node_name": "ThomasBlock.io"
}
So what about this error? I don't have Docker installed, only Kubernetes and containerd. Do I need Docker?
time="2024-05-03 13:58:39.609" level=error msg="Failed get image list, error: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" func=CleanResource file="docker_service.go:319"
Yes, Docker is required: the CP uses it to build images. For running, either containerd or Docker is allowed as the runtime.
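A quick way to confirm the CP can reach the Docker daemon (a sketch, assuming a systemd-based host; the socket path is the one from your error message):
sudo systemctl enable --now docker
sudo docker info --format '{{.ServerVersion}}'    # should print a version instead of the unix:///var/run/docker.sock error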
Okay, Docker is working now, no errors. But still no jobs are accepted, so what can I do?
I see TLS handshake errors (every time 38.104.153.43 connects, there are two events: one TLS error and one 200 access). SSL is working for me, so what could be the problem?
[GIN] 2024/05/04 - 20:11:40 | 200 | 29.702381ms | 38.104.153.43 | GET "/api/v1/computing/cp"
[GIN] 2024/05/04 - 20:14:14 | 200 | 18.121µs | 38.104.153.43 | GET "/api/v1/computing/host/info"
[GIN] 2024/05/04 - 20:14:18 | 200 | 15.251µs | 38.104.153.43 | GET "/api/v1/computing/host/info"
2024/05/04 20:18:40 http: TLS handshake error from 38.104.153.43:46942: remote error: tls: bad certificate
[GIN] 2024/05/04 - 20:18:40 | 200 | 21.623729ms | 38.104.153.43 | GET "/api/v1/computing/cp"
2024/05/04 20:21:31 http: TLS handshake error from 38.104.153.43:54128: remote error: tls: bad certificate
[GIN] 2024/05/04 - 20:21:31 | 200 | 19.620076ms | 38.104.153.43 | GET "/api/v1/computing/cp"
I found this in your docs. Is it relevant for getting jobs, or only for displaying job output in the browser? I use Certbot.
Q: What are the requirements for SSL certificates needed in CP?
A: Please use certificates issued by trusted Certificate Authorities (CA). Currently, certificates generated by Certbot are not functioning properly.
Otherwise, the application won't be displayed correctly on the Space App page.
I just noticed that the presentation on the orchestrator is different for every provider. We only have one machine node on Proxima, whereas we had several on Saturn.
I also noticed that providers with less than 64 GB of RAM get zero jobs. Is that the secret? Is the RAM requirement now enforced?
There are some issues on the Orchestrator server; we are in contact with that team to solve it. It is not a CP issue.
Where can you find the node on Saturn?
The orchestrator has some issues, so FCPs cannot get any new jobs. The orchestrator team will solve it ASAP.
I was speaking about the page in the past. Here is an old screenshot: every Kubernetes node was listed with separate CPU, RAM and GPU.
So what is the timeline for this? @sonic-chain @Normalnoise The Atom Accelerator started two weeks ago, and I still could not accept a single FCP job.
This issue was solved a while ago. Can you still not receive any tasks?
I think you can follow these steps to check your FCP:
- firstly, please make sure your CP config points to the Proxima RPC and collateral contract:
[RPC]
SWAN_TESTNET = "https://rpc-proxima.swanchain.io"
[CONTRACT]
SWAN_COLLATERAL_CONTRACT = "0xfD9190027cd42Fc4f653Dfd9c4c45aeBAf0ae063"
- secondly, please ensure your owner has enough collateral:
computing-provider collateral info
- thirdly, please follow the steps to test your FCP:
https://docs.swanchain.io/orchestrator/as-a-computing-provider/fcp-fog-computing-provider/faq#q-how-can-i-know-if-the-status-of-the-computing-provider-is-normal
If you still cannot get any task, please provide a screenshot and the CP's log.
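As an additional sanity check, the endpoint the orchestrator polls should answer from outside (a sketch only; replace the placeholder with the domain and port your CP is exposed on):
# the same path that shows up as GET "/api/v1/computing/host/info" in your GIN access log
curl -sk https://<your-cp-domain>/api/v1/computing/host/info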
@Normalnoise @sonic-chain Thank you for the reply. I checked every part of my setup again.
It really looks like an orchestrator problem. Can you check it on that side? Node 04e09e3106afc4c878e211c222ae8b0c9640ea47d38d16feddf0630d267eb2c217efe74eda8a71d44b0cb4fd3eec23cc9e6cc2a43bf54f89c08e87150937207da2
[HUB] VerifySign = false
runs successfully: I now have a Minesweeper app running. What I still see as a problem:
kubectl logs -n kube-system resource-exporter-ds-kxv96
The node not found nvm libnvidia, if the node does not have a GPU, this error can be ignored.
{"gpu":{"driver_version":"","cuda_version":"","attached_gpus":0,"details":null},"cpu_name":"INTEL"}
I also switched to a bigger setup to make sure I am not limited by the "at least 64 GB RAM" requirement.
I can pick my server on Lagrange, but still nothing arrives on my computing provider.
So maybe it is related to this TLS handshake error? Why can't you support Certbot? It's a widely used system, and I have no other way to get certificates.
Q: What are the requirements for SSL certificates needed in CP?
A: Please use certificates issued by trusted Certificate Authorities (CA). Currently, certificates generated by Certbot are not functioning properly.
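For reference, the certificate chain the CP actually serves can be inspected roughly like this (placeholder domain; the port depends on whether nginx or the CP itself terminates TLS):
openssl s_client -connect <your-cp-domain>:443 -servername <your-cp-domain> </dev/null 2>/dev/null | openssl x509 -noout -issuer -subject -dates
# "remote error: tls: bad certificate" in the CP log means the connecting client rejected this chain, so the issuer line is the interesting part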
Here are some new logs. Things are happening, but still no active deployments.
time="2024-05-22 06:02:41.541" level=info msg="file name:1_6b658295-5949-4263-bd3e-f1dbb145b4a9.json, chunk size:967" func=func1 file="file.go:248"
time="2024-05-22 06:04:47.712" level=info msg="jobuuid: 5951a3a7-7898-47b0-9235-1e12649af534 successfully submitted to IPFS" func=submitJob file="cp_service.go:185"
time="2024-05-22 06:04:47.986" level=info msg="submit job detail: {UUID:5951a3a7-7898-47b0-9235-1e12649af534 Name:Job-6d444ddf-fb40-4eee-a08b-51ac3614f3f2 Status:submitted D
uration:3600 JobSourceURI:https://data.mcs.lagrangedao.org/ipfs/QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq JobResultURI:https://2d2faccf2937.acl.swanipfs.com/ipfs/QmfW9Q
z9SN2aHUzMPE9cn9mdpVPsFRBTYvVKYmTg4u9Cgp StorageSource:lagrange TaskUUID:b5081241-fb91-4ae2-a63c-160a94a672d7 CreatedAt: UpdatedAt:1716350407 BuildLog:wss://log.computeprovi
der.com:8086/api/v1/computing/lagrange/spaces/log?space_id=QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq&type=build ContainerLog:wss://log.computeprovider.com:8086/api/v1/c
omputing/lagrange/spaces/log?space_id=QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq&type=container NodeIdJobSourceUriSignature:0xe80e592207f9850c079383794926926de2cbc848141
80f8f5e7c9c8069c3a77f6348a82b2e0a9e4a6237877728a6b8098f374a6cc52b3777f0271b80079952871c JobRealUri:https://jkqvam1yyb.computeprovider.com}" func=ReceiveJob file="cp_service.
go:152"
[GIN] 2024/05/22 - 06:04:47 | 200 | 4m40s | 184.147.89.2 | POST "/api/v1/computing/lagrange/jobs"
{"stream":" ---\u003e Running in 9161989555b9\n"}
{"stream":" ---\u003e dc744a185552\n"}
{"aux":{"ID":"sha256:dc744a1855526c8fdcef436646ee92033b45e29b5150e1491224f4c8d887f273"}}
{"stream":"Successfully built dc744a185552\n"}
{"stream":"Successfully tagged lagrange/hello-task-5377de0020c4:1716346793\n"}
time="2024-05-22 05:00:26.367" level=info msg="Start deleting space service, space_uuid: 868ab7df-b03f-4c9f-9360-5377de0020c4" func=deleteJob file="cp_service.go:782"
time="2024-05-22 05:00:35.383" level=info msg="Deleted space service finished, space_uuid: 868ab7df-b03f-4c9f-9360-5377de0020c4" func=deleteJob file="cp_service.go:839"
time="2024-05-22 05:00:35.397" level=info msg="Created deployment: deploy-868ab7df-b03f-4c9f-9360-5377de0020c4" func=DockerfileToK8s file="deploy.go:176"
time="2024-05-22 05:01:18.708" level=info msg="file name:1_fde55a1c-41b1-4efe-89c1-9a6aa1ed14bb.json, chunk size:967" func=func1 file="file.go:248"
time="2024-05-22 05:01:34.160" level=info msg="Job received Data: {UUID:88675737-9917-4238-9f25-23372e9a06a4 Name:Job-0ba901b0-3957-4fd1-b9d3-36658bb5988c Status:Submitted D
uration:3600 JobSourceURI:https://data.mcs.lagrangedao.org/ipfs/QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq JobResultURI: StorageSource:lagrange TaskUUID:27887610-1ab1-45
f8-a448-942f9d98a360 CreatedAt: UpdatedAt: BuildLog: ContainerLog: NodeIdJobSourceUriSignature:0xe80e592207f9850c079383794926926de2cbc84814180f8f5e7c9c8069c3a77f6348a82b2e0a
9e4a6237877728a6b8098f374a6cc52b3777f0271b80079952871c JobRealUri:}" func=ReceiveJob file="cp_service.go:74"
time="2024-05-22 05:01:34.479" level=error msg="space API response not OK. Status Code: 404" func=ReceiveJob file="cp_service.go:106"
[GIN] 2024/05/22 - 05:01:34 | 500 | 319.10993ms | 184.147.89.2 | POST "/api/v1/computing/lagrange/jobs"
time="2024-05-22 04:59:52.670" level=info msg="Job received Data: {UUID:e5608334-2ec5-42c8-b05c-96712826df7c Name:Job-21b372d6-0d04-4798-bfbf-1ce020bd8709 Status:Submitted D
uration:3600 JobSourceURI:https://data.mcs.lagrangedao.org/ipfs/QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq JobResultURI: StorageSource:lagrange TaskUUID:5c9d5bb8-4725-4b
e7-8f56-f0ef114e8118 CreatedAt: UpdatedAt: BuildLog: ContainerLog: NodeIdJobSourceUriSignature:0xe80e592207f9850c079383794926926de2cbc84814180f8f5e7c9c8069c3a77f6348a82b2e0a
9e4a6237877728a6b8098f374a6cc52b3777f0271b80079952871c JobRealUri:}" func=ReceiveJob file="cp_service.go:74"
[GIN] 2024/05/22 - 04:59:53 | 200 | 21.741µs | 38.104.153.43 | GET "/api/v1/computing/host/info"
time="2024-05-22 04:59:53.022" level=info msg="checkResourceAvailableForSpace: needCpu: 4, needMemory: 4.00, needStorage: 5.00" func=checkResourceAvailableForSpace file="cp_
service.go:921"
time="2024-05-22 04:59:53.022" level=info msg="checkResourceAvailableForSpace: remainingCpu: 57, remainingMemory: 132.00, remainingStorage: 176.00" func=checkResourceAvailab
leForSpace file="cp_service.go:922"
time="2024-05-22 04:59:53.023" level=info msg="submitting job..." func=submitJob file="cp_service.go:157"
time="2024-05-22 04:59:53.023" level=info msg="uploading file to bucket, objectName: jobs/fde55a1c-41b1-4efe-89c1-9a6aa1ed14bb.json, filePath: /tmp/jobs/fde55a1c-41b1-4efe-8
9c1-9a6aa1ed14bb.json" func=UploadFileToBucket file="storage_service.go:52"
time="2024-05-22 04:59:53.129" level=info msg="uuid: 868ab7df-b03f-4c9f-9360-5377de0020c4, spaceName: Hello-Task, hardwareName: CPU only · 4 vCPU · 4 GiB" func=DeploySpaceTa
sk file="cp_service.go:738"
2024/05/22 04:59:53 Image path: build/0xaA5812Fb31fAA6C073285acD4cB185dDbeBDC224/spaces/Hello-Task
{"stream":"Step 1/7 : FROM python:3.9"}
{"stream":"\n"}
{"status":"Pulling from library/python","id":"3.9"}
{"status":"Pulling fs layer","progressDetail":{},"id":"c6cf28de8a06"}
...
{"stream":" ---\u003e f2540758e105\n"}
{"aux":{"ID":"sha256:f2540758e10579627a789725a2b4fc36c916c26783373f055173e5cf7aa1fe9d"}}
{"stream":"Successfully built f2540758e105\n"}
{"stream":"Successfully tagged lagrange/hello-task-5377de0020c4:1716338729\n"}
time="2024-05-22 02:46:00.098" level=info msg="Start deleting space service, space_uuid: 868ab7df-b03f-4c9f-9360-5377de0020c4" func=deleteJob file="cp_service.go:782"
time="2024-05-22 02:46:01.256" level=error msg="http status: 400 Bad Request, code:400, url:https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=fde78405-f495-4cb2-88ca-eb9ec61afa98&object_name=jobs/137b654d-8a0d-4d93-a655-12947e41baf6.json" func=HttpRequest file="restful.go:127"
time="2024-05-22 02:46:01.256" level=error msg="https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=fde78405-f495-4cb2-88ca-eb9ec61afa98&object_name=jobs/137b654d-8a0d-4d93-a655-12947e41baf6.json failed, status:error, message:invalid param value:record not found" func=HttpRequest file="restful.go:154"
time="2024-05-22 02:46:01.256" level=error msg="https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=fde78405-f495-4cb2-88ca-eb9ec61afa98&object_name=jobs/137b654d-8a0d-4d93-a655-12947e41baf6.json failed, status:error, message:invalid param value:record not found" func=HttpGet file="restful.go:64"
time="2024-05-22 02:46:09.114" level=info msg="Deleted space service finished, space_uuid: 868ab7df-b03f-4c9f-9360-5377de0020c4" func=deleteJob file="cp_service.go:839"
time="2024-05-22 02:46:09.133" level=info msg="Created deployment: deploy-868ab7df-b03f-4c9f-9360-5377de0020c4" func=DockerfileToK8s file="deploy.go:176"
time="2024-05-22 02:47:33.213" level=info msg="file name:1_137b654d-8a0d-4d93-a655-12947e41baf6.json, chunk size:967" func=func1 file="file.go:248"
time="2024-05-22 02:48:42.318" level=info msg="jobuuid: aab61ae7-6cb0-4113-b970-088f1f9809a7 successfully submitted to IPFS" func=submitJob file="cp_service.go:185"
time="2024-05-22 02:48:42.588" level=info msg="submit job detail: {UUID:aab61ae7-6cb0-4113-b970-088f1f9809a7 Name:Job-25931c1a-221d-4f50-bc5b-42c0c821bd42 Status:submitted Duration:3600 JobSourceURI:https://data.mcs.lagrangedao.org/ipfs/QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq JobResultURI:https://2d2faccf2937.acl.swanipfs.com/ipfs/QmZSEKgJh7PR9fairFWesBMWWKjafRqx367wUtF9pGpn6h StorageSource:lagrange TaskUUID:1e69456a-06b4-4743-b760-66cf739adb99 CreatedAt: UpdatedAt:1716338728 BuildLog:wss://log.computeprovider.com:8086/api/v1/computing/lagrange/spaces/log?space_id=QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq&type=build ContainerLog:wss://log.computeprovider.com:8086/api/v1/computing/lagrange/spaces/log?space_id=QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq&type=container NodeIdJobSourceUriSignature:0xe80e592207f9850c079383794926926de2cbc84814180f8f5e7c9c8069c3a77f6348a82b2e0a9e4a6237877728a6b8098f374a6cc52b3777f0271b80079952871c JobRealUri:https://k4ct7uu41b.computeprovider.com}" func=ReceiveJob file="cp_service.go:152"
[GIN] 2024/05/22 - 02:48:42 | 200 | 3m14s | 184.147.89.2 | POST "/api/v1/computing/lagrange/jobs"
The problem you encounter is probably an error in your nginx; I've encountered it before. Check https://docs.swanchain.io/orchestrator/as-a-computing-provider/fcp-fog-computing-provider/faq
@harleyLuke Thank you for the feedback. I have read the FAQ multiple times; nginx is not mentioned there. What steps did you take to fix your error?
When running the test task, check whether the exposed JobRealUri can be accessed, for example: https://xxxxxxxxxx.computeprovider.com
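For example, taking one of the JobRealUri values from the CP log above (replace with your own):
curl -kI https://jkqvam1yyb.computeprovider.com/
# any HTTP response means nginx and the ingress are wired up; a 502 usually means nginx cannot reach the ingress-nginx controller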
Yeah, the good news is that I have had deployments since yesterday. The bad news is that the ingress URL https://xxxxxxxxxx.computeprovider.com/ only delivers:
502 Bad Gateway
nginx/1.18.0 (Ubuntu)
kubectl logs -n ingress-nginx ingress-nginx-controller-7fb8b84675-qmtwq
I0525 12:44:49.667638 7 store.go:433] "Found valid IngressClass" ingress="ns-0x098b7ae10c02038079a741d2be7df599d38aa7d5/ing-d06b3558-3594-4948-8bb2-95a860646438" ingressclass="nginx"
I0525 12:44:49.667805 7 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"ns-0x098b7ae10c02038079a741d2be7df599d38aa7d5", Name:"ing-d06b3558-3594-4948-8bb2-95a860646438", UID:"dabebfcd-a5d6-47ce-af0e-1940f8b7c8c6", APIVersion:"networking.k8s.io/v1", ResourceVersion:"5194551", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
Now I am one step further and could fix the 502 to get the ingress running. The problem was that nginx on node1 could not talk to the ingress-nginx-controller on node3. How to fix:
kubectl delete svc ingress-nginx-controller -n ingress-nginx
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.7.1/deploy/static/provider/cloud/deploy.yaml
kubectl get svc -n ingress-nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller LoadBalancer 10.233.27.66 <pending> 80:32382/TCP,443:31573/TCP 19s
kubectl get pods -n ingress-nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ingress-nginx-controller-7fb8b84675-qmtwq 1/1 Running 0 2d22h 10.233.71.45 node3 <none> <none>
nano /etc/nginx/conf.d/computeprovider.conf
proxy_pass http://node3:32382;
nginx -s reload
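To double-check the chain before reloading, something like this helps (a sketch; node name and NodePort taken from the service output above):
sudo nginx -t                   # validate the edited config first
curl -I http://node3:32382      # from node1: any HTTP response (even 404) confirms the ingress NodePort is reachable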
Success story: after 4 weeks of debugging, I got my computing provider running. I can deploy AI tasks on my 4090 GPU.
I invested around 100 hours of my time into the Saturn testnet, and now 50 hours into Proxima. I am very grateful that this effort is reflected in the Saturn rewards.
Going to mainnet, I want to emphasize that support and documentation are crucial for such a project. I see a lot of room for improvement here. My impression was that in the last four weeks nobody from the team had the time to look into my problems for longer than three minutes. I see Leoj doing a lot of coding and giving "quick support hints" - thank you for that! But it would be really nice if you had more computing provider support staff.
If there are any related issues in the community, please help with them. We will invest more time in community problem solving.
I made a fresh provider install with Ansible.
Neither the CPU nor the GPU is detected. What can I do?
Also strange: I have no tasks, but 3-7 cores are already shown as allocated.