swanchain / go-computing-provider

A golang implementation of computing provider
MIT License

FCP no jobs: "cpu_name" / "gpu" empty #58

Closed ThomasBlock closed 6 months ago

ThomasBlock commented 7 months ago

I made a fresh provider install with Ansible.

The CPU is not detected, and neither is the GPU. What can I do?

Also strange: I have no tasks, but 3-7 cores are already allocated.

kubectl get po -A -o wide
NAMESPACE         NAME                                        READY   STATUS      RESTARTS       AGE    IP               NODE     NOMINATED NODE   READINESS GATES
default           gpu-pod                                     0/1     Pending     0              118m   <none>           <none>   <none>           <none>
ingress-nginx     ingress-nginx-admission-create-rvbrw        0/1     Completed   0              52m    10.233.75.4      node2    <none>           <none>
ingress-nginx     ingress-nginx-admission-patch-z2psj         0/1     Completed   0              52m    10.233.75.5      node2    <none>           <none>
ingress-nginx     ingress-nginx-controller-7fb8b84675-nq7hr   1/1     Running     0              52m    10.233.75.6      node2    <none>           <none>
kube-system       calico-kube-controllers-794577df96-88pzr    1/1     Running     0              146m   10.233.75.1      node2    <none>           <none>
kube-system       calico-node-4xf8v                           1/1     Running     0              146m   192.168.128.62   node2    <none>           <none>
kube-system       calico-node-mqlpr                           1/1     Running     0              146m   192.168.128.61   node1    <none>           <none>
kube-system       calico-node-qj8kr                           1/1     Running     1 (114m ago)   146m   192.168.128.63   node3    <none>           <none>
kube-system       coredns-5c469774b8-j7qp9                    1/1     Running     0              145m   10.233.102.129   node1    <none>           <none>
kube-system       coredns-5c469774b8-mpwkw                    1/1     Running     0              145m   10.233.75.2      node2    <none>           <none>
kube-system       dns-autoscaler-f455cf558-lkff6              1/1     Running     0              145m   10.233.102.130   node1    <none>           <none>
kube-system       kube-apiserver-node1                        1/1     Running     1              147m   192.168.128.61   node1    <none>           <none>
kube-system       kube-controller-manager-node1               1/1     Running     2              147m   192.168.128.61   node1    <none>           <none>
kube-system       kube-proxy-gbft5                            1/1     Running     0              146m   192.168.128.62   node2    <none>           <none>
kube-system       kube-proxy-jb6p8                            1/1     Running     0              146m   192.168.128.61   node1    <none>           <none>
kube-system       kube-proxy-tvr6b                            1/1     Running     1 (114m ago)   146m   192.168.128.63   node3    <none>           <none>
kube-system       kube-scheduler-node1                        1/1     Running     1              147m   192.168.128.61   node1    <none>           <none>
kube-system       nginx-proxy-node2                           1/1     Running     0              146m   192.168.128.62   node2    <none>           <none>
kube-system       nginx-proxy-node3                           1/1     Running     1 (114m ago)   146m   192.168.128.63   node3    <none>           <none>
kube-system       nodelocaldns-dm8l2                          1/1     Running     0              145m   192.168.128.62   node2    <none>           <none>
kube-system       nodelocaldns-l87xz                          1/1     Running     1 (114m ago)   145m   192.168.128.63   node3    <none>           <none>
kube-system       nodelocaldns-nh595                          1/1     Running     0              145m   192.168.128.61   node1    <none>           <none>
kube-system       nvidia-device-plugin-daemonset-4vv2g        1/1     Running     0              117m   10.233.75.3      node2    <none>           <none>
kube-system       nvidia-device-plugin-daemonset-s852c        1/1     Running     1 (114m ago)   117m   10.233.71.2      node3    <none>           <none>
kube-system       resource-exporter-ds-4kxrk                  1/1     Running     0              43m    10.233.75.7      node2    <none>           <none>
kube-system       resource-exporter-ds-rjbhq                  1/1     Running     0              43m    10.233.71.3      node3    <none>           <none>
tigera-operator   tigera-operator-549d4f9bdb-tqkbm            1/1     Running     0              137m   192.168.128.62   node2    <none>           <none>
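
(I assume the cores counted as "used" in the JSON below come from the resource requests of these system pods. One way to check per node:)

kubectl describe node node2 | grep -A 8 "Allocated resources"
# shows the CPU/memory requests and limits currently reserved on node2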
{
  "node_id": "040b84e0b707b7860f688837933fc79c21b403e8b4a99981daff08a86a52dd14e23a7c37befab3858633678a325b10a90467fb83d73424cbd022bff17ddc5863cd",
  "region": "North Rhine-Westphalia-DE",
  "cluster_info": [
    {
      "machine_id": "6bf3e53ddf306fdb450cd2336d41e844",
      "cpu_name": "",
      "cpu": {
        "total": "10",
        "used": "7",
        "free": "3"
      },
      "vcpu": {
        "total": "10",
        "used": "7",
        "free": "3"
      },
      "memory": {
        "total": "11.00 GiB",
        "used": "0.00 GiB",
        "free": "11.00 GiB"
      },
      "gpu": {
        "driver_version": "",
        "cuda_version": "",
        "attached_gpus": 0,
        "details": null
      },
      "storage": {
        "total": "176.00 GiB",
        "used": "0.00 GiB",
        "free": "176.00 GiB"
      }
    },
    {
      "machine_id": "6bf3e53ddf306fdb450cd2336d41e844",
      "cpu_name": "",
      "cpu": {
        "total": "64",
        "used": "6",
        "free": "58"
      },
      "vcpu": {
        "total": "64",
        "used": "6",
        "free": "58"
      },
      "memory": {
        "total": "126.00 GiB",
        "used": "0.00 GiB",
        "free": "126.00 GiB"
      },
      "gpu": {
        "driver_version": "",
        "cuda_version": "",
        "attached_gpus": 0,
        "details": []
      },
      "storage": {
        "total": "437.00 GiB",
        "used": "0.00 GiB",
        "free": "437.00 GiB"
      }
    },
    {
      "machine_id": "6bf3e53ddf306fdb450cd2336d41e844",
      "cpu_name": "",
      "cpu": {
        "total": "24",
        "used": "3",
        "free": "21"
      },
      "vcpu": {
        "total": "24",
        "used": "3",
        "free": "21"
      },
      "memory": {
        "total": "31.00 GiB",
        "used": "0.00 GiB",
        "free": "31.00 GiB"
      },
      "gpu": {
        "driver_version": "",
        "cuda_version": "",
        "attached_gpus": 0,
        "details": []
      },
      "storage": {
        "total": "437.00 GiB",
        "used": "0.00 GiB",
        "free": "437.00 GiB"
      }
    }
  ],
sonic-chain commented 6 months ago

Is the resource-exporter image version filswan/resource-exporter:v11.2.6? Can you post the log of the resource-exporter pod on that machine?

ThomasBlock commented 6 months ago

> Is the resource-exporter image version filswan/resource-exporter:v11.2.6? Can you post the log of the resource-exporter pod on that machine?

Yes, I have v11.2.6.

kubectl get nodes

NAME    STATUS   ROLES           AGE    VERSION
node1   Ready    control-plane   3d1h   v1.27.7
node2   Ready    <none>          3d1h   v1.27.7
node3   Ready    <none>          3d1h   v1.27.7

kubectl describe nodes | grep Taints
Taints:             role=blocked:NoSchedule
Taints:             <none>
Taints:             <none>
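
(The taint on node1 is intentional here, see below. For reference, a taint can be removed by appending a dash to it:)

kubectl taint nodes node1 role=blocked:NoSchedule-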

I see this log in the CP, although I am root and we are using Kubernetes. What could be missing?

time="2024-05-02 20:46:29.818" level=error msg="Failed get image list, error: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" func=CleanResource file="docker_service.go:319"

node1: only ingress, tainted

node2: Intel, no GPU

kubectl logs resource-exporter-ds-4kxrk -n kube-system
The node not found nvm libnvidia, if the node does not have a GPU, this error can be ignored.
{"gpu":{"driver_version":"","cuda_version":"","attached_gpus":0,"details":null},"cpu_name":"INTEL"}

node3: AMD + GPU

nvidia-smi -L
GPU 0: NVIDIA RTX A4000 (UUID: GPU-976b5f8c-fbec-09a2-5aae-1c0b6dffe3ce)
kubectl logs -n kube-system resource-exporter-ds-rjbhq
The node not found nvm libnvidia, if the node does not have a GPU, this error can be ignored.
{"gpu":{"driver_version":"","cuda_version":"","attached_gpus":0,"details":null},"cpu_name":"AMD"}
kubectl logs -n kube-system nvidia-device-plugin-daemonset-s852c
I0429 18:04:40.162448       1 main.go:178] Starting FS watcher.
I0429 18:04:40.162508       1 main.go:185] Starting OS watcher.
I0429 18:04:40.162742       1 main.go:200] Starting Plugins.
I0429 18:04:40.162749       1 main.go:257] Loading configuration.
I0429 18:04:40.162962       1 main.go:265] Updating config with default resource matching patterns.
I0429 18:04:40.163438       1 main.go:276] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0429 18:04:40.163450       1 main.go:279] Retrieving plugins.
W0429 18:04:40.163840       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0429 18:04:40.163876       1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0429 18:04:40.163905       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0429 18:04:40.163913       1 factory.go:112] Incompatible platform detected
E0429 18:04:40.163916       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0429 18:04:40.163920       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0429 18:04:40.163923       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0429 18:04:40.163930       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0429 18:04:40.163935       1 main.go:308] No devices found. Waiting indefinitely.
sonic-chain commented 6 months ago

The nvidia-device-plugin log above gives the hint: on some machines the NVIDIA environment has not been installed properly. Follow the prompt to reinstall it, then use the official test method to verify. Refer to https://github.com/NVIDIA/k8s-device-plugin#quick-start
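
For reference, the quick-start test from that page looks roughly like this (a sketch: the pod name gpu-test is arbitrary, and the sample image tag is taken from the device-plugin README, so it may need adjusting to your driver/CUDA version):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

kubectl logs gpu-test   # should print "Test PASSED" once the pod completes on a working GPU node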

ThomasBlock commented 6 months ago

> The nvidia-device-plugin log above gives the hint: on some machines the NVIDIA environment has not been installed properly. Follow the prompt to reinstall it, then use the official test method to verify. Refer to https://github.com/NVIDIA/k8s-device-plugin#quick-start

Yes, thank you. I could fix the GPU with:

nano /etc/containerd/config.toml
    [plugins."io.containerd.grpc.v1.cri".containerd]
      #default_runtime_name = "runc"
      default_runtime_name = "nvidia"

{"gpu":{"driver_version":"535.171.04","cuda_version":"12020","attached_gpus":1,"details":[{"product_name":"NVIDIA A4000","fb_memory_usage":{"total":"16376 MiB","used":"268 MiB","free":"16107 MiB"},"bar1_memory_usage":{"total":"256 MiB","used":"2 MiB","free":"253 MiB"}}]},"cpu_name":"AMD"}

But something is still odd: the CPU is still not recognized, there are still no jobs, and only one node shows up on the orchestrator.

[screenshot]

{
  "node_id": "040b84e0b707b7860f688837933fc79c21b403e8b4a99981daff08a86a52dd14e23a7c37befab3858633678a325b10a90467fb83d73424cbd022bff17ddc5863cd",
  "region": "North Rhine-Westphalia-DE",
  "cluster_info": [
    {
      "machine_id": "6bf3e53ddf306fdb450cd2336d41e844",
      "cpu_name": "",
      "cpu": {
        "total": "10",
        "used": "7",
        "free": "3"
      },
      "vcpu": {
        "total": "10",
        "used": "7",
        "free": "3"
      },
      "memory": {
        "total": "11.00 GiB",
        "used": "0.00 GiB",
        "free": "11.00 GiB"
      },
      "gpu": {
        "driver_version": "",
        "cuda_version": "",
        "attached_gpus": 0,
        "details": null
      },
      "storage": {
        "total": "176.00 GiB",
        "used": "0.00 GiB",
        "free": "176.00 GiB"
      }
    },
    {
      "machine_id": "6bf3e53ddf306fdb450cd2336d41e844",
      "cpu_name": "",
      "cpu": {
        "total": "64",
        "used": "6",
        "free": "58"
      },
      "vcpu": {
        "total": "64",
        "used": "6",
        "free": "58"
      },
      "memory": {
        "total": "126.00 GiB",
        "used": "0.00 GiB",
        "free": "126.00 GiB"
      },
      "gpu": {
        "driver_version": "",
        "cuda_version": "",
        "attached_gpus": 0,
        "details": []
      },
      "storage": {
        "total": "437.00 GiB",
        "used": "0.00 GiB",
        "free": "437.00 GiB"
      }
    },
    {
      "machine_id": "6bf3e53ddf306fdb450cd2336d41e844",
      "cpu_name": "",
      "cpu": {
        "total": "24",
        "used": "3",
        "free": "21"
      },
      "vcpu": {
        "total": "24",
        "used": "3",
        "free": "21"
      },
      "memory": {
        "total": "31.00 GiB",
        "used": "0.00 GiB",
        "free": "31.00 GiB"
      },
      "gpu": {
        "driver_version": "535.171.04",
        "cuda_version": "12020",
        "attached_gpus": 1,
        "details": [
          {
            "product_name": "NVIDIA A4000",
            "status": "available",
            "fb_memory_usage": {
              "total": "16376 MiB",
              "used": "268 MiB",
              "free": "16107 MiB"
            },
            "bar1_memory_usage": {
              "total": "256 MiB",
              "used": "2 MiB",
              "free": "253 MiB"
            }
          }
        ]
      },
      "storage": {
        "total": "437.00 GiB",
        "used": "0.00 GiB",
        "free": "437.00 GiB"
      }
    }
  ],
  "multi_address": "/ip4/XXX/tcp/8085",
  "node_name": "ThomasBlock.io"
}
ThomasBlock commented 6 months ago

So what about this error? I don't have Docker installed, only Kubernetes and containerd. Do I need Docker?

time="2024-05-03 13:58:39.609" level=error msg="Failed get image list, error: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" func=CleanResource file="docker_service.go:319"

sonic-chain commented 6 months ago

Yes, Docker is required; the CP uses it to build images. Only containerd or Docker is allowed as the runtime when running.
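
A minimal sketch for installing and verifying Docker on Ubuntu (the package name varies by distro; see the official Docker docs):

sudo apt-get update && sudo apt-get install -y docker.io
sudo systemctl enable --now docker
docker info --format '{{.ServerVersion}}'   # should print a version instead of the socket error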

ThomasBlock commented 6 months ago

> Yes, Docker is required; the CP uses it to build images. Only containerd or Docker is allowed as the runtime when running.

Okay, Docker is working now, no errors. But still no jobs are accepted, so what can I do?

I see TLS handshake errors (every time 38.104.153.43 connects, there are two events: one TLS error and one 200 access). SSL is working for me; what could be the problem?

[GIN] 2024/05/04 - 20:11:40 | 200 |   29.702381ms |   38.104.153.43 | GET      "/api/v1/computing/cp"
[GIN] 2024/05/04 - 20:14:14 | 200 |      18.121µs |   38.104.153.43 | GET      "/api/v1/computing/host/info"
[GIN] 2024/05/04 - 20:14:18 | 200 |      15.251µs |   38.104.153.43 | GET      "/api/v1/computing/host/info"
2024/05/04 20:18:40 http: TLS handshake error from 38.104.153.43:46942: remote error: tls: bad certificate
[GIN] 2024/05/04 - 20:18:40 | 200 |   21.623729ms |   38.104.153.43 | GET      "/api/v1/computing/cp"
2024/05/04 20:21:31 http: TLS handshake error from 38.104.153.43:54128: remote error: tls: bad certificate
[GIN] 2024/05/04 - 20:21:31 | 200 |   19.620076ms |   38.104.153.43 | GET      "/api/v1/computing/cp"

I found this in your docs. Is it relevant for getting jobs, or only for printing job output in the browser? I use Certbot.

Q: What are the requirements for SSL certificates needed in CP? 
A: Please use certificates issued by trusted Certificate Authorities (CA). Currently, certificates generated by Certbot are not functioning properly. 
Otherwise, the application won't be displayed correctly on the Space App page.
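
One way to check which certificate the CP endpoint actually serves (a sketch; your-cp-domain is a placeholder):

openssl s_client -connect your-cp-domain:443 -servername your-cp-domain </dev/null 2>/dev/null | openssl x509 -noout -issuer -subject -dates
# a client-side "bad certificate" can also mean an incomplete chain,
# e.g. nginx serving Certbot's cert.pem instead of fullchain.pem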
ThomasBlock commented 6 months ago

I just noticed that the presentation on the orchestrator is different for every provider: we only have one machine node on Proxima, whereas we had several on Saturn...

[screenshot]

I also noticed that providers with less than 64 GB of RAM get zero jobs. Is that the secret? Is RAM now enforced?

[screenshot]

[screenshot]

Normalnoise commented 6 months ago

> Okay, Docker is working now, no errors. But still no jobs are accepted, so what can I do?
>
> I see TLS handshake errors (every time 38.104.153.43 connects, there are two events: one TLS error and one 200 access). SSL is working for me; what could be the problem? [...]

There are some issues on the Orchestrator server; we are in contact with them to solve it. It is not a CP issue.

Normalnoise commented 6 months ago

> I just noticed that the presentation on the orchestrator is different for every provider: we only have one machine node on Proxima, whereas we had several on Saturn... [...]

Where can you find the node on Saturn?

Normalnoise commented 6 months ago

The orchestrator has some issues, so FCPs cannot get any new jobs. The orchestrator team will solve it ASAP.

ThomasBlock commented 6 months ago

> I just noticed that the presentation on the orchestrator is different for every provider: we only have one machine node on Proxima, whereas we had several on Saturn... [...]

> Where can you find the node on Saturn?

I was speaking about the page in the past; here is an old screenshot. Every Kubernetes node is listed with separate CPU, RAM, and GPU.

[screenshot from 2024-04-28]

ThomasBlock commented 6 months ago

> The orchestrator has some issues, so FCPs cannot get any new jobs. The orchestrator team will solve it ASAP.

So what is the timeline for this? @sonic-chain @Normalnoise Atom Accelerator started two weeks ago, and I still could not accept a single FCP job.

Normalnoise commented 6 months ago

This issue has been solved for a while now; can you still not receive any task?

Normalnoise commented 6 months ago

I think you can follow these steps to check your FCP:

- first, please ensure your config points to the correct collateral contract:

   [CONTRACT] SWAN_COLLATERAL_CONTRACT="0xfD9190027cd42Fc4f653Dfd9c4c45aeBAf0ae063"

- secondly, please ensure your owner has enough collateral:

   computing-provider collateral info

- thirdly, please follow the steps to test your FCP:
   https://docs.swanchain.io/orchestrator/as-a-computing-provider/fcp-fog-computing-provider/faq#q-how-can-i-know-if-the-status-of-the-computing-provider-is-normal

If you still cannot get any task, please provide a screenshot and the CP's log.
ThomasBlock commented 6 months ago

@Normalnoise @sonic-chain Thank you for the reply. I checked every part of my setup again.

It really looks like an orchestrator problem; can you check it on that side? Node: 04e09e3106afc4c878e211c222ae8b0c9640ea47d38d16feddf0630d267eb2c217efe74eda8a71d44b0cb4fd3eec23cc9e6cc2a43bf54f89c08e87150937207da2

[screenshot]

What I see as a problem:

[screenshot]

kubectl logs -n kube-system resource-exporter-ds-kxv96
The node not found nvm libnvidia, if the node does not have a GPU, this error can be ignored.
{"gpu":{"driver_version":"","cuda_version":"","attached_gpus":0,"details":null},"cpu_name":"INTEL"}
ThomasBlock commented 6 months ago

I also switched to a bigger setup to make sure I am not limited by the "at least 64 GB RAM" requirement.

[screenshot]

I can pick my server on Lagrange, but still nothing reaches my computing provider..

[screenshot]

[screenshot]

So maybe it is related to this TLS handshake error? Why can't you support Certbot? It's a widely used system; I have no other way to get certificates..

Q: What are the requirements for SSL certificates needed in CP? 
A: Please use certificates issued by trusted Certificate Authorities (CA). Currently, certificates generated by Certbot are not functioning properly. 
ThomasBlock commented 6 months ago

Here are some new logs. Things are happening, but still no active deployments..

time="2024-05-22 06:02:41.541" level=info msg="file name:1_6b658295-5949-4263-bd3e-f1dbb145b4a9.json, chunk size:967" func=func1 file="file.go:248"
time="2024-05-22 06:04:47.712" level=info msg="jobuuid: 5951a3a7-7898-47b0-9235-1e12649af534 successfully submitted to IPFS" func=submitJob file="cp_service.go:185"
time="2024-05-22 06:04:47.986" level=info msg="submit job detail: {UUID:5951a3a7-7898-47b0-9235-1e12649af534 Name:Job-6d444ddf-fb40-4eee-a08b-51ac3614f3f2 Status:submitted D
uration:3600 JobSourceURI:https://data.mcs.lagrangedao.org/ipfs/QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq JobResultURI:https://2d2faccf2937.acl.swanipfs.com/ipfs/QmfW9Q
z9SN2aHUzMPE9cn9mdpVPsFRBTYvVKYmTg4u9Cgp StorageSource:lagrange TaskUUID:b5081241-fb91-4ae2-a63c-160a94a672d7 CreatedAt: UpdatedAt:1716350407 BuildLog:wss://log.computeprovi
der.com:8086/api/v1/computing/lagrange/spaces/log?space_id=QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq&type=build ContainerLog:wss://log.computeprovider.com:8086/api/v1/c
omputing/lagrange/spaces/log?space_id=QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq&type=container NodeIdJobSourceUriSignature:0xe80e592207f9850c079383794926926de2cbc848141
80f8f5e7c9c8069c3a77f6348a82b2e0a9e4a6237877728a6b8098f374a6cc52b3777f0271b80079952871c JobRealUri:https://jkqvam1yyb.computeprovider.com}" func=ReceiveJob file="cp_service.
go:152"
[GIN] 2024/05/22 - 06:04:47 | 200 |         4m40s |    184.147.89.2 | POST     "/api/v1/computing/lagrange/jobs"
{"stream":" ---\u003e Running in 9161989555b9\n"}
{"stream":" ---\u003e dc744a185552\n"}
{"aux":{"ID":"sha256:dc744a1855526c8fdcef436646ee92033b45e29b5150e1491224f4c8d887f273"}}
{"stream":"Successfully built dc744a185552\n"}
{"stream":"Successfully tagged lagrange/hello-task-5377de0020c4:1716346793\n"}
time="2024-05-22 05:00:26.367" level=info msg="Start deleting space service, space_uuid: 868ab7df-b03f-4c9f-9360-5377de0020c4" func=deleteJob file="cp_service.go:782"
time="2024-05-22 05:00:35.383" level=info msg="Deleted space service finished, space_uuid: 868ab7df-b03f-4c9f-9360-5377de0020c4" func=deleteJob file="cp_service.go:839"
time="2024-05-22 05:00:35.397" level=info msg="Created deployment: deploy-868ab7df-b03f-4c9f-9360-5377de0020c4" func=DockerfileToK8s file="deploy.go:176"
time="2024-05-22 05:01:18.708" level=info msg="file name:1_fde55a1c-41b1-4efe-89c1-9a6aa1ed14bb.json, chunk size:967" func=func1 file="file.go:248"
time="2024-05-22 05:01:34.160" level=info msg="Job received Data: {UUID:88675737-9917-4238-9f25-23372e9a06a4 Name:Job-0ba901b0-3957-4fd1-b9d3-36658bb5988c Status:Submitted D
uration:3600 JobSourceURI:https://data.mcs.lagrangedao.org/ipfs/QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq JobResultURI: StorageSource:lagrange TaskUUID:27887610-1ab1-45
f8-a448-942f9d98a360 CreatedAt: UpdatedAt: BuildLog: ContainerLog: NodeIdJobSourceUriSignature:0xe80e592207f9850c079383794926926de2cbc84814180f8f5e7c9c8069c3a77f6348a82b2e0a
9e4a6237877728a6b8098f374a6cc52b3777f0271b80079952871c JobRealUri:}" func=ReceiveJob file="cp_service.go:74"
time="2024-05-22 05:01:34.479" level=error msg="space API response not OK. Status Code: 404" func=ReceiveJob file="cp_service.go:106"
[GIN] 2024/05/22 - 05:01:34 | 500 |   319.10993ms |    184.147.89.2 | POST     "/api/v1/computing/lagrange/jobs"
time="2024-05-22 04:59:52.670" level=info msg="Job received Data: {UUID:e5608334-2ec5-42c8-b05c-96712826df7c Name:Job-21b372d6-0d04-4798-bfbf-1ce020bd8709 Status:Submitted D
uration:3600 JobSourceURI:https://data.mcs.lagrangedao.org/ipfs/QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq JobResultURI: StorageSource:lagrange TaskUUID:5c9d5bb8-4725-4b
e7-8f56-f0ef114e8118 CreatedAt: UpdatedAt: BuildLog: ContainerLog: NodeIdJobSourceUriSignature:0xe80e592207f9850c079383794926926de2cbc84814180f8f5e7c9c8069c3a77f6348a82b2e0a
9e4a6237877728a6b8098f374a6cc52b3777f0271b80079952871c JobRealUri:}" func=ReceiveJob file="cp_service.go:74"
[GIN] 2024/05/22 - 04:59:53 | 200 |      21.741µs |   38.104.153.43 | GET      "/api/v1/computing/host/info"
time="2024-05-22 04:59:53.022" level=info msg="checkResourceAvailableForSpace: needCpu: 4, needMemory: 4.00, needStorage: 5.00" func=checkResourceAvailableForSpace file="cp_
service.go:921"
time="2024-05-22 04:59:53.022" level=info msg="checkResourceAvailableForSpace: remainingCpu: 57, remainingMemory: 132.00, remainingStorage: 176.00" func=checkResourceAvailab
leForSpace file="cp_service.go:922"
time="2024-05-22 04:59:53.023" level=info msg="submitting job..." func=submitJob file="cp_service.go:157"
time="2024-05-22 04:59:53.023" level=info msg="uploading file to bucket, objectName: jobs/fde55a1c-41b1-4efe-89c1-9a6aa1ed14bb.json, filePath: /tmp/jobs/fde55a1c-41b1-4efe-8
9c1-9a6aa1ed14bb.json" func=UploadFileToBucket file="storage_service.go:52"
time="2024-05-22 04:59:53.129" level=info msg="uuid: 868ab7df-b03f-4c9f-9360-5377de0020c4, spaceName: Hello-Task, hardwareName: CPU only · 4 vCPU · 4 GiB" func=DeploySpaceTa
sk file="cp_service.go:738"
2024/05/22 04:59:53 Image path: build/0xaA5812Fb31fAA6C073285acD4cB185dDbeBDC224/spaces/Hello-Task
{"stream":"Step 1/7 : FROM python:3.9"}
{"stream":"\n"}
{"status":"Pulling from library/python","id":"3.9"}
{"status":"Pulling fs layer","progressDetail":{},"id":"c6cf28de8a06"}
...
{"stream":" ---\u003e f2540758e105\n"}
{"aux":{"ID":"sha256:f2540758e10579627a789725a2b4fc36c916c26783373f055173e5cf7aa1fe9d"}}
{"stream":"Successfully built f2540758e105\n"}
{"stream":"Successfully tagged lagrange/hello-task-5377de0020c4:1716338729\n"}
time="2024-05-22 02:46:00.098" level=info msg="Start deleting space service, space_uuid: 868ab7df-b03f-4c9f-9360-5377de0020c4" func=deleteJob file="cp_service.go:782"
time="2024-05-22 02:46:01.256" level=error msg="http status: 400 Bad Request, code:400, url:https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=fde78405-f495-4cb2-88ca-eb9ec61afa98&object_name=jobs/137b654d-8a0d-4d93-a655-12947e41baf6.json" func=HttpRequest file="restful.go:127"
time="2024-05-22 02:46:01.256" level=error msg="https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=fde78405-f495-4cb2-88ca-eb9ec61afa98&object_name=jobs/137b654d-8a0d-4d93-a655-12947e41baf6.json failed, status:error, message:invalid param value:record not found" func=HttpRequest file="restful.go:154"
time="2024-05-22 02:46:01.256" level=error msg="https://api.swanipfs.com/api/v2/oss_file/get_file_by_object_name?bucket_uid=fde78405-f495-4cb2-88ca-eb9ec61afa98&object_name=jobs/137b654d-8a0d-4d93-a655-12947e41baf6.json failed, status:error, message:invalid param value:record not found" func=HttpGet file="restful.go:64"
time="2024-05-22 02:46:09.114" level=info msg="Deleted space service finished, space_uuid: 868ab7df-b03f-4c9f-9360-5377de0020c4" func=deleteJob file="cp_service.go:839"
time="2024-05-22 02:46:09.133" level=info msg="Created deployment: deploy-868ab7df-b03f-4c9f-9360-5377de0020c4" func=DockerfileToK8s file="deploy.go:176"
time="2024-05-22 02:47:33.213" level=info msg="file name:1_137b654d-8a0d-4d93-a655-12947e41baf6.json, chunk size:967" func=func1 file="file.go:248"
time="2024-05-22 02:48:42.318" level=info msg="jobuuid: aab61ae7-6cb0-4113-b970-088f1f9809a7 successfully submitted to IPFS" func=submitJob file="cp_service.go:185"
time="2024-05-22 02:48:42.588" level=info msg="submit job detail: {UUID:aab61ae7-6cb0-4113-b970-088f1f9809a7 Name:Job-25931c1a-221d-4f50-bc5b-42c0c821bd42 Status:submitted Duration:3600 JobSourceURI:https://data.mcs.lagrangedao.org/ipfs/QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq JobResultURI:https://2d2faccf2937.acl.swanipfs.com/ipfs/QmZSEKgJh7PR9fairFWesBMWWKjafRqx367wUtF9pGpn6h StorageSource:lagrange TaskUUID:1e69456a-06b4-4743-b760-66cf739adb99 CreatedAt: UpdatedAt:1716338728 BuildLog:wss://log.computeprovider.com:8086/api/v1/computing/lagrange/spaces/log?space_id=QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq&type=build ContainerLog:wss://log.computeprovider.com:8086/api/v1/computing/lagrange/spaces/log?space_id=QmWuAvr7cPEsiA1znXZtr8DxQvroggvkonCJ7TvA3yThuq&type=container NodeIdJobSourceUriSignature:0xe80e592207f9850c079383794926926de2cbc84814180f8f5e7c9c8069c3a77f6348a82b2e0a9e4a6237877728a6b8098f374a6cc52b3777f0271b80079952871c JobRealUri:https://k4ct7uu41b.computeprovider.com}" func=ReceiveJob file="cp_service.go:152"
[GIN] 2024/05/22 - 02:48:42 | 200 |         3m14s |    184.147.89.2 | POST     "/api/v1/computing/lagrange/jobs"
harleyLuke commented 6 months ago

> Here are some new logs. Things are happening, but still no active deployments.. [full log quoted above]

The problem you encounter is probably an error in your nginx; I've encountered it before. Check https://docs.swanchain.io/orchestrator/as-a-computing-provider/fcp-fog-computing-provider/faq

ThomasBlock commented 6 months ago

> The problem you encounter is probably an error in your nginx; I've encountered it before. Check https://docs.swanchain.io/orchestrator/as-a-computing-provider/fcp-fog-computing-provider/faq

@harleyLuke Thank you for the feedback. I have read the FAQ multiple times; nginx is not mentioned there. What steps did you take to fix your error?

harleyLuke commented 6 months ago

> @harleyLuke Thank you for the feedback. I have read the FAQ multiple times; nginx is not mentioned there. What steps did you take to fix your error?

When running the test task, check whether the exposed JobRealUri can be accessed, for example: https://xxxxxxxxxx.computeprovider.com
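
For example, a quick check from outside (a sketch; -k skips certificate verification, so routing problems can be separated from TLS problems):

curl -vk https://xxxxxxxxxx.computeprovider.com/
# a 502 here points at the ingress/upstream, a TLS alert points at the certificate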

ThomasBlock commented 6 months ago

Yeah.. the good news is that I have had deployments since yesterday; the bad news is that the ingress URL https://xxxxxxxxxx.computeprovider.com/ only delivers:

502 Bad Gateway
nginx/1.18.0 (Ubuntu)

kubectl logs -n ingress-nginx ingress-nginx-controller-7fb8b84675-qmtwq

I0525 12:44:49.667638       7 store.go:433] "Found valid IngressClass" ingress="ns-0x098b7ae10c02038079a741d2be7df599d38aa7d5/ing-d06b3558-3594-4948-8bb2-95a860646438" ingressclass="nginx"
I0525 12:44:49.667805       7 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"ns-0x098b7ae10c02038079a741d2be7df599d38aa7d5", Name:"ing-d06b3558-3594-4948-8bb2-95a860646438", UID:"dabebfcd-a5d6-47ce-af0e-1940f8b7c8c6", APIVersion:"networking.k8s.io/v1", ResourceVersion:"5194551", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
ThomasBlock commented 6 months ago

Now I am one step further and could fix the 502 to get the ingress running. The problem was that nginx on node1 could not talk to the ingress-nginx-controller on node3. How to fix:

kubectl delete svc ingress-nginx-controller -n ingress-nginx
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.7.1/deploy/static/provider/cloud/deploy.yaml

kubectl get svc -n ingress-nginx
NAME                                 TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller             LoadBalancer   10.233.27.66   <pending>     80:32382/TCP,443:31573/TCP   19s

kubectl get pods -n ingress-nginx -o wide
NAME                                        READY   STATUS    RESTARTS   AGE     IP             NODE    NOMINATED NODE   READINESS GATES
ingress-nginx-controller-7fb8b84675-qmtwq   1/1     Running   0          2d22h   10.233.71.45   node3   <none>           <none>

nano /etc/nginx/conf.d/computeprovider.conf
proxy_pass http://node3:32382; 

nginx -s reload
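
For context, the relevant nginx vhost then looks roughly like this (a sketch; server_name and certificate paths are placeholders):

server {
    listen 443 ssl;
    server_name *.computeprovider.com;
    ssl_certificate     /etc/letsencrypt/live/computeprovider.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/computeprovider.com/privkey.pem;
    location / {
        proxy_pass http://node3:32382;   # NodePort of ingress-nginx-controller on node3
        proxy_set_header Host $host;
    }
}

Note that the NodePort (32382 here) is assigned dynamically, so it changes if the service is ever recreated; with it pinned in nginx, it has to be re-checked after redeploys.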


ThomasBlock commented 6 months ago

Success story: after 4 weeks of debugging, I got my computing provider running. I can deploy AI tasks on my 4090 GPU.

I invested around 100 hours of my time into the Saturn testnet, and now 50 hours into Proxima. I am very grateful that this effort is reflected in the Saturn rewards.

Going into mainnet, I want to emphasize that support and documentation are crucial for such a project. I see a lot of room for improvement here. My impression was that in the last four weeks nobody from the team had time to look into my problems for longer than three minutes. I see Leoj is doing a lot of coding and giving "quick support hints" - thank you for that! But it would be really nice if you had more computing provider support staff.

Normalnoise commented 6 months ago

If there are any related issues in the community, please raise them; we will invest more time in community problem solving.