lagrangedao / go-computing-provider

A golang implementation of computing provider
MIT License

feat: report the CP's resources to the Lagrange server via heartbeat #7

Open sonic-chain opened 1 year ago

sonic-chain commented 1 year ago

The CP heartbeat reports each node's resource information: CPU (number of cores, model), memory size, disk size, and GPU (model, quantity, memory size). Since a k8s cluster has one or more nodes, the report is a list containing each cluster node's resource information.

Normalnoise commented 1 year ago

The CP plans to report its resources to the server via heartbeat. Because one CP may have many nodes (physical machines), the resources will be a resource list, and every resource entry will include:

So the Lagrange server will need to store these resources in the DB and assign computing tasks to a suitable CP. @flin-nbai, please consider this when you design the server DB and bidding system.

flin-nbai commented 1 year ago

@Normalnoise If possible, could you provide a sample request that the CP would send to the server?

Normalnoise commented 1 year ago

@flin-nbai we will provide a sample here once the request is ready. @sonic-chain is working on it right now.

sonic-chain commented 1 year ago

@flin-nbai This is a request sample:

{
    "node_id":"04f532cea9ad16d450e9e2d3f94694aa3549c18c36fa00121ed70a4ab40d64f912d751584a0614d025982d8e00191d844d0371426d7304cc63d1f5abca288480fc",
    "region":"North America",
    "cluster_info":[
        {
            "machine_id":"1421c9f90e414825856f936fa5bbf649",
            "cpu":192,
            "memory":100000,
            "gpu":{
                "model":"Nvidia GeForce RTX 3080:8704",
                "size":4
            },
            "storage":{
                "type":"SSD",
                "size":500000
            }
        },
        {
            "machine_id":"ddee71c469ed4876bcb40f92b0e48a60",
            "cpu":128,
            "memory":100000,
            "gpu":{
                "model":"Nvidia GeForce RTX 2080 Ti:4352",
                "size":1
            },
            "storage":{
                "type":"NVME",
                "size":500000
            }
        }
    ]
}
flin-nbai commented 1 year ago

Some questions:

Normalnoise commented 1 year ago

@flin-nbai

Normalnoise commented 1 year ago

About the CP update API, let's have a discussion.

I think it's better to use a new API to update the CP resources.

flin-nbai commented 1 year ago

We can use the current update API: POST /cp. What do you think? Also, the request should contain a CPU model field.

Normalnoise commented 1 year ago

Yes, I think we can use the current update API: POST /cp.

The CPU model will be included, and every resource should include total and available values:

{
    "node_id":"04f532cea9ad16d450e9e2d3f94694aa3549c18c36fa00121ed70a4ab40d64f912d751584a0614d025982d8e00191d844d0371426d7304cc63d1f5abca288480fc",
    "region":"North America",
    "cluster_info":[
        {
            "machine_id":"1421c9f90e414825856f936fa5bbf649",
            "cpu":{
                "model":"AMD 7542",
                "total_nums":192,
                "available_nums":100
            },
            "memory":{
                "total_memory":100000,
                "available_memory":50000
            },
            "gpu":{
                "model":"Nvidia GeForce RTX 3080:8704",
                "total_nums":4,
                "available_nums":2,
                "total_memory":100000,
                "available_memory":50000
            },
            "storage":{
                "type":"SSD",
                "total_size":500000,
                "available_size":100000
            }
        },
        {
            "machine_id":"ddee71c469ed4876bcb40f92b0e48a60",
            "cpu":{
                "model":"AMD 7H12",
                "total_nums":256,
                "available_nums":10
            },
            "memory":{
                "total_memory":100000,
                "available_memory":50000
            },
            "gpu":{
                "model":"Nvidia GeForce RTX 2080 Ti:4352",
                "total_nums":4,
                "available_nums":2,
                "total_memory":100000,
                "available_memory":50000
            },
            "storage":{
                "type":"NVME",
                "total_size":500000,
                "available_size":100000
            }
        }
    ]
}
flin-nbai commented 1 year ago

If we go with using the current update API, the API expects these other fields to also be provided in the request (Name, MultiAddress, Autobid): https://github.com/lagrangedao/go-computing-provider/blob/4e74a6b838298467bd6d281508cf7f65c33213b3/computing/provider.go#L30

flyworker commented 1 year ago

I think for the CPU resource, the CP reports the hardware CPU, but we need to convert it to vCPUs when we do matching. https://cloud.google.com/architecture/resource-mappings-from-on-premises-hardware-to-gcp
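A rough sketch of that conversion in Go; the 2-threads-per-core factor is an assumption for typical hyperthreaded hardware (in the spirit of the linked mapping guide), not a value defined by this project:

```go
package main

import "fmt"

// threadsPerCore is an assumed SMT/hyperthreading factor; a real CP
// should query the hardware (e.g. via lscpu) instead of hard-coding it.
const threadsPerCore = 2

// toVCPUs converts a reported physical core count into a vCPU count
// for matching against task requirements.
func toVCPUs(physicalCores int) int {
	return physicalCores * threadsPerCore
}

func main() {
	fmt.Println(toVCPUs(96)) // 96 physical cores -> 192 vCPUs with SMT=2
}
```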

flyworker commented 1 year ago

If we go with using the current update API, the API expects these other fields to also be provided in the request (Name, MultiAddress, Autobid)

https://github.com/lagrangedao/go-computing-provider/blob/4e74a6b838298467bd6d281508cf7f65c33213b3/computing/provider.go#L30

I agree

flyworker commented 1 year ago

Yes, I think we can use the current update API: POST /cp.

The CPU model will be included, and every resource should include total and available values:

{
    "node_id":"04f532cea9ad16d450e9e2d3f94694aa3549c18c36fa00121ed70a4ab40d64f912d751584a0614d025982d8e00191d844d0371426d7304cc63d1f5abca288480fc",
    "region":"North America",
    "cluster_info":[
        {
            "machine_id":"1421c9f90e414825856f936fa5bbf649",
            "cpu":{
                "model":"AMD 7542",
                "total_nums":192,
                "available_nums":100
            },
            "memory":{
                "total_memory":100000,
                "available_memory":50000
            },
            "gpu":{
                "model":"Nvidia GeForce RTX 3080:8704",
                "total_nums":4,
                "available_nums":2,
                "total_memory":100000,
                "available_memory":50000
            },
            "storage":{
                "type":"SSD",
                "total_size":500000,
                "available_size":100000
            }
        },
        {
            "machine_id":"ddee71c469ed4876bcb40f92b0e48a60",
            "cpu":{
                "model":"AMD 7H12",
                "total_nums":256,
                "available_nums":10
            },
            "memory":{
                "total_memory":100000,
                "available_memory":50000
            },
            "gpu":{
                "model":"Nvidia GeForce RTX 2080 Ti:4352",
                "total_nums":4,
                "available_nums":2,
                "total_memory":100000,
                "available_memory":50000
            },
            "storage":{
                "type":"NVME",
                "total_size":500000,
                "available_size":100000
            }
        }
    ]
}

This GPU field should be a JSON list, since you can have 1 RTX 3070 + 1 RTX 3080.

flin-nbai commented 1 year ago

Based on the above comment, is the request format going to change? I need to change the server update API to handle the new format. Could you provide an updated request sample once it's ready, @sonic-chain @Normalnoise?

sonic-chain commented 1 year ago

New sample:

{
    "node_id":"04f532cea9ad16d450e9e2d3f94694aa3549c18c36fa00121ed70a4ab40d64f912d751584a0614d025982d8e00191d844d0371426d7304cc63d1f5abca288480fc",
    "region":"US-VA",
    "cluster_info":[
        {
            "machine_id":"1421c9f90e414825856f936fa5bbf649",
            "cpu":{
                "model":"AMD",
                "total_nums":192,
                "available_nums":174
            },
            "memory":{
                "total_memory":2151473061888,
                "available_memory":2140926484480
            },
            "gpu":{
                "total_nums":2,
                "available_nums":2,
                "total_memory":30018,
                "available_memory":20000,
                "details":[
                    {
                        "model":"NVIDIA-GeForce-RTX-3080",
                        "count":1
                    },
                    {
                        "model":"NVIDIA-GeForce-RTX-3090",
                        "count":1
                    }
                ]
            },
            "storage":{
                "type":"",
                "total_size":0,
                "available_size":0
            }
        }
    ]
}
flin-nbai commented 1 year ago

@sonic-chain If a machine has 2 GPUs, but only 1 is available, how would I know which GPU is the available one? For example:

 "gpu":{
            "available_nums":1,
            "details":[
                {
                    "model":"NVIDIA-GeForce-RTX-3080",
                    "count":1
                },
                {
                    "model":"NVIDIA-GeForce-RTX-3090",
                    "count":1
                }
            ]
        }

How do I know whether the 3080 or the 3090 is available? I think the request format needs to be updated to account for this.

Also, I think total and available memory should be reported per graphics card, not as a machine-wide total. If I want 20000 memory, but each GPU only has 10000 and I only ask for 1 GPU, then I have no way of knowing how to fulfill this request: the total available memory is 20000, but I don't know which graphics card has that much memory.

Overall, I think the GPU section of the request should look something like this:

"gpu":{
            "total_nums":3,
            "available_nums":2,
            "total_memory":30,
            "available_memory":20,
            "details":[
                {
                    "model":"NVIDIA-GeForce-RTX-3080",
                    "total_memory": 10,
                    "available_memory": 10
                },
                {
                    "model":"NVIDIA-GeForce-RTX-3090",
                    "total_memory": 10,
                    "available_memory": 10
                },
                {
                    "model":"NVIDIA-GeForce-RTX-3090",  // This GPU is busy and not available
                    "total_memory": 10,
                    "available_memory": 0
                }
            ]
        }

Does this make sense? What do you think?
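To illustrate why the per-card breakdown helps, here is a hypothetical sketch of how a scheduler could pick a card that fits a memory request. The GPUDetail type and pickGPU helper are illustrative, not part of the repo:

```go
package main

import "fmt"

// GPUDetail mirrors one entry of the proposed per-card "details" list.
type GPUDetail struct {
	Model           string `json:"model"`
	TotalMemory     int    `json:"total_memory"`
	AvailableMemory int    `json:"available_memory"`
}

// pickGPU returns the first card with enough free memory to serve the
// request, and false when no card qualifies.
func pickGPU(details []GPUDetail, needMemory int) (GPUDetail, bool) {
	for _, d := range details {
		if d.AvailableMemory >= needMemory {
			return d, true
		}
	}
	return GPUDetail{}, false
}

func main() {
	details := []GPUDetail{
		{Model: "NVIDIA-GeForce-RTX-3080", TotalMemory: 10, AvailableMemory: 10},
		{Model: "NVIDIA-GeForce-RTX-3090", TotalMemory: 10, AvailableMemory: 0},
	}
	// Only the 3080 has 10 units free, so it gets selected.
	if g, ok := pickGPU(details, 10); ok {
		fmt.Println("assign to", g.Model)
	}
}
```

With only machine-wide totals, this selection cannot be made deterministically; with per-card `available_memory`, it is a simple scan.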

sonic-chain commented 1 year ago

This looks workable, but I may need to adjust how I collect the data. The request format can match what you described above.

sonic-chain commented 1 year ago

Once the server-side API for receiving these requests is implemented, please notify me, and I will send it requests with the body parameters listed below.

{
    "node_id":"04629061ac5a97fa63b8325f115ba70ec6b733a569942a13b9b5aa454b36b0bf633a430ce6ce2468a7e3bed2b9beae5e685b599bf1a952893665500e9b405a2222",
    "region":"CA-QC",
    "cluster_info":[
        {
            "machine_id":"315ae8c203ec4b3aa9bf7dd9bd96cec0",
            "cpu":{
                "model":"AMD",
                "total_nums":96,
                "available_nums":76
            },
            "memory":{
                "total_memory":270372970496,
                "available_memory":234593460224
            },
            "gpu":{
                "total_nums":2,
                "available_nums":1,
                "total_memory":20036,
                "available_memory":10018,
                "details":[
                    {
                        "model":"NVIDIA-GeForce-RTX-3080",
                        "total_memory":10018,
                        "available_memory":10018
                    },
                    {
                        "model":"NVIDIA-GeForce-RTX-3080",
                        "total_memory":10018,
                        "available_memory":0
                    }
                ]
            },
            "storage":{
                "type":"",
                "total_size":0,
                "available_size":0
            }
        }
    ]
}
flyworker commented 1 year ago

Are those from k8s or bare metal?

flyworker commented 1 year ago

reference: https://lampaa.medium.com/monitoring-nvidia-gpus-using-rest-api-b747363cfe5

{
    "node_id":"04629061ac5a97fa63b8325f115ba70ec6b733a569942a13b9b5aa454b36b0bf633a430ce6ce2468a7e3bed2b9beae5e685b599bf1a952893665500e9b405a2222",
    "region":"CA-QC",
    "cluster_info":[
        {
            "machine_id":"315ae8c203ec4b3aa9bf7dd9bd96cec0",
            "cpu":{
                "model":"AMD",
                "total_nums":96,
                "available_nums":76
            },
            "memory":{
                "total":270372970496,
                "available":234593460224
            },
            "gpu":{
                "driver_version":"436.10",
                "cuda_version":"10",
                "attached_gpus":3,
                "details":[
                    {
                        "product_name":"NVIDIA-GeForce-RTX-3080",
                        "fb_memory_usage":[
                            {
                                "total":["4036 MiB"],
                                "used":["0 MiB"],
                                "free":["4036 MiB"]
                            }
                        ],
                        "bar1_memory_usage":[
                            {
                                "total":["128 MiB"],
                                "used":["2 MiB"],
                                "free":["126 MiB"]
                            }
                        ]
                    },
                    {
                        "product_name":"NVIDIA-GeForce-RTX-3070",
                        "fb_memory_usage":[
                            {
                                "total":["4036 MiB"],
                                "used":["0 MiB"],
                                "free":["4036 MiB"]
                            }
                        ],
                        "bar1_memory_usage":[
                            {
                                "total":["128 MiB"],
                                "used":["2 MiB"],
                                "free":["126 MiB"]
                            }
                        ]
                    }
                ]
            },
            "storage":{
                "type":"",
                "total_size":0,
                "available_size":0
            }
        }
    ]
}

You can also find an online-editor version here:

https://jsoneditoronline.org/#left=cloud.c63a628e585548feb54378ae941e65b9
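Since this nvidia-smi-style format reports memory as strings like "4036 MiB", the server side would need to parse them back into numbers before matching. A minimal sketch; the parseMiB helper is illustrative, not an existing function in the repo:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMiB parses strings like "4036 MiB" into an integer MiB count.
func parseMiB(s string) (int, error) {
	fields := strings.Fields(s) // e.g. ["4036", "MiB"]
	if len(fields) != 2 || fields[1] != "MiB" {
		return 0, fmt.Errorf("unexpected memory string %q", s)
	}
	return strconv.Atoi(fields[0])
}

func main() {
	n, err := parseMiB("4036 MiB")
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // 4036
}
```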

sonic-chain commented 1 year ago

The information given above was collected from k8s.

flyworker commented 1 year ago

https://developer.vmware.com/apis/vsphere-automation/latest/vcenter/api/vcenter/vm/vm/get/

$ curl -X GET \
    -H "Authorization: Bearer <token>" \
    -H "Content-Type: application/json" \
    https://api.k8s.example.com/apis/cluster.openai.com/v1/clusters/<cluster-name>/status

sonic-chain commented 1 year ago

new format:

{
    "node_id":"04629061ac5a97fa63b8325f115ba70ec6b733a569942a13b9b5aa454b36b0bf633a430ce6ce2468a7e3bed2b9beae5e685b599bf1a952893665500e9b405a2222",
    "region":"CA-QC",
    "cluster_info":[
        {
            "machine_id":"315ae8c203ec4b3aa9bf7dd9bd96cec0",
            "model":"AMD",
            "cpu":{
                "total":96,
                "used":76,
                "free":20
            },
            "vcpu":{
                "total":96,
                "used":10,
                "free":86
            },
            "memory":{
                "total":"2700 MiB",
                "used":"1000 MiB",
                "free":"234593 MiB"
            },
            "gpu":{
                "driver_version":"436.10",
                "cuda_version":"10",
                "attached_gpus":3,
                "details":[
                    {
                        "product_name":"NVIDIA-GeForce-RTX-3080",
                        "fb_memory_usage":{
                            "total":"4036 MiB",
                            "used":"0 MiB",
                            "free":"4036 MiB"
                        },
                        "bar1_memory_usage":{
                            "total":"128 MiB",
                            "used":"2 MiB",
                            "free":"126 MiB"
                        }
                    },
                    {
                        "product_name":"NVIDIA-GeForce-RTX-3070",
                        "fb_memory_usage":{
                            "total":"4036 MiB",
                            "used":"0 MiB",
                            "free":"4036 MiB"
                        },
                        "bar1_memory_usage":{
                            "total":"128 MiB",
                            "used":"2 MiB",
                            "free":"126 MiB"
                        }
                    }
                ]
            },
            "storage":{
                "type":"SSD",
                "total":"100 GiB",
                "used":"50 GiB",
                "free":"50 GiB"
            }
        }
    ]
}