lagrangedao / go-computing-provider

A golang implementation of computing provider
MIT License

feat: report the CP's resources to the Lagrange server via heartbeat #7

Open sonic-chain opened 1 year ago

sonic-chain commented 1 year ago

The CP heartbeat reports each node's resource information: CPU (number of cores, model), memory size, disk size, and GPU (model, quantity, memory size). Since a k8s cluster has one or more nodes, the report is a list containing each cluster node's resource information.

Normalnoise commented 1 year ago

The CP plans to report its resources to the server via heartbeat. Because one CP may have many nodes (physical machines), the resources will be a resource list, and every resource entry will include:

So the Lagrange server will need to store these resources in the DB and assign computing tasks to a suitable CP. @flin-nbai, please consider this when you design the server DB and bidding system.

flin-nbai commented 1 year ago

@Normalnoise If possible, could you provide a sample request that the CP would send to the server?

Normalnoise commented 1 year ago

@flin-nbai we will provide a sample here once the request is ready. @sonic-chain is working on it right now.

sonic-chain commented 1 year ago

@flin-nbai This is a request sample:

{
    "node_id":"04f532cea9ad16d450e9e2d3f94694aa3549c18c36fa00121ed70a4ab40d64f912d751584a0614d025982d8e00191d844d0371426d7304cc63d1f5abca288480fc",
    "region":"North America",
    "cluster_info":[
        {
            "machine_id":"1421c9f90e414825856f936fa5bbf649",
            "cpu":192,
            "memory":100000,
            "gpu":{
                "model":"Nvidia GeForce RTX 3080:8704",
                "size":4
            },
            "storage":{
                "type":"SSD",
                "size":500000
            }
        },
        {
            "machine_id":"ddee71c469ed4876bcb40f92b0e48a60",
            "cpu":128,
            "memory":100000,
            "gpu":{
                "model":"Nvidia GeForce RTX 2080 Ti:4352",
                "size":1
            },
            "storage":{
                "type":"NVME",
                "size":500000
            }
        }
    ]
}
flin-nbai commented 1 year ago

Some questions:

Normalnoise commented 1 year ago

@flin-nbai

Normalnoise commented 1 year ago

About the CP update API, let's have a discussion.

I think it's better to use a new API to update the CP resources.

flin-nbai commented 1 year ago

We can use the current update API: POST /cp. What do you think? Also, the request should contain a CPU model field.

Normalnoise commented 1 year ago

Yes, I think we can use the current update API: POST /cp.

The CPU model will be included, and every resource should include total and available values:

{
    "node_id":"04f532cea9ad16d450e9e2d3f94694aa3549c18c36fa00121ed70a4ab40d64f912d751584a0614d025982d8e00191d844d0371426d7304cc63d1f5abca288480fc",
    "region":"North America",
    "cluster_info":[
        {
            "machine_id":"1421c9f90e414825856f936fa5bbf649",
            "cpu":{
                "model":"AMD 7542",
                "total_nums":192,
                "available_nums":100
            },
            "memory":{
                "total_memory":100000,
                "available_memory":50000
            },
            "gpu":{
                "model":"Nvidia GeForce RTX 3080:8704",
                "total_nums":4,
                "available_nums":2,
                "total_memory":100000,
                "available_memory":50000
            },
            "storage":{
                "type":"SSD",
                "total_size":500000,
                "available_size":100000
            }
        },
        {
            "machine_id":"ddee71c469ed4876bcb40f92b0e48a60",
            "cpu":{
                "model":"AMD 7H12",
                "total_nums":256,
                "available_nums":10
            },
            "memory":{
                "total_memory":100000,
                "available_memory":50000
            },
            "gpu":{
                "model":"Nvidia GeForce RTX 2080 Ti:4352",
                "total_nums":4,
                "available_nums":2,
                "total_memory":100000,
                "available_memory":50000
            },
            "storage":{
                "type":"NVME",
                "total_size":500000,
                "available_size":100000
            }
        }
    ]
}
flin-nbai commented 1 year ago

If we go with using the current update API, the API expects these other fields to also be provided in the request (Name, MultiAddress, Autobid): https://github.com/lagrangedao/go-computing-provider/blob/4e74a6b838298467bd6d281508cf7f65c33213b3/computing/provider.go#L30

flyworker commented 1 year ago

I think for the CPU resource, the CP reports the hardware CPU, but we need to convert it to vCPUs when we do matching. https://cloud.google.com/architecture/resource-mappings-from-on-premises-hardware-to-gcp
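A rough sketch of that conversion in Go; the 2-threads-per-core factor is an assumption for typical hyperthreaded hardware (in the spirit of the linked mapping guide), not a value defined by this project:

```go
package main

import "fmt"

// threadsPerCore is an assumed SMT/hyperthreading factor; a real CP
// should query the hardware (e.g. via lscpu) instead of hard-coding it.
const threadsPerCore = 2

// toVCPUs converts a reported physical core count into a vCPU count
// for matching against task requirements.
func toVCPUs(physicalCores int) int {
	return physicalCores * threadsPerCore
}

func main() {
	fmt.Println(toVCPUs(96)) // 96 physical cores -> 192 vCPUs with SMT=2
}
```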

flyworker commented 1 year ago

If we go with using the current update API, the API expects these other fields to also be provided in the request (Name, MultiAddress, Autobid)

https://github.com/lagrangedao/go-computing-provider/blob/4e74a6b838298467bd6d281508cf7f65c33213b3/computing/provider.go#L30

I agree

flyworker commented 1 year ago

Yes, I think we can use the current update API: POST /cp.

The CPU model will be included, and every resource should include total and available values:

{
    "node_id":"04f532cea9ad16d450e9e2d3f94694aa3549c18c36fa00121ed70a4ab40d64f912d751584a0614d025982d8e00191d844d0371426d7304cc63d1f5abca288480fc",
    "region":"North America",
    "cluster_info":[
        {
            "machine_id":"1421c9f90e414825856f936fa5bbf649",
            "cpu":{
                "model":"AMD 7542",
                "total_nums":192,
                "available_nums":100
            },
            "memory":{
                "total_memory":100000,
                "available_memory":50000
            },
            "gpu":{
                "model":"Nvidia GeForce RTX 3080:8704",
                "total_nums":4,
                "available_nums":2,
                "total_memory":100000,
                "available_memory":50000
            },
            "storage":{
                "type":"SSD",
                "total_size":500000,
                "available_size":100000
            }
        },
        {
            "machine_id":"ddee71c469ed4876bcb40f92b0e48a60",
            "cpu":{
                "model":"AMD 7H12",
                "total_nums":256,
                "available_nums":10
            },
            "memory":{
                "total_memory":100000,
                "available_memory":50000
            },
            "gpu":{
                "model":"Nvidia GeForce RTX 2080 Ti:4352",
                "total_nums":4,
                "available_nums":2,
                "total_memory":100000,
                "available_memory":50000
            },
            "storage":{
                "type":"NVME",
                "total_size":500000,
                "available_size":100000
            }
        }
    ]
}

This GPU field should be a JSON list, since you can have 1 RTX 3070 + 1 RTX 3080.

flin-nbai commented 1 year ago

Based on the above comment, is the request format going to change? I need to change the server update API to handle the new format. Could you provide an updated request sample once it's ready, @sonic-chain @Normalnoise?

sonic-chain commented 1 year ago

New sample:

{
    "node_id":"04f532cea9ad16d450e9e2d3f94694aa3549c18c36fa00121ed70a4ab40d64f912d751584a0614d025982d8e00191d844d0371426d7304cc63d1f5abca288480fc",
    "region":"US-VA",
    "cluster_info":[
        {
            "machine_id":"1421c9f90e414825856f936fa5bbf649",
            "cpu":{
                "model":"AMD",
                "total_nums":192,
                "available_nums":174
            },
            "memory":{
                "total_memory":2151473061888,
                "available_memory":2140926484480
            },
            "gpu":{
                "total_nums":2,
                "available_nums":2,
                "total_memory":30018,
                "available_memory":20000,
                "details":[
                    {
                        "model":"NVIDIA-GeForce-RTX-3080",
                        "count":1
                    },
                    {
                        "model":"NVIDIA-GeForce-RTX-3090",
                        "count":1
                    }
                ]
            },
            "storage":{
                "type":"",
                "total_size":0,
                "available_size":0
            }
        }
    ]
}
flin-nbai commented 1 year ago

@sonic-chain If a machine has 2 GPUs, but only 1 is available, how would I know which GPU is the available one? For example:

 "gpu":{
            "available_nums":1,
            "details":[
                {
                    "model":"NVIDIA-GeForce-RTX-3080",
                    "count":1
                },
                {
                    "model":"NVIDIA-GeForce-RTX-3090",
                    "count":1
                }
            ]
        }

How do I know whether the 3080 or the 3090 is available? I think the request format needs to be updated to account for this.

Also, I think total and available memory should be reported per graphics card, not as a machine-wide total. If I want 20000 memory, but each GPU only has 10000 and I only ask for 1 GPU, then I have no way of knowing how to fulfill this request: the total available memory is 20000, but I don't know which graphics card has that much memory.

Overall, I think the GPU section of the request should look something like this:

"gpu":{
            "total_nums":3,
            "available_nums":2,
            "total_memory":30,
            "available_memory":20,
            "details":[
                {
                    "model":"NVIDIA-GeForce-RTX-3080",
                    "total_memory": 10,
                    "available_memory": 10
                },
                {
                    "model":"NVIDIA-GeForce-RTX-3090",
                    "total_memory": 10,
                    "available_memory": 10
                },
                {
                    "model":"NVIDIA-GeForce-RTX-3090",  // This GPU is busy and not available
                    "total_memory": 10,
                    "available_memory": 0
                }
            ]
        }

Does this make sense? What do you think?
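To illustrate why the per-card breakdown helps, here is a hypothetical sketch of how a scheduler could pick a card that fits a memory request. The GPUDetail type and pickGPU helper are illustrative, not part of the repo:

```go
package main

import "fmt"

// GPUDetail mirrors one entry of the proposed per-card "details" list.
type GPUDetail struct {
	Model           string `json:"model"`
	TotalMemory     int    `json:"total_memory"`
	AvailableMemory int    `json:"available_memory"`
}

// pickGPU returns the first card with enough free memory to serve the
// request, and false when no card qualifies.
func pickGPU(details []GPUDetail, needMemory int) (GPUDetail, bool) {
	for _, d := range details {
		if d.AvailableMemory >= needMemory {
			return d, true
		}
	}
	return GPUDetail{}, false
}

func main() {
	details := []GPUDetail{
		{Model: "NVIDIA-GeForce-RTX-3080", TotalMemory: 10, AvailableMemory: 10},
		{Model: "NVIDIA-GeForce-RTX-3090", TotalMemory: 10, AvailableMemory: 0},
	}
	// Only the 3080 has 10 units free, so it gets selected.
	if g, ok := pickGPU(details, 10); ok {
		fmt.Println("assign to", g.Model)
	}
}
```

With only machine-wide totals, this selection cannot be made deterministically; with per-card `available_memory`, it is a simple scan.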

sonic-chain commented 1 year ago

This looks workable, but I may need to adjust how I collect the data. The request format can match what you described above.

sonic-chain commented 1 year ago

Once the server-side API for receiving these requests is implemented, please notify me, and I will send it requests with the body parameters listed below.

{
    "node_id":"04629061ac5a97fa63b8325f115ba70ec6b733a569942a13b9b5aa454b36b0bf633a430ce6ce2468a7e3bed2b9beae5e685b599bf1a952893665500e9b405a2222",
    "region":"CA-QC",
    "cluster_info":[
        {
            "machine_id":"315ae8c203ec4b3aa9bf7dd9bd96cec0",
            "cpu":{
                "model":"AMD",
                "total_nums":96,
                "available_nums":76
            },
            "memory":{
                "total_memory":270372970496,
                "available_memory":234593460224
            },
            "gpu":{
                "total_nums":2,
                "available_nums":1,
                "total_memory":20036,
                "available_memory":10018,
                "details":[
                    {
                        "model":"NVIDIA-GeForce-RTX-3080",
                        "total_memory":10018,
                        "available_memory":10018
                    },
                    {
                        "model":"NVIDIA-GeForce-RTX-3080",
                        "total_memory":10018,
                        "available_memory":0
                    }
                ]
            },
            "storage":{
                "type":"",
                "total_size":0,
                "available_size":0
            }
        }
    ]
}
flyworker commented 1 year ago

Are those from k8s or bare metal?

flyworker commented 1 year ago

reference: https://lampaa.medium.com/monitoring-nvidia-gpus-using-rest-api-b747363cfe5

{
    "node_id":"04629061ac5a97fa63b8325f115ba70ec6b733a569942a13b9b5aa454b36b0bf633a430ce6ce2468a7e3bed2b9beae5e685b599bf1a952893665500e9b405a2222",
    "region":"CA-QC",
    "cluster_info":[
        {
            "machine_id":"315ae8c203ec4b3aa9bf7dd9bd96cec0",
            "cpu":{
                "model":"AMD",
                "total_nums":96,
                "available_nums":76
            },
            "memory":{
                "total":270372970496,
                "available":234593460224
            },
            "gpu":{
                "driver_version":"436.10",
                "cuda_version":"10",
                "attached_gpus":3,
                "details":[
                    {
                        "product_name":"NVIDIA-GeForce-RTX-3080",
                        "fb_memory_usage":[
                            {
                                "total":["4036 MiB"],
                                "used":["0 MiB"],
                                "free":["4036 MiB"]
                            }
                        ],
                        "bar1_memory_usage":[
                            {
                                "total":["128 MiB"],
                                "used":["2 MiB"],
                                "free":["126 MiB"]
                            }
                        ]
                    },
                    {
                        "product_name":"NVIDIA-GeForce-RTX-3070",
                        "fb_memory_usage":[
                            {
                                "total":["4036 MiB"],
                                "used":["0 MiB"],
                                "free":["4036 MiB"]
                            }
                        ],
                        "bar1_memory_usage":[
                            {
                                "total":["128 MiB"],
                                "used":["2 MiB"],
                                "free":["126 MiB"]
                            }
                        ]
                    }
                ]
            },
            "storage":{
                "type":"",
                "total_size":0,
                "available_size":0
            }
        }
    ]
}

You can also find an online-editor version here:

https://jsoneditoronline.org/#left=cloud.c63a628e585548feb54378ae941e65b9
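Since this nvidia-smi-style format reports memory as strings like "4036 MiB", the server side would need to parse them back into numbers before matching. A minimal sketch; the parseMiB helper is illustrative, not an existing function in the repo:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMiB parses strings like "4036 MiB" into an integer MiB count.
func parseMiB(s string) (int, error) {
	fields := strings.Fields(s) // e.g. ["4036", "MiB"]
	if len(fields) != 2 || fields[1] != "MiB" {
		return 0, fmt.Errorf("unexpected memory string %q", s)
	}
	return strconv.Atoi(fields[0])
}

func main() {
	n, err := parseMiB("4036 MiB")
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // 4036
}
```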

sonic-chain commented 1 year ago

The information given above was collected from k8s.

flyworker commented 1 year ago

https://developer.vmware.com/apis/vsphere-automation/latest/vcenter/api/vcenter/vm/vm/get/

$ curl -X GET \
    -H "Authorization: Bearer <token>" \
    -H "Content-Type: application/json" \
    https://api.k8s.example.com/apis/cluster.openai.com/v1/clusters/<cluster-name>/status

sonic-chain commented 1 year ago

new format:

{
    "node_id":"04629061ac5a97fa63b8325f115ba70ec6b733a569942a13b9b5aa454b36b0bf633a430ce6ce2468a7e3bed2b9beae5e685b599bf1a952893665500e9b405a2222",
    "region":"CA-QC",
    "cluster_info":[
        {
            "machine_id":"315ae8c203ec4b3aa9bf7dd9bd96cec0",
            "model":"AMD",
            "cpu":{
                "total":96,
                "used":76,
                "free":20
            },
            "vcpu":{
                "total":96,
                "used":10,
                "free":86
            },
            "memory":{
                "total":"2700 MiB",
                "used":"1000 MiB",
                "free":"234593 MiB"
            },
            "gpu":{
                "driver_version":"436.10",
                "cuda_version":"10",
                "attached_gpus":3,
                "details":[
                    {
                        "product_name":"NVIDIA-GeForce-RTX-3080",
                        "fb_memory_usage":{
                            "total":"4036 MiB",
                            "used":"0 MiB",
                            "free":"4036 MiB"
                        },
                        "bar1_memory_usage":{
                            "total":"128 MiB",
                            "used":"2 MiB",
                            "free":"126 MiB"
                        }
                    },
                    {
                        "product_name":"NVIDIA-GeForce-RTX-3070",
                        "fb_memory_usage":{
                            "total":"4036 MiB",
                            "used":"0 MiB",
                            "free":"4036 MiB"
                        },
                        "bar1_memory_usage":{
                            "total":"128 MiB",
                            "used":"2 MiB",
                            "free":"126 MiB"
                        }
                    }
                ]
            },
            "storage":{
                "type":"SSD",
                "total":"100 GiB",
                "used":"50 GiB",
                "free":"50 GiB"
            }
        }
    ]
}