feat: report CP's resource usage

flin-nbai commented 1 year ago

We're creating a dashboard that will list all providers and information about them. For each provider, we need the following new metrics:

Their current CPU usage
Their current memory usage
Their total available memory (not to be confused with total remaining)
Current storage amount being used
Total available storage (not to be confused with total remaining)

To get this information in near real time, the CP should expose an endpoint that the server can query for the latest resource usage.

If possible, report usage for tasks only (the CPU, memory, and storage used by tasks, not by the overall system); if that is too tricky, overall system usage is fine.
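
The distinction between total capacity, current usage, and remaining capacity can be sketched as follows. This is a minimal Python illustration, not the CP's actual code: `ResourceUsage` and `storage_usage` are hypothetical names, and only storage is sampled because the standard library exposes it directly through `shutil.disk_usage`. It measures overall system usage (the fallback case), not per-task usage.

```python
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceUsage:
    """Hypothetical container for one resource; all values are in bytes."""
    total: int  # total capacity ("total available" in the list above)
    used: int   # amount currently in use

    @property
    def remaining(self) -> int:
        # Remaining capacity is derived from total and used, so the two
        # figures cannot be mixed up when they are reported.
        return self.total - self.used


def storage_usage(path: str = ".") -> ResourceUsage:
    """Sample overall (system-wide, not per-task) disk usage for `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return ResourceUsage(total=usage.total, used=usage.used)


if __name__ == "__main__":
    print(storage_usage())
```

Reporting `total` and `used` and deriving `remaining` keeps the dashboard's "total available" figure separate from "total remaining".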

sonic-chain commented 1 year ago

To report the CP's resources, the CP should call the server's API; the server should not call the CP's API to retrieve them. Example request arguments for the API are as follows:

{
    "node_id":"04f532cea9ad16d450e9e2d3f94694aa3549c18c36fa00121ed70a4ab40d64f912d751584a0614d025982d8e00191d844d0371426d7304cc63d1f5abca288480fc",
    "region":"US-VA",
    "cluster_info":[
        {
            "machine_id":"ddee71c469ed4876bcb40f92b0e48a60",
            "cpu":{
                "model":"",
                "total_nums":192,
                "available_nums":192
            },
            "memory":{
                "total_memory":2151473061888,
                "available_memory":2151473061888
            },
            "gpu":[
                {
                    "model":"",
                    "total_nums":0,
                    "available_nums":0,
                    "total_memory":0,
                    "available_memory":0
                }
            ],
            "storage":{
                "type":"",
                "total_size":0,
                "available_size":0
            }
        },
        {
            "machine_id":"1421c9f90e414825856f936fa5bbf649",
            "cpu":{
                "model":"AMD",
                "total_nums":192,
                "available_nums":174
            },
            "memory":{
                "total_memory":2151473061888,
                "available_memory":2140926484480
            },
            "gpu":[
                {
                    "model":"NVIDIA-GeForce-RTX-3080",
                    "total_nums":1,
                    "available_nums":0,
                    "total_memory":10018,
                    "available_memory":0
                },
                {
                    "model":"NVIDIA-GeForce-RTX-3090",
                    "total_nums":1,
                    "available_nums":0,
                    "total_memory":10018,
                    "available_memory":0
                }
            ],
            "storage":{
                "type":"",
                "total_size":0,
                "available_size":0
            }
        }
    ]
}
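
As a rough sketch of this push model, the CP could assemble a report in the shape above and POST it to the server, and the server could derive per-machine usage as total minus available. The code below is illustrative Python only: the endpoint URL, `build_report_request`, and `summarize_usage` are hypothetical and not taken from the project, the `node_id` is a placeholder, GPU entries are omitted for brevity, and the request is built but not sent.

```python
import json
import urllib.request

# Hypothetical endpoint; the real server URL and path are not given in this thread.
REPORT_URL = "https://server.example/api/cp/resources"


def build_report_request(report: dict, url: str = REPORT_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the CP's report."""
    body = json.dumps(report).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_usage(report: dict) -> list[dict]:
    """Derive used CPU, memory, and storage per machine as total minus available."""
    summary = []
    for machine in report["cluster_info"]:
        cpu, mem, disk = machine["cpu"], machine["memory"], machine["storage"]
        summary.append({
            "machine_id": machine["machine_id"],
            "cpu_used": cpu["total_nums"] - cpu["available_nums"],
            "memory_used": mem["total_memory"] - mem["available_memory"],
            "storage_used": disk["total_size"] - disk["available_size"],
        })
    return summary


if __name__ == "__main__":
    # Values copied from the second machine in the example request.
    report = {
        "node_id": "<node_id>",
        "region": "US-VA",
        "cluster_info": [{
            "machine_id": "1421c9f90e414825856f936fa5bbf649",
            "cpu": {"model": "AMD", "total_nums": 192, "available_nums": 174},
            "memory": {"total_memory": 2151473061888, "available_memory": 2140926484480},
            "gpu": [],
            "storage": {"type": "", "total_size": 0, "available_size": 0},
        }],
    }
    print(summarize_usage(report))
    # urllib.request.urlopen(build_report_request(report)) would send the report.
```

Because the payload carries both totals and available amounts, the dashboard's "current usage" figures can be computed on the server side without extra fields.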

flin-nbai commented 1 year ago

Gotcha. This issue is not needed, then.

lagrangedao / go-computing-provider

feat: report CP's resource usage #8