Open sonic-chain opened 1 year ago
The CP plans to report its resources to the server via heartbeat. Because one CP may have many nodes (physical machines), the payload will be a resource list, and every resource entry will include:
So the Lagrange server will need to store these resources in the DB and assign computing tasks to a suitable CP. @flin-nbai when you design the server DB and bidding system, please take this into account.
@Normalnoise If possible, could you provide a sample request that the CP would send to the server?
@flin-nbai we will provide a sample here once the request is ready. @sonic-chain is working on it right now.
@flin-nbai This is a request sample:
{
  "node_id":"04f532cea9ad16d450e9e2d3f94694aa3549c18c36fa00121ed70a4ab40d64f912d751584a0614d025982d8e00191d844d0371426d7304cc63d1f5abca288480fc",
  "region":"North America",
  "cluster_info":[
    {
      "machine_id":"1421c9f90e414825856f936fa5bbf649",
      "cpu":192,
      "memory":100000,
      "gpu":{
        "model":"Nvidia GeForce RTX 3080:8704",
        "size":4
      },
      "storage":{
        "type":"SSD",
        "size":500000
      }
    },
    {
      "machine_id":"ddee71c469ed4876bcb40f92b0e48a60",
      "cpu":128,
      "memory":100000,
      "gpu":{
        "model":"Nvidia GeForce RTX 2080 Ti:4352",
        "size":1
      },
      "storage":{
        "type":"NVME",
        "size":500000
      }
    }
  ]
}
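For context, here is a minimal sketch of how the server side might parse this heartbeat payload. The class and function names are hypothetical, chosen only for illustration; the actual server models may differ.

```python
import json
from dataclasses import dataclass

# Hypothetical type for illustration; the real server models may differ.
@dataclass
class Machine:
    machine_id: str
    cpu: int          # core count
    memory: int       # bytes
    gpu_model: str
    gpu_count: int
    storage_type: str
    storage_size: int

def parse_heartbeat(payload: str):
    """Parse a CP heartbeat into (node_id, region, [Machine])."""
    data = json.loads(payload)
    machines = [
        Machine(
            machine_id=m["machine_id"],
            cpu=m["cpu"],
            memory=m["memory"],
            gpu_model=m["gpu"]["model"],
            gpu_count=m["gpu"]["size"],
            storage_type=m["storage"]["type"],
            storage_size=m["storage"]["size"],
        )
        for m in data["cluster_info"]
    ]
    return data["node_id"], data["region"], machines
```

The parsed `Machine` rows map naturally onto a DB table keyed by `(node_id, machine_id)` for the bidding system.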
Some questions:
@flin-nbai
The memory unit is bytes, and the GPU `size` field is the count of GPUs. (I think we will have some updates.)
About the CP update API, let's have a discussion. I think it's better to use a new API to update the CP resource, but we could also reuse the current update API (POST /cp). What do you think?
Also, the request should contain a CPU model field.
Yes, I think we can use the current update API: POST /cp
The CPU model will be included, and every resource will include both total and available values:
{
  "node_id":"04f532cea9ad16d450e9e2d3f94694aa3549c18c36fa00121ed70a4ab40d64f912d751584a0614d025982d8e00191d844d0371426d7304cc63d1f5abca288480fc",
  "region":"North America",
  "cluster_info":[
    {
      "machine_id":"1421c9f90e414825856f936fa5bbf649",
      "cpu":{
        "model":"AMD 7542",
        "total_nums":192,
        "available_nums":100
      },
      "memory":{
        "total_memory":100000,
        "available_memory":50000
      },
      "gpu":{
        "model":"Nvidia GeForce RTX 3080:8704",
        "total_nums":4,
        "available_nums":2,
        "total_memory":100000,
        "available_memory":50000
      },
      "storage":{
        "type":"SSD",
        "total_size":500000,
        "available_size":100000
      }
    },
    {
      "machine_id":"ddee71c469ed4876bcb40f92b0e48a60",
      "cpu":{
        "model":"AMD 7H12",
        "total_nums":256,
        "available_nums":10
      },
      "memory":{
        "total_memory":100000,
        "available_memory":50000
      },
      "gpu":{
        "model":"Nvidia GeForce RTX 2080 Ti:4352",
        "total_nums":4,
        "available_nums":2,
        "total_memory":100000,
        "available_memory":50000
      },
      "storage":{
        "type":"NVME",
        "total_size":500000,
        "available_size":100000
      }
    }
  ]
}
If we go with using the current update API, the API expects these other fields to also be provided in the request (Name, MultiAddress, Autobid): https://github.com/lagrangedao/go-computing-provider/blob/4e74a6b838298467bd6d281508cf7f65c33213b3/computing/provider.go#L30
I think for the CPU resource, the CP reports the hardware CPU, but we need to convert it to vCPUs when we do matching. https://cloud.google.com/architecture/resource-mappings-from-on-premises-hardware-to-gcp
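A sketch of the hardware-core to vCPU conversion mentioned here. The x2 SMT factor is an assumption based on the linked GCP mapping guide, where one hyperthread maps to one vCPU; actual hardware may have SMT disabled or a different thread count.

```python
def cores_to_vcpus(physical_cores: int, threads_per_core: int = 2) -> int:
    """Convert physical CPU cores to schedulable vCPUs.

    Assumes SMT is enabled with `threads_per_core` hardware threads
    per core (2 on most AMD EPYC / Intel Xeon parts), following the
    1 hyperthread = 1 vCPU convention used by cloud providers.
    """
    return physical_cores * threads_per_core
```

For example, a machine reporting 96 physical cores would be matched against jobs as 192 vCPUs under this assumption.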
If we go with using the current update API, the api expects these other fields to also be provided in the request (Name, MultiAddress, Autobid)
I agree
Yes, I think we can use the current update API: POST /cp
The CPU model will be included, and every resource will include both total and available values:
The GPU field in the sample above should be a JSON list, since a machine can have 1 RTX 3070 + 1 RTX 3080.
Based on the above comment, is the request format going to change? I need to change the server update API to handle the new format. Could you provide an updated request sample once it's ready @sonic-chain @Normalnoise?
New sample:
{
  "node_id":"04f532cea9ad16d450e9e2d3f94694aa3549c18c36fa00121ed70a4ab40d64f912d751584a0614d025982d8e00191d844d0371426d7304cc63d1f5abca288480fc",
  "region":"US-VA",
  "cluster_info":[
    {
      "machine_id":"1421c9f90e414825856f936fa5bbf649",
      "cpu":{
        "model":"AMD",
        "total_nums":192,
        "available_nums":174
      },
      "memory":{
        "total_memory":2151473061888,
        "available_memory":2140926484480
      },
      "gpu":{
        "total_nums":2,
        "available_nums":2,
        "total_memory":30018,
        "available_memory":20000,
        "details":[
          {
            "model":"NVIDIA-GeForce-RTX-3080",
            "count":1
          },
          {
            "model":"NVIDIA-GeForce-RTX-3090",
            "count":1
          }
        ]
      },
      "storage":{
        "type":"",
        "total_size":0,
        "available_size":0
      }
    }
  ]
}
@sonic-chain If a machine has 2 GPUs, but only 1 is available, how would I know which GPU is the available one? For example:
"gpu":{
  "available_nums":1,
  "details":[
    {
      "model":"NVIDIA-GeForce-RTX-3080",
      "count":1
    },
    {
      "model":"NVIDIA-GeForce-RTX-3090",
      "count":1
    }
  ]
}
How do I know whether the 3080 or the 3090 is available? I think the request format needs to be updated to account for this.
Also, I think total and available memory should be reported per graphics card, not as a machine-wide total. If I want 20000 memory, but each GPU only has 10000 and I ask for only 1 GPU, I have no way to fulfill the request: the total available memory is 20000, but I don't know which graphics card has that much memory.
Overall, I think the GPU section of the request should look something like this:
"gpu":{
  "total_nums":3,
  "available_nums":2,
  "total_memory":30,
  "available_memory":20,
  "details":[
    {
      "model":"NVIDIA-GeForce-RTX-3080",
      "total_memory": 10,
      "available_memory": 10
    },
    {
      "model":"NVIDIA-GeForce-RTX-3090",
      "total_memory": 10,
      "available_memory": 10
    },
    {
      "model":"NVIDIA-GeForce-RTX-3090", // This GPU is busy and not available
      "total_memory": 10,
      "available_memory": 0
    }
  ]
}
Does this make sense? What do you think?
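To illustrate why the per-card fields help, here is a sketch of how the server could pick a concrete card for a job. The helper is hypothetical and assumes the per-card format proposed above.

```python
from typing import Optional

def pick_gpu(gpu_info: dict, model: str, min_memory: int) -> Optional[int]:
    """Return the index of the first card in `details` matching `model`
    with at least `min_memory` available, or None if no card qualifies.

    With only machine-level totals this choice is ambiguous; the
    per-card `available_memory` field makes it decidable.
    """
    for i, card in enumerate(gpu_info.get("details", [])):
        if card["model"] == model and card["available_memory"] >= min_memory:
            return i
    return None
```

With the three-card example above, asking for a 3090 with 10 units of memory would select the second card, while the busy third card (0 available) is skipped.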
This looks workable, but I may need to adjust how I collect the data; the request format can follow what you described above.
When the server-side API for receiving these requests is implemented, please notify me, and I will send it using the request body below.
{
  "node_id":"04629061ac5a97fa63b8325f115ba70ec6b733a569942a13b9b5aa454b36b0bf633a430ce6ce2468a7e3bed2b9beae5e685b599bf1a952893665500e9b405a2222",
  "region":"CA-QC",
  "cluster_info":[
    {
      "machine_id":"315ae8c203ec4b3aa9bf7dd9bd96cec0",
      "cpu":{
        "model":"AMD",
        "total_nums":96,
        "available_nums":76
      },
      "memory":{
        "total_memory":270372970496,
        "available_memory":234593460224
      },
      "gpu":{
        "total_nums":2,
        "available_nums":1,
        "total_memory":20036,
        "available_memory":10018,
        "details":[
          {
            "model":"NVIDIA-GeForce-RTX-3080",
            "total_memory":10018,
            "available_memory":10018
          },
          {
            "model":"NVIDIA-GeForce-RTX-3080",
            "total_memory":10018,
            "available_memory":0
          }
        ]
      },
      "storage":{
        "type":"",
        "total_size":0,
        "available_size":0
      }
    }
  ]
}
are those from k8s or bare metal?
reference: https://lampaa.medium.com/monitoring-nvidia-gpus-using-rest-api-b747363cfe5
{
  "node_id":"04629061ac5a97fa63b8325f115ba70ec6b733a569942a13b9b5aa454b36b0bf633a430ce6ce2468a7e3bed2b9beae5e685b599bf1a952893665500e9b405a2222",
  "region":"CA-QC",
  "cluster_info":[
    {
      "machine_id":"315ae8c203ec4b3aa9bf7dd9bd96cec0",
      "cpu":{
        "model":"AMD",
        "total_nums":96,
        "available_nums":76
      },
      "memory":{
        "total":270372970496,
        "available":234593460224
      },
      "gpu":{
        "driver_version":"436.10",
        "cuda_version":"10",
        "attached_gpus":3,
        "details":[
          {
            "product_name":"NVIDIA-GeForce-RTX-3080",
            "fb_memory_usage":[
              {
                "total":["4036 MiB"],
                "used":["0 MiB"],
                "free":["4036 MiB"]
              }
            ],
            "bar1_memory_usage":[
              {
                "total":["128 MiB"],
                "used":["2 MiB"],
                "free":["126 MiB"]
              }
            ]
          },
          {
            "product_name":"NVIDIA-GeForce-RTX-3070",
            "fb_memory_usage":[
              {
                "total":["4036 MiB"],
                "used":["0 MiB"],
                "free":["4036 MiB"]
              }
            ],
            "bar1_memory_usage":[
              {
                "total":["128 MiB"],
                "used":["2 MiB"],
                "free":["126 MiB"]
              }
            ]
          }
        ]
      },
      "storage":{
        "type":"",
        "total_size":0,
        "available_size":0
      }
    }
  ]
}
You can also find an online-editor version here:
https://jsoneditoronline.org/#left=cloud.c63a628e585548feb54378ae941e65b9
The information given above was collected from k8s.
https://developer.vmware.com/apis/vsphere-automation/latest/vcenter/api/vcenter/vm/vm/get/
$ curl -X GET \
-H "Authorization: Bearer
new format:
{
  "node_id":"04629061ac5a97fa63b8325f115ba70ec6b733a569942a13b9b5aa454b36b0bf633a430ce6ce2468a7e3bed2b9beae5e685b599bf1a952893665500e9b405a2222",
  "region":"CA-QC",
  "cluster_info":[
    {
      "machine_id":"315ae8c203ec4b3aa9bf7dd9bd96cec0",
      "model":"AMD",
      "cpu":{
        "total":96,
        "used":76,
        "free":20
      },
      "vcpu":{
        "total":96,
        "used":10,
        "free":86
      },
      "memory":{
        "total":"2700 MiB",
        "used":"1000 MiB",
        "free":"234593 MiB"
      },
      "gpu":{
        "driver_version":"436.10",
        "cuda_version":"10",
        "attached_gpus":3,
        "details":[
          {
            "product_name":"NVIDIA-GeForce-RTX-3080",
            "fb_memory_usage":{
              "total":"4036 MiB",
              "used":"0 MiB",
              "free":"4036 MiB"
            },
            "bar1_memory_usage":{
              "total":"128 MiB",
              "used":"2 MiB",
              "free":"126 MiB"
            }
          },
          {
            "product_name":"NVIDIA-GeForce-RTX-3070",
            "fb_memory_usage":{
              "total":"4036 MiB",
              "used":"0 MiB",
              "free":"4036 MiB"
            },
            "bar1_memory_usage":{
              "total":"128 MiB",
              "used":"2 MiB",
              "free":"126 MiB"
            }
          }
        ]
      },
      "storage":{
        "type":"SSD",
        "total":"100 GiB",
        "used":"50 GiB",
        "free":"50 GiB"
      }
    }
  ]
}
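Since the new format reports sizes as human-readable strings like "4036 MiB", the server will need to normalize them before matching. A minimal sketch, assuming only binary units of the kind seen in the samples above:

```python
# Binary-unit multipliers; only the units appearing in the samples are assumed.
_UNITS = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

def parse_size(value: str) -> int:
    """Parse a size string like '4036 MiB' or '100 GiB' into bytes."""
    amount, unit = value.split()
    return int(amount) * _UNITS[unit]
```

Normalizing to bytes on ingestion keeps the DB comparable across fields that the heartbeat reports in mixed units.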
The CP heartbeat reports each node's resource information, including CPU (number of cores, model), memory size, disk size, and GPU (model, quantity, memory size). Since a k8s cluster has one or more nodes, the report is a list containing each cluster node's resource information.