lablup / backend.ai

Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.
https://www.backend.ai
GNU Lesser General Public License v3.0

Refactor the per-compute-plugin hardware information #1426

Open achimnol opened 1 year ago

achimnol commented 1 year ago

Currently, we have the following hardware metadata interfaces:

| Compute plugin method | What's included | How it's updated/delivered |
|---|---|---|
| `get_attached_devices()` | The device list with model names | As part of the return value of the kernel creation info |
| `extra_info()` | Device driver/runtime versions | Agent heartbeat (cached in DB) + GraphQL `Agent.compute_plugins` |
| `get_node_hwinfo()` | Per-slot-key health info and extra metadata | Manager-to-agent RPC call (not cached) + GraphQL `Agent.hardware_metadata` |

Let's clarify the purpose and role of these APIs as follows:
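For reference, the three methods in the table above could be sketched as an abstract plugin interface. This is a minimal illustration only: the method names and delivery paths come from the table, but the signatures, return shapes, and class name are assumptions, not the actual Backend.AI plugin API.

```python
# Hypothetical sketch of the per-compute-plugin hardware metadata methods.
# Method names follow the table above; signatures and return shapes are
# assumptions for illustration, not the real Backend.AI interface.
from abc import ABC, abstractmethod
from typing import Any, Mapping, Sequence


class AbstractComputePlugin(ABC):
    @abstractmethod
    async def get_attached_devices(self) -> Sequence[Mapping[str, Any]]:
        """Device list with model names, returned as part of the
        kernel creation info."""

    @abstractmethod
    async def extra_info(self) -> Mapping[str, str]:
        """Device driver/runtime versions, delivered via the agent
        heartbeat (cached in DB) and GraphQL Agent.compute_plugins."""

    @abstractmethod
    async def get_node_hwinfo(self) -> Mapping[str, Any]:
        """Per-slot-key health info and extra metadata, fetched via a
        manager-to-agent RPC call (not cached) and exposed through
        GraphQL Agent.hardware_metadata."""
```

Separating these three paths matters because they have different freshness and cost profiles: creation-time info is static per kernel, heartbeat data is cheap but cached, and the RPC-based health info is fresh but requires a live round-trip to the agent.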

Yaminyam commented 10 months ago

ref. https://github.com/lablup/backend.ai/pull/1793 Since this issue was not resolved, `mgr agent info` was modified to use `get_node_hwinfo()`. Once this issue is resolved, add an option to `ping` to show hwinfo, or add a separate command for it.