lablup / backend.ai

Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.
https://www.backend.ai
GNU Lesser General Public License v3.0

Add physical hardware information for admins #154

Closed: achimnol closed this issue 3 years ago

achimnol commented 5 years ago

In the admin GraphQL queries for agents, let's include physical hardware information including:

In conjunction with lablup/backend.ai-manager#103, let's add the following to the agent:

All of the above methods should return an arbitrary JSON-serializable dict.
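To illustrate the "JSON-serializable dict" contract, here is a minimal sketch using only the Python standard library. The function names `get_hardware_info` and `ensure_json_serializable` and the `status`/`metadata` field names are hypothetical and chosen for illustration; they are not the agent's actual methods or schema.

```python
import json
import os
import platform
from typing import Any, Dict


def get_hardware_info() -> Dict[str, Any]:
    """Hypothetical agent method: report static details about the host."""
    return {
        "status": "healthy",  # hypothetical field name
        "metadata": {
            "machine": platform.machine(),  # CPU architecture string
            "system": platform.system(),    # OS name
            "cpu_count": os.cpu_count(),    # may be None on some platforms
        },
    }


def ensure_json_serializable(info: Dict[str, Any]) -> str:
    """Encode the dict as JSON; raises TypeError if a value is not serializable."""
    return json.dumps(info)
```

Restricting values to strings, numbers, booleans, None, lists, and dicts keeps the result safe to pass through an RPC layer and a GraphQL JSON field without custom encoders.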

achimnol commented 5 years ago

Live stats and plugin-specific extra information are now implemented via lablup/backend.ai-agent#109 and related updates to the manager code. Let's elaborate on the hardware part a bit more, for instance to include specific model names.

achimnol commented 5 years ago

Commit 0b135ec79 (#154) partially resolves this by storing additional "attached_device" information in the kernels table, so that admins can inspect which GPU models are used by each kernel session.

However, this issue stays open because it targets the "current" information about the physical devices of agents, whereas commit 0b135ec79 stores historical device usage information for kernel sessions.

achimnol commented 5 years ago

In the context of the "H" project, this is now resolved.

achimnol commented 3 years ago

This is now implemented as the gather_hwinfo() RPC API of agents.
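As a rough illustration of what such an RPC could do, the sketch below collects hardware information from several plugins concurrently and reports a failing plugin as unavailable rather than failing the whole call. It is not the actual Backend.AI implementation: `FakeAcceleratorPlugin`, `get_hwinfo`, `gather_hwinfo_sketch`, and the `status`/`status_info`/`metadata` fields are all assumptions made for this example.

```python
import asyncio
from typing import Any, Dict, Mapping


class FakeAcceleratorPlugin:
    """Hypothetical stand-in for an accelerator plugin."""

    def __init__(self, model: str) -> None:
        self.model = model

    async def get_hwinfo(self) -> Dict[str, Any]:
        # A real plugin would query the device driver; this returns fixed data.
        return {"status": "healthy", "metadata": {"model": self.model}}


async def gather_hwinfo_sketch(
    plugins: Mapping[str, FakeAcceleratorPlugin],
) -> Dict[str, Any]:
    """Query every plugin concurrently and build one JSON-serializable report."""
    keys = list(plugins)
    results = await asyncio.gather(
        *(plugins[key].get_hwinfo() for key in keys),
        return_exceptions=True,  # one failing plugin must not break the report
    )
    report: Dict[str, Any] = {}
    for key, result in zip(keys, results):
        if isinstance(result, BaseException):
            report[key] = {
                "status": "unavailable",
                "status_info": repr(result),
                "metadata": {},
            }
        else:
            report[key] = result
    return report
```

Calling `asyncio.run(gather_hwinfo_sketch({"cuda": FakeAcceleratorPlugin("Example GPU")}))` returns a dict keyed by plugin name, which reflects the "current" device state rather than the historical per-session records discussed earlier in this thread.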