achimnol closed this issue 3 years ago
Live stats and plugin-specific extra information are now implemented via lablup/backend.ai-agent#109 and related updates to the manager code. Let's elaborate on the hardware part a little more, for instance to include specific model names.
The commit 0b135ec79 (#154) partially resolves this by storing additional "attached_device" information in the kernels table so that admins can inspect which GPU models are used by each kernel session.
However, this issue is kept open because it targets providing the "current" information about the physical devices of agents, while the commit 0b135ec79 stores the historical device usage information of kernel sessions.
In the context of "H" project, this is now resolved.
This is now implemented as the gather_hwinfo() RPC API of agents.
In the admin GraphQL queries for agents, let's include physical hardware information including:
nvidia-smi
In conjunction with lablup/backend.ai-manager#103, let's add the following to the agent:
collect_live_stats(): abstract method added to AbstractComputeDevice
get_physical_info(): abstract method added to AbstractComputeDevice
collect_live_stats_summary(): abstract method added to AbstractComputePlugin
get_physical_info_summary(): abstract method added to AbstractComputePlugin
All of the above methods should return an arbitrary JSON-serializable dict.
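The proposed interface could be sketched roughly as follows. This is only an illustration of the four method names and the "JSON-serializable dict" contract. The real AbstractComputeDevice and AbstractComputePlugin classes in the agent have other members that are omitted here. Whether the methods should be coroutines is not decided in this issue, so plain synchronous methods are assumed. DummyDevice, DummyPlugin, and all dict keys are hypothetical.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable


class AbstractComputeDevice(ABC):
    """Per-device interface (sketch only; not the agent's actual class)."""

    @abstractmethod
    def collect_live_stats(self) -> Dict[str, Any]:
        """Return the current usage metrics of this device."""

    @abstractmethod
    def get_physical_info(self) -> Dict[str, Any]:
        """Return static hardware details such as the model name."""


class AbstractComputePlugin(ABC):
    """Per-plugin interface (sketch only; not the agent's actual class)."""

    @abstractmethod
    def collect_live_stats_summary(self) -> Dict[str, Any]:
        """Return usage metrics aggregated over the plugin's devices."""

    @abstractmethod
    def get_physical_info_summary(self) -> Dict[str, Any]:
        """Return hardware details aggregated over the plugin's devices."""


class DummyDevice(AbstractComputeDevice):
    # Hypothetical device that reports fixed values instead of probing hardware.
    def __init__(self, device_id: str, model: str, mem_total: int) -> None:
        self.device_id = device_id
        self.model = model
        self.mem_total = mem_total

    def collect_live_stats(self) -> Dict[str, Any]:
        return {"device_id": self.device_id, "mem_used": 0, "util_percent": 0.0}

    def get_physical_info(self) -> Dict[str, Any]:
        return {
            "device_id": self.device_id,
            "model": self.model,
            "mem_total": self.mem_total,
        }


class DummyPlugin(AbstractComputePlugin):
    # Hypothetical plugin that summarizes a fixed set of devices.
    def __init__(self, devices: Iterable[AbstractComputeDevice]) -> None:
        self.devices = list(devices)

    def collect_live_stats_summary(self) -> Dict[str, Any]:
        stats = [d.collect_live_stats() for d in self.devices]
        return {
            "device_count": len(stats),
            "mem_used": sum(s["mem_used"] for s in stats),
        }

    def get_physical_info_summary(self) -> Dict[str, Any]:
        infos = [d.get_physical_info() for d in self.devices]
        return {
            "device_count": len(infos),
            "models": sorted({i["model"] for i in infos}),
        }


if __name__ == "__main__":
    plugin = DummyPlugin([DummyDevice("0", "model-a", 1024)])
    # json.dumps() raises TypeError if a value is not JSON-serializable,
    # so it doubles as a cheap check of the return-value contract.
    print(json.dumps(plugin.get_physical_info_summary()))
    print(json.dumps(plugin.collect_live_stats_summary()))
```

Keeping the return type as a free-form dict lets each plugin expose vendor-specific fields without changes to the manager schema. The trade-off is that consumers cannot rely on any particular key being present.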