Open eero-t opened 1 year ago
I think this should and could be improved.
Many thanks for Eero's suggestion. The XPU Manager dump telemetry feature depends on this sampling period. The default interval for dumping telemetry is 1 second, and each dump aggregates the raw data sampled by the XPU Manager daemon. The sampling period needs to evenly divide 1 second, which is why only a limited set of sampling periods is supported.
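To illustrate the constraint described above, here is a minimal Python sketch that lists the periods that evenly divide a 1-second dump interval. The function name and the 100 ms lower bound are assumptions made for the example. This is not XPU Manager code, and the result is not necessarily the set of periods XPU Manager actually accepts.

```python
def valid_sampling_periods_ms(dump_interval_ms=1000, minimum_ms=100):
    """Return every period (in ms) that evenly divides the dump interval.

    minimum_ms is a hypothetical lower bound chosen for this example only.
    """
    return [
        period
        for period in range(minimum_ms, dump_interval_ms + 1)
        if dump_interval_ms % period == 0
    ]


if __name__ == "__main__":
    print(valid_sampling_periods_ms())
```

Under this rule a period such as 300 ms would be rejected, because 1000 is not a multiple of 300.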
Are you saying that this is the interval for querying data from `xpumd`, not the query interval used by `xpumd` itself?
This ticket is about the HW query interval used by `xpumd`. Depending on which metrics one is interested in, and for what purpose, the user may want the `xpumd` internal HW query interval to be longer (to save power, e.g. when the cluster node is otherwise idle) or shorter (for more accurate data, when not saving power).
`xpumd` should return an error only if the requested internal HW query interval is too long for the selected metrics, i.e. their counters could overflow more than once between queries.
Not sure whether this is a real requirement. We will wait for feedback from real customers.
I think most just shut down idle nodes completely, but some may be interested in active power management. XPUM might prevent GPUs (and one of the CPU cores) from suspending, depending on how many metrics it queries from them and how often.
Currently the "xpumd" internal sampling interval can be set only with the external "xpumcli agentset -t" command.
While it's nice to be able to change that at run-time, it should also be possible to set the interval directly from the "xpumd" command line.
In some situations, using an external utility can be either awkward or a potential security issue, compared to just restarting the "xpumd" container with a new sampling interval option value.
The currently supported set of sampling intervals is also very limited:
IMHO it would be better to allow any value, and return an error only when the counters for the selected metrics can overflow with that interval.
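The overflow-based check proposed here could be sketched roughly as follows in Python. This is only an illustration: the function names, the metric names, the counter widths and the maximum rates are all hypothetical, not actual XPUM or hardware values. It assumes wrapping counters where a single wrap between two reads is still recoverable, so an interval is rejected only when a counter could wrap more than once.

```python
def max_safe_interval_s(counter_bits, max_rate_per_s):
    """Time it takes a counter of the given width to advance by its full
    range at the given maximum rate. Intervals shorter than this allow at
    most one wrap between two consecutive reads."""
    return (2 ** counter_bits) / max_rate_per_s


def metrics_that_may_overflow(interval_s, metrics):
    """Return the names of metrics whose counters could wrap more than once
    within interval_s. metrics maps name -> (counter_bits, max_rate_per_s)."""
    return [
        name
        for name, (bits, rate) in metrics.items()
        if interval_s >= max_safe_interval_s(bits, rate)
    ]


# Hypothetical example values, NOT real XPUM metrics:
# a 32-bit energy counter in microjoules at 300 W, and a 64-bit tick counter.
EXAMPLE_METRICS = {
    "energy_uj_32bit": (32, 300e6),
    "engine_ticks_64bit": (64, 1e9),
}

if __name__ == "__main__":
    for interval in (10.0, 20.0):
        risky = metrics_that_may_overflow(interval, EXAMPLE_METRICS)
        print(interval, "->", risky if risky else "ok")
```

With these made-up numbers the 32-bit counter covers its full range in about 14.3 seconds, so a 10-second interval would be accepted and a 20-second interval would be rejected for that metric only.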