intel / xpumanager

MIT License

Sampling interval option for "xpumd" #49

Open eero-t opened 1 year ago

eero-t commented 1 year ago

Currently, the "xpumd" internal sampling interval can be set only with the external "xpumcli agentset -t" command.

While it's nice to be able to change that at run-time, it should also be possible to set the interval directly from the "xpumd" command line.

In some situations, using an external utility can be either awkward or a potential security issue, compared to just restarting the "xpumd" container with a new sampling interval option value.

The currently supported set of sampling intervals is also very limited:

# xpumcli agentset -t 5000
--time: 5000 not in {100,200,500,1000}
Run with --help for more information.

IMHO it would be better to allow any value, and return an error only when counters for the selected metrics can overflow with that interval.
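The overflow-based rule suggested above could look roughly like the following sketch. It is not xpumd code: the function names, the counter widths, and the maximum increment rates are all hypothetical values chosen for illustration. It assumes wrap-around counters, where a delta between two samples is unambiguous only if the counter advances by less than its full range between them.

```python
def max_safe_interval_ms(counter_bits: int, max_rate_per_s: float) -> float:
    """Interval (ms) at which a wrap-around counter would advance by its full range.

    Sampling at or beyond this interval makes deltas ambiguous, because the
    counter may have wrapped more than once between two reads.
    """
    return (2 ** counter_bits) / max_rate_per_s * 1000.0


def unsafe_counters(interval_ms: float, counters: dict) -> list:
    """Return names of counters that could wrap more than once in interval_ms.

    `counters` maps a (hypothetical) metric name to a tuple of
    (counter width in bits, worst-case increments per second).
    """
    return [
        name
        for name, (bits, rate) in counters.items()
        if interval_ms >= max_safe_interval_ms(bits, rate)
    ]


# Hypothetical metrics; real widths and rates would come from the HW docs.
example = {
    "fast_32bit_counter": (32, 1e9),
    "fast_64bit_counter": (64, 1e9),
}

for interval in (1000, 5000):
    bad = unsafe_counters(interval, example)
    print(interval, "ms ->", "ok" if not bad else "too long for " + ", ".join(bad))
```

With these made-up numbers, a 5000 ms interval would be rejected only because of the 32-bit counter; if that metric were not selected, the same interval would be accepted.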

donzh commented 1 year ago

I think this should and could be improved.

taotod commented 1 year ago

Many thanks for Eero's suggestion. The XPU Manager dump telemetry feature depends on this sampling period. The default interval for dumping telemetry is 1 second, and it aggregates the raw data sampled by the XPU Manager daemon. The sampling period needs to evenly divide 1 second, which is why only a limited set of sampling periods is supported.
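To make the divisibility constraint concrete, here is a small sketch of it. The names are hypothetical, and the use of a plain average as the aggregation is an assumption, since the thread does not say how the dump feature combines raw samples.

```python
DUMP_INTERVAL_MS = 1000  # dump interval stated in the comment above


def divides_dump_interval(sample_ms: int, dump_ms: int = DUMP_INTERVAL_MS) -> bool:
    """True if a whole number of sampling periods fits into one dump interval."""
    return sample_ms > 0 and dump_ms % sample_ms == 0


def aggregate(samples: list, sample_ms: int, dump_ms: int = DUMP_INTERVAL_MS) -> list:
    """Average raw samples into one value per dump interval (assumed aggregation).

    Trailing samples that do not fill a complete dump interval are dropped.
    """
    if not divides_dump_interval(sample_ms, dump_ms):
        raise ValueError(f"{sample_ms} ms does not evenly divide {dump_ms} ms")
    per_bucket = dump_ms // sample_ms
    return [
        sum(samples[i:i + per_bucket]) / per_bucket
        for i in range(0, len(samples) - per_bucket + 1, per_bucket)
    ]


# The intervals accepted by "xpumcli agentset -t" all pass the check; 5000 does not.
for candidate in (100, 200, 500, 1000, 5000):
    print(candidate, divides_dump_interval(candidate))

# Four samples taken every 500 ms collapse into two 1-second values.
print(aggregate([1, 2, 3, 4], 500))
```

Note that this check alone would also admit other divisors of 1000, such as 250, so the four-value set is narrower than the divisibility rule strictly requires.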

eero-t commented 1 year ago

Are you saying that this is the interval for querying data from xpumd, not the query interval used by xpumd itself?

This ticket is about the HW query interval used by xpumd. Depending on which metrics one is interested in, and for what purpose, the user may want the xpumd internal HW query interval to be longer (to save power, e.g. when a cluster node is otherwise idle), or shorter (for more accurate data, when not saving power).

xpumd should return an error only if the requested internal HW query interval is too long for the selected metrics, i.e. they could overflow more than once.

taotod commented 10 months ago

Not sure if this is a real requirement. We will wait for real customers' feedback.

eero-t commented 10 months ago

> Not sure if this is a real requirement. We will wait for real customers' feedback.

I think most just shut down idle nodes completely, but some may be interested in active power management. XPUM might prevent GPUs (and one of the CPU cores) from suspending, depending on how many metrics it queries from them and how often.