intel / xpumanager

Documentation mismatches in regards to what metrics XPUM supports #20

Closed eero-t closed 7 months ago

eero-t commented 1 year ago

I compared the following documents:

And which metrics they list XPU Manager as providing. The CSV file info especially seems very out of date, but the install guide also lists e.g. frequency throttle ratio (as not supported by the current L0 backend), while the user guide does not. IMHO it would be better to have the supported metrics list in a single place, and to refer to that from the other documents.

taotod commented 1 year ago

Hi @eero-t, the telemetry metrics in the installation guide and the user guide are defined separately. For example, throttle ratio is described in the installation guide so that end users can enable throttle ratio collection in the XPU Manager daemon. However, we think it is not useful for CLI end users, so the CLI does not provide it. As a result, throttle ratio is not listed in the CLI user guide.

eero-t commented 1 year ago

Ok, fair enough. What about the CSV list?

Which of the documents should list all the supported metrics? And could that be linked from the other places mentioning metrics (see also #24)?

taotod commented 7 months ago

Added to the CLI user guide:
https://github.com/intel/xpumanager/blob/master/doc/CLI_user_guide.md#the-statistics-supported-by-intel-data-center-gpus
https://github.com/intel/xpumanager/blob/master/doc/smi_user_guide.md#the-statistics-supported-by-intel-data-center-gpus

eero-t commented 7 months ago

Those docs do not list error counters as supported for Flex, but RAS works fine for me on them (with i915 backport kernel, as long as Sysman is run as root with PERFMON capability):

# zello_sysman --ras
setting environment variable ZES_ENABLE_SYSMAN=1
Device Name = Intel(R) Data Center GPU Flex 170
Device Name = Intel(R) Data Center GPU Flex 170

 ----  Ras tests ---- 
rasProperties.type = 0
Number of correctable accelerator engine resets attempted by the driver = 0
Number of correctable errors that have occurred in caches = 0
Number of correctable programming errors that have occurred  = 0
Number of correctable driver errors that have occurred  = 0
Number of correctable compute errors that have occurred  = 0
Number of correctable non compute errors that have occurred  = 0
Number of correctable display errors that have occurred  = 0
Setting Total threshold = 14
...

?
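
To compare output like the above against the documented metric lists, the counters can be collected into a dictionary. This is only a minimal Python sketch, not part of XPU Manager or Level Zero: the function name `parse_ras_counters` and the regular expression are assumptions based solely on the output lines shown above, and the "uncorrectable" variant is a guess, since only correctable counters appear in that output.

```python
import re

# Assumed line shape, taken from the zello_sysman output quoted above:
#   "Number of correctable compute errors that have occurred  = 0"
# The "uncorrectable" alternative is an assumption; it does not appear above.
_COUNTER_RE = re.compile(
    r"^Number of (correctable|uncorrectable) (.+?)\s*=\s*(\d+)\s*$"
)


def parse_ras_counters(text):
    """Return {(kind, description): count} for every RAS counter line in text."""
    counters = {}
    for line in text.splitlines():
        match = _COUNTER_RE.match(line.strip())
        if match:
            kind, description, value = match.groups()
            counters[(kind, description)] = int(value)
    return counters


# Sample copied from the output quoted above (truncated the same way).
SAMPLE = """\
rasProperties.type = 0
Number of correctable accelerator engine resets attempted by the driver = 0
Number of correctable errors that have occurred in caches = 0
Number of correctable programming errors that have occurred  = 0
Number of correctable driver errors that have occurred  = 0
Number of correctable compute errors that have occurred  = 0
Number of correctable non compute errors that have occurred  = 0
Number of correctable display errors that have occurred  = 0
Setting Total threshold = 14
"""

if __name__ == "__main__":
    for (kind, description), count in parse_ras_counters(SAMPLE).items():
        print(f"{kind}: {description} = {count}")
```

Lines that do not start with "Number of", such as the threshold line, are skipped, so the result holds only the error counters.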

I would also suggest listing all metrics common to both platforms before the platform-specific ones, and/or grouping related metrics together in these lists, e.g.:

And: