Closed by eero-t 7 months ago
Hi @eero-t, the telemetry metrics in the installation guide and the user guide are defined separately. For example, throttle ratio is described in the installation guide so that end users can enable its collection in the XPU Manager daemon. However, we do not think it is useful for CLI end users, so the CLI does not provide it. As a result, throttle ratio is not documented in the CLI user guide.
Ok, fair enough. What about the CSV list?
Which of the documents should list all the supported metrics? And could that be linked from the other places mentioning metrics (see also #24)?
Those docs do not list error counters as supported for Flex, but RAS works fine for me on them (with i915 backport kernel, as long as Sysman is run as root with PERFMON capability):
# zello_sysman --ras
setting environment variable ZES_ENABLE_SYSMAN=1
Device Name = Intel(R) Data Center GPU Flex 170
Device Name = Intel(R) Data Center GPU Flex 170
---- Ras tests ----
rasProperties.type = 0
Number of correctable accelerator engine resets attempted by the driver = 0
Number of correctable errors that have occurred in caches = 0
Number of correctable programming errors that have occurred = 0
Number of correctable driver errors that have occurred = 0
Number of correctable compute errors that have occurred = 0
Number of correctable non compute errors that have occurred = 0
Number of correctable display errors that have occurred = 0
Setting Total threshold = 14
...
?
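In case it helps with checking which counters Sysman actually reports, here is a minimal sketch that pulls the counter lines out of output like the above. It assumes the counters always appear as `Number of <description> = <integer>`, which is only what this one sample shows. The function name `parse_ras_counters` is my own and is not part of any tool.

```python
import re

# A few lines copied from the zello_sysman --ras output above.
SAMPLE = """\
rasProperties.type = 0
Number of correctable accelerator engine resets attempted by the driver = 0
Number of correctable errors that have occurred in caches = 0
Setting Total threshold = 14
"""

# Assumed line shape: "Number of <description> = <integer>".
COUNTER_RE = re.compile(r"^Number of (?P<name>.+?) = (?P<value>\d+)$")


def parse_ras_counters(text):
    """Return {description: count} for every counter line found in text."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line.strip())
        if match:
            counters[match.group("name")] = int(match.group("value"))
    return counters


if __name__ == "__main__":
    for name, value in parse_ras_counters(SAMPLE).items():
        print(f"{value:>6}  {name}")
```

Lines that do not match the assumed shape (the `rasProperties.type` and threshold lines here) are skipped, not reported as errors.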
I would also suggest listing all metrics common to both platforms before the platform-specific ones, and/or grouping related metrics together in these lists, e.g.:
And:
I compared the following documents:
and which metrics they list XPU Manager as providing. The CSV file info in particular seems very out of date, but the install guide also lists e.g. frequency throttle ratio (as not supported by the current L0 backend), while the user guide does not. IMHO it would be better to have the list of supported metrics in a single place, and to refer to that from the other documents.
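A rough sketch of how such a cross-document check could be scripted, assuming each document's metric list has already been extracted by hand into a set of names. The metric names below are placeholders for illustration, apart from the throttle ratio case mentioned above, and `compare_metric_lists` is a hypothetical helper, not something XPU Manager ships.

```python
def compare_metric_lists(docs):
    """For each document, list the metrics that other documents mention but it omits."""
    all_metrics = set().union(*docs.values())
    return {name: sorted(all_metrics - metrics) for name, metrics in docs.items()}


# Placeholder data: only the throttle ratio difference comes from this issue.
DOCS = {
    "install guide": {"GPU utilization", "GPU frequency", "frequency throttle ratio"},
    "CLI user guide": {"GPU utilization", "GPU frequency"},
}

if __name__ == "__main__":
    for doc, missing in compare_metric_lists(DOCS).items():
        print(f"{doc}: missing {missing if missing else 'nothing'}")
```

With a single canonical metrics list, a check like this would become unnecessary, which is the main point of the suggestion.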