CPU Utilization query reporting > 100% for certain instances

Here from SQL Server Community Slack chat: https://app.slack.com/client/T1LTZ0BQV/C1MS1RA4B

Note: I am not currently a SQLWATCH user, but while doing research I found the query being used by SQLWATCH. Leaving this issue here by request of @marcingminski

I came across a bug regarding this commonly used query: https://github.com/marcingminski/sqlwatch/blob/1b3543c290f327a7a918aab8c60a8826ec371cba/SqlWatch.Monitor/Project.SqlWatch.Database/dbo/Procedures/usp_sqlwatch_logger_performance.sql#L62-L79

(Also used in dbadash and other tools/scripts).

It seems there is an issue when running this query against non-hypervisor instances. The ProcessUtilization value does not appear to be 100-based and I have not yet determined what it actually is based on. Just working on theory at this point. The value starts at 0, like normal, but as SQL load increases, the value quickly surpasses where you would expect it to be.

For example, I have an instance with the following stats:

cpu_count                      : 224
hyperthread_ratio              : 56
max_workers_count              : 7552
scheduler_count                : 224
scheduler_total_count          : 315
affinity_type_desc             : AUTO
virtual_machine_type_desc      : NONE
softnuma_configuration_desc    : ON
process_physical_affinity      : {
                                   {0,ffffffffffffff}
                                   {1,ffffffffffffff}
                                   {2,ffffffffffffff}
                                   {3,ffffffffffffff}
                                 }
socket_count                   : 4
cores_per_socket               : 28
numa_node_count                : 16
container_type_desc            : NONE

When I run the linked query above, I get:

percent_processor_time    : 180
percent_idle_time         : 55

My theory is that it might be multiplied by the number of sections in process_physical_affinity, but that is a complete guess for now.

One solution I have considered is to just make the query "less bad". For example, if virtual_machine_type_desc is NONE then do not use ProcessUtilization, instead use 100-SystemIdle, as SystemIdle does seem to reliably be 100-based. Unfortunately this lumps OS load into there, but that seems better than having nothing or something completely wrong.

marcingminski / sqlwatch

CPU Utilization query reporting > 100% for certain instances #491