marcingminski / sqlwatch

SQL Server Performance Monitor
https://docs.sqlwatch.io
Other
428 stars 168 forks source link

CPU Utilization query reporting > 100% for certain instances #491

Open chadbaldwin opened 3 weeks ago

chadbaldwin commented 3 weeks ago

Here from SQL Server Community Slack chat: https://app.slack.com/client/T1LTZ0BQV/C1MS1RA4B

Note: I am not currently a SQLWATCH user, but while doing research I found the query being used by SQLWATCH. Leaving this issue here by request of @marcingminski

I came across a bug regarding this commonly used query: https://github.com/marcingminski/sqlwatch/blob/1b3543c290f327a7a918aab8c60a8826ec371cba/SqlWatch.Monitor/Project.SqlWatch.Database/dbo/Procedures/usp_sqlwatch_logger_performance.sql#L62-L79

(Also used in dbadash and other tools/scripts).

It seems there is an issue when running this query against non-hypervisor instances. The ProcessUtilization value does not appear to be 100-based and I have not yet determined what it actually is based on. Just working on theory at this point. The value starts at 0, like normal, but as SQL load increases, the value quickly surpasses where you would expect it to be.

For example, I have an instance with the following stats:

cpu_count                      : 224
hyperthread_ratio              : 56
max_workers_count              : 7552
scheduler_count                : 224
scheduler_total_count          : 315
affinity_type_desc             : AUTO
virtual_machine_type_desc      : NONE
softnuma_configuration_desc    : ON
process_physical_affinity      : {
                                   {0,ffffffffffffff}
                                   {1,ffffffffffffff}
                                   {2,ffffffffffffff}
                                   {3,ffffffffffffff}
                                 }
socket_count                   : 4
cores_per_socket               : 28
numa_node_count                : 16
container_type_desc            : NONE

When I run the linked query above, I get:

percent_processor_time    : 180
percent_idle_time         : 55

My theory is that it might be multiplied by the number of sections in process_physical_affinity, but that is a complete guess for now.

One solution I have considered is to just make the query "less bad". For example, if virtual_machine_type_desc is NONE then do not use ProcessUtilization, instead use 100-SystemIdle, as SystemIdle does seem to reliably be 100-based. Unfortunately this lumps OS load into there, but that seems better than having nothing or something completely wrong.

chadbaldwin commented 3 weeks ago

UPDATE: I guess I've got blinders on and I have been looking at this problem for too long. I don't think there is any need to look at process_physical_affinity. As it seems when virtual_machine_type_desc is NONE then the number of sections in process_physical_affinity is equal to the socket count.