marcingminski / sqlwatch

SQL Server Performance Monitor
https://docs.sqlwatch.io

Using CLR to obtain Performance Counters #256

Closed marcingminski closed 3 years ago

marcingminski commented 3 years ago

I am looking at moving some of the code that runs checks and actions to CLR. Perhaps even code that collects performance counters from the OS. With the broker queues, this would make the application completely contained within the SQL database without any agent jobs.

However, as a DBA I never liked CLR, as I was of the opinion that application code is expensive to run in SQL (CPU-based licensing) and should run in the application tier; databases should run only database code. In SQLWATCH, however, we have no application tier, and writing a full-blown .NET app is not an easy task. CLR would improve the performance of the code where we do a lot of looping, cursors and comparison logic (checks and actions). https://docs.microsoft.com/en-us/sql/relational-databases/clr-integration/clr-integration-architecture-performance?view=sql-server-ver15

As a DBA, I was also wary of CLR because I was never sure what was in the stack, and I would not want to run an untrusted piece of software in the production tier. However, this is all open source, so everyone can see what's in it.

Looking forward to seeing your input, but please don't just say that you do not like it.

marcingminski commented 3 years ago

CLR function to read performance counters: https://github.com/marcingminski/sqlwatch/commit/1870cadee01006ca2fb9313e1dadc03fc46adace#diff-298ba004840a427494460a0c876ddf3b63e0df9d3dfc067b80538842f84d7b16

Example of getting an OS performance counter using T-SQL:

select cntr_value=convert(real,[dbo].[ReadPerformanceCounterFormatted] ('Processor','% Processor Time','_Total'))
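For context, a minimal sketch of what such a CLR function might look like. The real implementation is in the commit linked above; the return type and the 100ms sample interval here are illustrative assumptions only:

```csharp
using System.Data.SqlTypes;
using System.Diagnostics;
using System.Threading;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    // Illustrative sketch only - the actual function lives in the linked commit.
    [SqlFunction]
    public static SqlDouble ReadPerformanceCounterFormatted(
        SqlString category, SqlString counter, SqlString instance)
    {
        using (var pc = new PerformanceCounter(
            category.Value, counter.Value, instance.Value, true /* read-only */))
        {
            // Formatted (calculated) counters need two samples:
            // the first NextValue() primes the counter, the second returns the value.
            pc.NextValue();
            Thread.Sleep(100); // sample interval - this wait is the problem discussed in the next comment
            return new SqlDouble(pc.NextValue());
        }
    }
}
```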
marcingminski commented 3 years ago

The problem with reading Performance Counters via a CLR function is that the function is invoked for every data row. This is not a performance problem as such - whilst invoking a SQL function for every row can be considered bad coding, doing the same in CLR is exactly what CLR is designed for. The problem is with the PerformanceCounter class:

  1. Reading a calculated value, i.e. in a human-friendly format such as CPU utilisation as a percentage, requires two readings: read the data, wait a few ms, read it again and return the result to the client. This becomes a problem because, as said before, we are invoking the CLR function for each data row, and many counters mean many data rows, each of which would have to wait a few ms, adding to the total execution time of the SQL procedure. In other words, the more counters we have, the longer it will take. 10 counters, each waiting 100ms, would already add up to 1s.
  2. Reading RAW values is instant but requires a lot of supporting data to calculate the actual value. It is not enough to just calculate a delta; we often need values such as time ticks, CPU ticks etc. This would require changes to the schema to accommodate those values and then lots of code to do the calculation (see the sketch after this list).
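To illustrate point 2, a rough sketch (written as a standalone console snippet rather than SQLCLR code) of what working with raw samples involves - a raw sample on its own is just ticks and timestamps, and turning it into a percentage still needs a second sample plus the counter-type-specific formula:

```csharp
using System;
using System.Diagnostics;

class RawCounterDemo
{
    static void Main()
    {
        var pc = new PerformanceCounter("Processor", "% Processor Time", "_Total", true);

        // Taking a raw sample is instant, but the numbers are meaningless by themselves.
        CounterSample first = pc.NextSample();
        Console.WriteLine($"RawValue={first.RawValue}, TimeStamp100nSec={first.TimeStamp100nSec}, CounterType={first.CounterType}");

        // To get the % we still need a second sample and the calculation
        // appropriate to the counter type (done here by CounterSampleCalculator).
        System.Threading.Thread.Sleep(1000);
        CounterSample second = pc.NextSample();
        float value = CounterSampleCalculator.ComputeCounterValue(first, second);
        Console.WriteLine($"% Processor Time = {value}");
    }
}
```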

Moreover, measuring CPU utilisation is a challenging task. CPU utilisation (%) is the percentage of the time the CPU spends doing work; therefore, it requires observation over a period. Suppose we observe the CPU for a period of 1 second and notice that it was doing some work for 100ms and was idle for 900ms. We could then say the CPU was 10% busy in our sample. However, if we only observed it for 100ms, and it was doing work for the entire 100ms, we would say the CPU was 100% busy. Similarly, if we observed the same CPU for 10 seconds, the same 100ms would become 1%. In all those examples the CPU was busy for the same amount of time, but we got three different readings. There is a good article about CPU utilisation on the TechNet wiki: https://social.technet.microsoft.com/wiki/contents/articles/12984.understanding-processor-processor-time-and-process-processor-time.aspx
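The same arithmetic spelled out, using the numbers from the paragraph above:

```csharp
using System;

// 100ms of CPU work observed over three different observation windows
double busyMs = 100;
foreach (double windowMs in new[] { 100.0, 1000.0, 10000.0 })
{
    double utilisation = busyMs / windowMs * 100;   // 100%, 10%, 1%
    Console.WriteLine($"{windowMs}ms window -> {utilisation}% busy");
}
```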

Even running typeperf.exe with different intervals gives different readings.

With a 1-second interval you are likely to notice many more spikes:

typeperf "\Processor(_Total)\% Processor Time" -si 1

With a 5-second interval the readings should be smoother:

typeperf "\Processor(_Total)\% Processor Time" -si 5

A much better approach would be to come at it from the other angle: invoke a process (a CLR stored procedure, for example) that reads and streams the counters back to the table continuously. This would be multithreaded, so there would be no issues with blocking operations.
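A rough sketch of that idea, assuming a hypothetical procedure name and staging table (dbo.performance_counter_staging). A real version would need to handle cancellation and threading, and accessing OS performance counters from SQLCLR would likely require an EXTERNAL_ACCESS or UNSAFE assembly:

```csharp
using System.Data.SqlClient;
using System.Diagnostics;
using System.Threading;
using Microsoft.SqlServer.Server;

public partial class StoredProcedures
{
    // Hypothetical sketch: loop inside a CLR procedure, sampling a counter
    // and writing each reading into a staging table over the context connection.
    [SqlProcedure]
    public static void CollectProcessorTime(int samples, int intervalMs)
    {
        using (var pc = new PerformanceCounter("Processor", "% Processor Time", "_Total", true))
        using (var conn = new SqlConnection("context connection=true"))
        {
            conn.Open();
            pc.NextValue(); // prime the counter so subsequent reads return calculated values

            for (int i = 0; i < samples; i++)
            {
                Thread.Sleep(intervalMs);
                using (var cmd = new SqlCommand(
                    "insert into dbo.performance_counter_staging (counter_value) values (@v)", conn))
                {
                    cmd.Parameters.AddWithValue("@v", pc.NextValue());
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}
```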

Another, even better approach would be to query WMI, as it returns calculated values instantly. It uses the same raw data and the same calculations as the performance counters, but all in background threads.
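For example, a sketch using System.Management (shown as a standalone snippet; it needs a reference to System.Management, and a SQLCLR version would have the same permission-set caveats as above). The Win32_PerfFormattedData_* classes already expose the calculated value, so no second sample or manual delta arithmetic is needed:

```csharp
using System;
using System.Management;

class WmiCpuDemo
{
    static void Main()
    {
        // The formatted-data WMI class returns an already-calculated percentage.
        var searcher = new ManagementObjectSearcher(
            "select PercentProcessorTime from Win32_PerfFormattedData_PerfOS_Processor where Name = '_Total'");

        foreach (ManagementObject mo in searcher.Get())
        {
            Console.WriteLine($"% Processor Time = {mo["PercentProcessorTime"]}");
        }
    }
}
```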

Summarising, having done all this work, this may not actually make it into the final product at all, and I may simply revert to using the ring buffer to provide CPU utilisation. At the end of the day, we may not care about short-term variations, only about longer-term utilisation.

marcingminski commented 3 years ago

Another point brought up to me was security audits and the fact that auditors disallow CLR (it seems auditors don't really know how most of this stuff works; they just follow a randomly collated "list of things that we do not understand, so better turn it off"). It would be a shame to develop something that cannot be used because of auditors.