influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.62k stars 5.58k forks source link

sqlserver: not all expected metrics collected when using database_type = "SQLServer" #13724

Open leepfrog-ger opened 1 year ago

leepfrog-ger commented 1 year ago

Relevant telegraf.conf

[[inputs.sqlserver]]
  interval = "30s"
  servers = [
    "Server=server1;Port=1433;User Id=telegraf;Password=XXXXX;app name=telegraf;log=1;",
  ]

  database_type = "SQLServer"
  health_metric = true
  auth_method = "connection_string"

Logs from Telegraf

2023-08-04T15:30:49Z I! Starting Telegraf 1.27.3
2023-08-04T15:30:49Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores
2023-08-04T15:30:49Z I! Loaded inputs: sqlserver
2023-08-04T15:30:49Z I! Loaded aggregators: 
2023-08-04T15:30:49Z I! Loaded processors: 
2023-08-04T15:30:49Z I! Loaded secretstores: 
2023-08-04T15:30:49Z I! Loaded outputs: file influxdb
2023-08-04T15:30:49Z I! Tags enabled: host=XXX
2023-08-04T15:30:49Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"XXX", Flush Interval:10s
2023-08-04T15:30:49Z I! [inputs.sqlserver] Config: database_type: SQLServer , query_version:0 , azuredb: false
2023-08-04T15:30:49Z I! [inputs.sqlserver] Config: Effective Queries: []string{"SQLServerDatabaseIO", "SQLServerProperties", "SQLServerMemoryClerks", "SQLServerVolumeSpace", "SQLServerAvailabilityReplicaStates", "SQLServerRecentBackups", "SQLServerPerformanceCounters", "SQLServerWaitStatsCategorized", "SQLServerSchedulers", "SQLServerRequests", "SQLServerCpu", "SQLServerDatabaseReplicaStates"}

System info

Telegraf 1.27.3

Docker

No response

Steps to reproduce

  1. Run the default config of the sqlserver plugin with database_type = "SQLServer" configured against a server that has resource governor (Workload Groups/Resource Pools) configured
  2. Do the same with configuring query_version = 2
  3. Compare the results

Expected behavior

Results should be similar, there should be no metrics missing when using the new style queries (database_type = "SQLServer")

Actual behavior

Certain values are not present. For example for the object SQLServer:Workload Group Stats with query_version = 2 the following counters are returned:

with database_type = "SQLServer" only these are returned:

Additional info

This seems to be due to the queries itself, I get the same mismatch when running

https://github.com/influxdata/telegraf/blob/2ac45b8d25629642ded419e7ce079ae8427a5a98/plugins/inputs/sqlserver/sqlqueriesV2.go#L406

and

https://github.com/influxdata/telegraf/blob/2ac45b8d25629642ded419e7ce079ae8427a5a98/plugins/inputs/sqlserver/sqlserverqueries.go#L333

and compare the output.

If this is intended (or the new queries are known to not yet be fully compatible with the v2 ones) I would suggest to document this somewhere.

powersj commented 1 year ago

The docs already mention breaking compatibility in the sample config:

  ## Optional parameter, setting this to 2 will use a new version
  ## of the collection queries that break compatibility with the original
  ## dashboards.
  ## Version 2 - is compatible from SQL Server 2012 and later versions and also for SQL Azure DB
  # query_version = 2

The reason we put a version config value is when compatibility is broken or changed. Hence v1 != v2.

On top of that, there is also the database_type, which is the suggested way to start using this plugin:

## "database_type" enables a specific set of queries depending on the database type.
leepfrog-ger commented 1 year ago

I am aware of this, as described the incompatibility happens between v2 and database_type, so this is unrelated to v1.

powersj commented 1 year ago

The answer is the same, they are different things. I am happy to have you provide a suggestion or PR to make this more clear.

leepfrog-ger commented 1 year ago

Right now the documentation suggests that no change is to be expected between the two in specific sections.

Specifically the describtion of resource governor metrics (where the mismatch occurs) is identical between query_version=2 and database_type = "SQLServer":

Resource Governor: CPU Usage, Requests/sec, Queued Requests, and Blocked tasks per workload group + more

The pull request where the change was introduced even mentiones the outputs should not change:

Fundamental principals [...] b. Maintain the same measures if the underlying core DMV is the same.

I am happy to contribute clarifying information for the documentation as PR, but before that it should be established whether this this difference is intended (then is should be documented) or a bug (then it should be fixed).

@Trovalo: maybe you can chime in, seeing that you did a lot of work for this input in the past and actively discussed regarding the PR in question?

Trovalo commented 1 year ago

Just got back form vacation, I'll look into it

Trovalo commented 1 year ago

I had a look at it and all the mismatched counters were from the resource governor section that are not fetched anymore, they have been omitted or forgotten during migration. (can't tell if it was on purpose or not, but the original idea was to migrate stuff and keep the output as-is) Those are not real PerfCounters (you won't find them in dm_os_performance_counters and came from dm_resource_governor_workload_groups) and were generated separately.

https://github.com/influxdata/telegraf/blame/2ac45b8d25629642ded419e7ce079ae8427a5a98/plugins/inputs/sqlserver/sqlqueriesV2.go#L591

Since they are not actual PerfCounters I'm kind of fine with them being left out... even tho they could be useful (even tho I've seen no complaints so far) is it an issue to add them back? not at all

So the question is pretty simple, do you want them back?

I've also spotted a pair of missing counters by comparing v2 and current versions: 'Free pages','Extension free pages' (but both are not useful to me, more like infos... you can't really use them to monitor anomalies and I'd go on without them as I've seen no complaints)

leepfrog-ger commented 1 year ago

Thanks for taking a look at it!

For us these metrics are critical as we rely on them to track SQL CPU usage of different users and applications. We only noticed this now because we've started migrating from the ancient telegraf version we've been using. So if I could freely choose of course I'd prefer to have them available in the "new" mode where future development will take place.

That being said it seems like nobody did miss them until now - so I could understand if it is not worth the effort for you.

Trovalo commented 1 year ago

Don't worry that won't require a big effort, I (or anyone) just needs to add the bit of SQL that has been removed/omitted

On Sat, Aug 26, 2023, 00:06 leepfrog @.***> wrote:

Thanks for taking a look at it!

For us these metrics are critical as we rely on them to track SQL CPU usage of different users and applications. We only noticed this now because we've started migrating from the ancient telegraf version we've been using. So if I could freely choose of course I'd prefer to have them available in the "new" mode where future development will take place.

That being said it seems like nobody did miss them until now - so I could understand if it is not worth the effort for you.

— Reply to this email directly, view it on GitHub https://github.com/influxdata/telegraf/issues/13724#issuecomment-1693981963, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC5J5KT4SMTUZWIWPISARM3XXEOURANCNFSM6AAAAAA3EODDWI . You are receiving this because you were mentioned.Message ID: @.***>

Trovalo commented 5 months ago

@leepfrog-ger is this still relevant to you?

leepfrog-ger commented 5 months ago

Thanks for asking, yes, I am still using query_version = 2 as this gives us the required fields