influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.56k stars 5.56k forks source link

inputs.sqlserver plugin: Azure SQL Hyperscale TLOG check results in Arithmetic overflow error #13862

Closed jkutikacn closed 1 year ago

jkutikacn commented 1 year ago

Relevant telegraf.conf

[agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "10s"
  flush_jitter = "0s"
  hostname = "$HOSTNAME"
  interval = "10s"
  logfile = ""
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = true
[[processors.enum]]
   [[processors.enum.mapping]]
    dest = "status_code"
    field = "status"
    [processors.enum.mapping.value_mappings]
        critical = 3
        healthy = 1
        problem = 2

[[outputs.prometheus_client]]
  listen = ":9273"
  path = "/metrics"

[[inputs.sqlserver]]
  auth_method = "AAD"
  database_type = "AzureSQLDB"
  exclude_query = [
    "AzureSQLDBSchedulers",
    "AzureSQLDBRequests"
  ]
  interval = "30s"
  servers = [ "Server=azuresqldb.database.windows.net;Port=1433;database=mydb1;hostNameInCertificate=*.database.windows.net;TrustServerCertificate=true;app name=telegraf;log=1;",
    "Server=azuresqldb.database.windows.net;Port=1433;database=mydb2;hostNameInCertificate=*.database.windows.net;TrustServerCertificate=true;app name=telegraf;log=1;"
  ]

[[inputs.internal]]
  collect_memstats = false

Logs from Telegraf

2023-09-05T08:24:30Z E! [inputs.sqlserver] Error in plugin: mssql: Arithmetic overflow error converting expression to data type int

System info

1.27.0

Docker

No response

Steps to reproduce

  1. configure inputs.sqlserver for big azure sql hyperscale database (our DB is 23TB in size)
  2. run telegraf to collect telemetry from the database
  3. TLOG size check will return Arithmetic overflow error if TLOG size is big (our TLOG is currently 1008 GB)
  4. checking line 246 in [azuresqldbqueries.go] (https://github.com/influxdata/telegraf/blob/master/plugins/inputs/sqlserver/azuresqldbqueries.go) show that this is checking size of TLOG and when executed manually via SSMS I get the same error message as in telegraf.

Expected behavior

Plugin shoud be adopted to hyperscale azure sql tier, which allows DB size up to 100TB and TLOG size is virtually unlimited, therefore limiting the possibilities of monitoring TLOG size makes no sense

Actual behavior

If TLOG size does not fit into integer range error is thrown Arithmetic overflow error converting expression to data type int

image

Additional info

line 246 of the plugin needs to be modified, for instance like this, afterwards the statement can be successfully executed on the database (test only via SSMS)

,(SELECT SUM(CAST (max_size as BIGINT)) * 8 / (1024 * 1024) FROM sys.database_files WHERE type_desc = 'LOG') AS max_log_mb

image

powersj commented 1 year ago

Hi @jkutikacn,

Thanks for the issue. It looks like the diff you are proposing is https://github.com/influxdata/telegraf/blob/master/plugins/inputs/sqlserver/azuresqldbqueries.go#L246:

--SUM(max_size)
++SUM(CAST (max_size as BIGINT))

@Trovalo thoughts on the change?

Trovalo commented 1 year ago

I approve it, we shouldn't go in arithmetic overflow

powersj commented 1 year ago

@jkutikacn,

I've put up https://github.com/influxdata/telegraf/pull/13870 with your change. Once artifacts are available, can you confirm the fix please? Thanks!

jkutikacn commented 1 year ago

Hi @powersj, I can confirm the fix works! Thanks a lot and cheers!