influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

SQLServer Input: Managed Identity token expired but not refreshed #11277

Open chrbrnracn opened 2 years ago

chrbrnracn commented 2 years ago

Relevant telegraf.conf

[agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "10s"
  flush_jitter = "0s"
  hostname = "$HOSTNAME"
  interval = "10s"
  logfile = ""
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = true
[[processors.enum]]
   [[processors.enum.mapping]]
    dest = "status_code"
    field = "status"
    [processors.enum.mapping.value_mappings]
        critical = 3
        healthy = 1
        problem = 2

[[outputs.prometheus_client]]
  listen = ":9273"
  path = "/metrics"

[[inputs.sqlserver]]
  auth_method = "AAD"
  database_type = "AzureSQLDB"
  exclude_query = [
    "AzureSQLDBSchedulers",
    "AzureSQLDBRequests"
  ]
  interval = "30s"
  servers = [
    "Server=azuresqldb.database.windows.net;Port=1433;database=mydb1;hostNameInCertificate=*.database.windows.net;TrustServerCertificate=true;app name=telegraf;log=1;",
    "Server=azuresqldb.database.windows.net;Port=1433;database=mydb2;hostNameInCertificate=*.database.windows.net;TrustServerCertificate=true;app name=telegraf;log=1;"
  ]

[[inputs.internal]]
  collect_memstats = false

Logs from Telegraf

2022-06-09T13:51:00Z E! [inputs.sqlserver] Error in plugin: query AzureSQLDBWaitStats failed for server: azuresqldb.database.windows.net and database: mydb1 with Msg 18456, Level 14, State 233:, Line 1, Error: mssql: login error: Login failed for user '<token-identified principal>'. Token is expired.
2022-06-09T13:51:00Z E! [inputs.sqlserver] Error in plugin: query AzureSQLDBResourceStats failed for server: azuresqldb.database.windows.net and database: mydb2 with Msg 18456, Level 14, State 233:, Line 1, Error: mssql: login error: Login failed for user '<token-identified principal>'. Token is expired.

System info

Telegraf 1.22.3 (git: HEAD ff950615)

Docker

No response

Steps to reproduce

  1. Set up an Azure SQL DB and grant access to a Managed Identity as described in the manual
  2. Set up an Azure Linux VM and assign the Managed Identity as "User Assigned"
  3. Start Telegraf
  4. After ~24 hours the token expires and the error messages shown under "Logs from Telegraf" appear in the log ...

Expected behavior

The token is refreshed shortly before it expires.

Actual behavior

The token does not appear to be refreshed.

Additional info

In the master branch

When/how often is the Start function called?

reimda commented 2 years ago

Hi @chrbrnracn, thanks for the bug report.

Start is called once when the plugin is started: https://github.com/influxdata/telegraf/blob/f7aab29381798bc27d877bff238643246fa719a9/agent/agent.go#L265

It looks like this code came in with PR #8822

I'm not familiar with refreshing the token in this context. Is it something that Azure SQL Database requires regularly?

Are you able to prepare a fix and submit a PR for this?

chrbrnracn commented 2 years ago

These tokens usually expire after 24 hours, so they must be refreshed regularly. The token refresh should therefore run not only from Start but periodically, perhaps before every update. The getTokenProvider function is more than 90% about checking token expiry and getting new tokens, so it may be sufficient to call this function on a regular basis.

I'm not a Go developer, so unfortunately I can't prepare a fix for that. My findings were just from reading the code.

RussOBrienSea commented 1 year ago

It is worrying that this bug (which we are now also experiencing) has had no resolution in over a year.

Is the sqlserver input mothballed?