influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.6k stars 5.57k forks source link

[SQL Server Input plugin] Configurable Query Timeout #9377

Closed bhsu-ms closed 3 years ago

bhsu-ms commented 3 years ago

Feature Request

Introduce configurable timeout for queries in SQL plugin to prevent telegraf queries blocking SQL server resources.

Proposal:

There is no timeout for telegraf queries in SQL server plugin today. Telegraf could potentially trigger long running queries and cause performance issues. Introducing timeouts in queries will prevent long running queries from happening. There could be many reasons why even a simple DMV query gets blocked or executes much slower than expected, so a timeout would be a good thing in general. This timeout is configurable. If no timeout is specified for the connection string, the default timeout will be 30s. Customers can also set the timeout to some value that fits their scenario.

Current behavior:

No timeout for SQL server plugin queries. Telegraf could have long running queries that cause performance issues.

Desired behavior:

  1. If the user does not specify query timeout value, we set the query timeout to 30s.
  2. If the user specifies a different timeout value in telegraf.conf, we use that value as query timeout.

Use case:

Timeout could prevent telegraf from introducing long running queries causing performance issues.

I'm willing to make the change, but wanted to gather feedback before I create a PR.

bhsu-ms commented 3 years ago

@ssoroka @srebhan @Trovalo @denzilribeiro - Please let me know your thoughts. @masree @coquagli - FYI

srebhan commented 3 years ago

@bhsu-ms thanks for opening this discussion, much appreciated. Let me first check-back if I understood your question correctly... We are talking about the "new" generic sql plugin in plugins/inputs/sql and your proposal is to change the timeout handling in Gather() to a timeout per query. Is this correct?

As currently the timeout is used per-query in preparing the statement anyway, I'm open to simply make the timeout per-query in Gather() and change the value. I guess it is save to expect that establishing the connection to the server (the third place we use the timeout) is usually quicker than preparation or queries anyway?

Trovalo commented 3 years ago

In the SQL Server plugin, you can actually set the Query timeout, but in the connection string itself. there are lots of parameters just look at the driver docs.

as a sample: server=SqlInstance01;database=master;connection timeout=10

might be nice to centralize timeout too, there is a similar proposal for the app name in #8810

srebhan commented 3 years ago

@Trovalo the sql input plugin support multiple different "drivers" not just mssql. I'm not sure if this is a generic thing required to be supported by the driver. Do you know it's save to rely on this in e.g. mysql, postres or sqlite drivers? Anyway I would like to provide an option in telegraf that makes sure that we do not block forever, no matter what the driver is doing...

denzilribeiro commented 3 years ago

In the SQL Server plugin, you can actually set the Query timeout, but in the connection string itself. there are lots of parameters just look at the driver docs.

as a sample: server=SqlInstance01;database=master;connection timeout=10

might be nice to centralize timeout too, there is a similar proposal for the app name in #8810

@Trovalo this is Connection timeout not Query timeout, what the ask is here is a query timeout, say specific query gets blocked or whatever by workload, is better for monitoring query to timeout than to create any contention in the system .

@bhsu-ms assuming something like this https://www.alexedwards.net/blog/how-to-manage-database-timeouts-and-cancellations-in-go is what you were thinking or https://www.jajaldoang.com/post/sql-query-timeout-with-golang-context/ ?

srebhan commented 3 years ago

I think we are drifting off a bit on the original issue. If I'm not mistaken, @bhsu-ms wants to change those lines in plugins/inputs/sql to something like

    tstart := time.Now()
    for _, query := range s.Queries {
        wg.Add(1)
        go func(q Query) {
            defer wg.Done()
            ctx, cancel := context.WithTimeout(context.Background(), time.Duration(s.Timeout))
            defer cancel()
            if err := s.executeQuery(ctx, acc, q, tstart); err != nil {
                acc.AddError(err)
            }
        }(query)
    }

@denzilribeiro we already have a timeout (see the link above), but currently it is over all queries. This means that if one query takes exceptionally long, all following queries will be canceled/dropped due to the timeout...

denzilribeiro commented 3 years ago

@srebhan missed that totally thanks :) , will let @bhsu-ms clarify his proposal if it was more than this aka per query timeout? or...

bhsu-ms commented 3 years ago

@srebhan @Trovalo @denzilribeiro Thanks for the feedback! My original proposal was to have timeout for per query. I missed the lines @srebhan mentioned above. I'm open to whether having the timeout over all queries or per query. As @srebhan mentioned, all following queries will be dropped due to timeout when there's one long running query.

srebhan commented 3 years ago

@bhsu-ms the more I think about it the more it makes sense to have the timeout per query. So please feel free to submit a PR and assign me.

Trovalo commented 3 years ago

I'll refer to the SQL Server input plugin only, as this is what's reported in the title of the request, if you want to implement it in the generic SQL plugin I already thank you @srebhan as that's will surely be useful. What I'll report next might help you implement it for the SQL server driver.

@Trovalo this is Connection timeout not Query timeout, what the ask is here is a query timeout, say specific query gets blocked or whatever by workload, is better for monitoring query to timeout than to create any contention in the system .

@denzilribeiro The parameter connection timeout is not actually just the connection timeout, it's the total time before the query gets canceled. The driver is not that clear about it but literally says:

connection timeout - in seconds (default is 0 for no timeout), set to 0 for no timeout. Recommended to set to 0 and use context to manage query and connection timeouts.

As you can see, it's not suggested to use the connection string parameter itself, but the "context", I had a look around and found an old Issue #2824 in which this was discussed, but never implemented/merged.

From my experience, I can tell you that using the connection timeout parameter is extremely useful because it allows you to not lose all the gathered data.

as a sample, let's say that:

With timeout 10sec→ server=SqlInstance01;connection timeout=10

Without timeout → server=SqlInstance01

Just for completeness, a whole gathering round (input + processor ,etc + outputs) must be completed within the set interval, so what I usually do is set the connection timeout to circa 50% of the interval, most common case for me is interval = '15s' and connection timeout=8.