Closed bhsu-ms closed 3 years ago
@ssoroka @srebhan @Trovalo @denzilribeiro - Please let me know your thoughts. @masree @coquagli - FYI
@bhsu-ms thanks for opening this discussion, much appreciated. Let me first check-back if I understood your question correctly... We are talking about the "new" generic sql plugin in plugins/inputs/sql
and your proposal is to change the timeout handling in Gather()
to a timeout per query. Is this correct?
As currently the timeout is used per-query in preparing the statement anyway, I'm open to simply make the timeout per-query in Gather()
and change the value.
I guess it is save to expect that establishing the connection to the server (the third place we use the timeout) is usually quicker than preparation or queries anyway?
In the SQL Server plugin, you can actually set the Query timeout, but in the connection string itself. there are lots of parameters just look at the driver docs.
as a sample:
server=SqlInstance01;database=master;connection timeout=10
might be nice to centralize timeout too, there is a similar proposal for the app name
in #8810
@Trovalo the sql
input plugin support multiple different "drivers" not just mssql
. I'm not sure if this is a generic thing required to be supported by the driver. Do you know it's save to rely on this in e.g. mysql
, postres
or sqlite
drivers?
Anyway I would like to provide an option in telegraf that makes sure that we do not block forever, no matter what the driver is doing...
In the SQL Server plugin, you can actually set the Query timeout, but in the connection string itself. there are lots of parameters just look at the driver docs.
as a sample:
server=SqlInstance01;database=master;connection timeout=10
might be nice to centralize timeout too, there is a similar proposal for the
app name
in #8810
@Trovalo this is Connection timeout not Query timeout, what the ask is here is a query timeout, say specific query gets blocked or whatever by workload, is better for monitoring query to timeout than to create any contention in the system .
@bhsu-ms assuming something like this https://www.alexedwards.net/blog/how-to-manage-database-timeouts-and-cancellations-in-go is what you were thinking or https://www.jajaldoang.com/post/sql-query-timeout-with-golang-context/ ?
I think we are drifting off a bit on the original issue. If I'm not mistaken, @bhsu-ms wants to change those lines in plugins/inputs/sql
to something like
tstart := time.Now()
for _, query := range s.Queries {
wg.Add(1)
go func(q Query) {
defer wg.Done()
ctx, cancel := context.WithTimeout(context.Background(), time.Duration(s.Timeout))
defer cancel()
if err := s.executeQuery(ctx, acc, q, tstart); err != nil {
acc.AddError(err)
}
}(query)
}
@denzilribeiro we already have a timeout (see the link above), but currently it is over all queries. This means that if one query takes exceptionally long, all following queries will be canceled/dropped due to the timeout...
@srebhan missed that totally thanks :) , will let @bhsu-ms clarify his proposal if it was more than this aka per query timeout? or...
@srebhan @Trovalo @denzilribeiro Thanks for the feedback! My original proposal was to have timeout for per query. I missed the lines @srebhan mentioned above. I'm open to whether having the timeout over all queries or per query. As @srebhan mentioned, all following queries will be dropped due to timeout when there's one long running query.
@bhsu-ms the more I think about it the more it makes sense to have the timeout per query. So please feel free to submit a PR and assign me.
I'll refer to the SQL Server input plugin only, as this is what's reported in the title of the request, if you want to implement it in the generic SQL plugin I already thank you @srebhan as that's will surely be useful. What I'll report next might help you implement it for the SQL server driver.
@Trovalo this is Connection timeout not Query timeout, what the ask is here is a query timeout, say specific query gets blocked or whatever by workload, is better for monitoring query to timeout than to create any contention in the system .
@denzilribeiro The parameter connection timeout
is not actually just the connection timeout, it's the total time before the query gets canceled. The driver is not that clear about it but literally says:
connection timeout - in seconds (default is 0 for no timeout), set to 0 for no timeout. Recommended to set to 0 and use context to manage query and connection timeouts.
As you can see, it's not suggested to use the connection string parameter itself, but the "context", I had a look around and found an old Issue #2824 in which this was discussed, but never implemented/merged.
From my experience, I can tell you that using the connection timeout
parameter is extremely useful because it allows you to not lose all the gathered data.
as a sample, let's say that:
With timeout 10sec→ server=SqlInstance01;connection timeout=10
Without timeout → server=SqlInstance01
interval
. once the interval is reached, and a new round of gathering should start, you will get an error like: W! [agent] input "inputs.sqlserver" did not complete within its interval
Just for completeness, a whole gathering round (input + processor ,etc + outputs) must be completed within the set interval
, so what I usually do is set the connection timeout
to circa 50% of the interval, most common case for me is interval = '15s'
and connection timeout=8
.
Feature Request
Introduce configurable timeout for queries in SQL plugin to prevent telegraf queries blocking SQL server resources.
Proposal:
There is no timeout for telegraf queries in SQL server plugin today. Telegraf could potentially trigger long running queries and cause performance issues. Introducing timeouts in queries will prevent long running queries from happening. There could be many reasons why even a simple DMV query gets blocked or executes much slower than expected, so a timeout would be a good thing in general. This timeout is configurable. If no timeout is specified for the connection string, the default timeout will be 30s. Customers can also set the timeout to some value that fits their scenario.
Current behavior:
No timeout for SQL server plugin queries. Telegraf could have long running queries that cause performance issues.
Desired behavior:
Use case:
Timeout could prevent telegraf from introducing long running queries causing performance issues.
I'm willing to make the change, but wanted to gather feedback before I create a PR.