influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.56k stars 5.56k forks source link

Need a PluginErrorHandler interface or something that telegraf can callback when it fails to poll for metrics. #10245

Open batistado opened 2 years ago

batistado commented 2 years ago

Feature Request

Need a PluginErrorHandler interface or something that telegraf can callback when it fails to poll for metrics. For more context please check out this thread: https://influxcommunity.slack.com/archives/CH7C56X4Z/p1638920766176700

Proposal:

A new interface something like this: So an interface something like this:

type PluginErrorHandler interface {
    OnError(err error)
}

would be really useful. https://influxcommunity.slack.com/archives/CH7C56X4Z/p1638984531179900?thread_ts=1638920766.176700&cid=CH7C56X4Z

Current behavior:

On this line: https://github.com/influxdata/telegraf/blob/34ad5aa137f0d845d79826839379cb543c3879c5/agent/agent.go#L451 I see the error being added to the accumulator and the accumulator just logs the error.

Desired behavior:

The accumulator should also call the OnError method of the PluginErrorHandler interface in addition to logging the error.

Use case:

We rely on telegraf heavily to poll MSSQL perf metrics and in case of any errors with polling we would like to capture errors and send it to downstream services for processing. An example would be MS SQL Server browser not running. If such error is reported, we want our system to attempt enabling the sql server browser. For complex errors we could surface these errors in our health check portal.

srebhan commented 2 years ago

@batistado for clarification. Do you want to have this OnError callback to live in each plugin or do you want this to be global over all plugins?

batistado commented 2 years ago

@srebhan I would recommend you have support for both but make it optional. So those who want to get notified on every error for any plugin, they can configure global error handler and those who want it from a specific set of plugins can configure handlers on those plugins.

srebhan commented 2 years ago

@batistado from what I understand, you want to do something like this (pseudocode)

    func myErrorHandler(name string, err error) {
        // Send the received error somewhere
       ...
    }

    ...

    for _, input := range config.RunningInputs {
        if filter.Match(input.LogName()) {
            input.RegisterErrorHandler(myErrorHandler)
        }
    }
    // Same for all other plugin types...

Is this correct?

srebhan commented 2 years ago

@batistado one more comment. I think this can be done already. If you use the tail plugin to monitor your Telegraf log file, you can easily parse errors (or warnings etc) and then use an output plugin (e.g. http) to push that error/warning/whatsoever somewhere. Would that work for you?

batistado commented 2 years ago

@batistado from what I understand, you want to do something like this (pseudocode)

    func myErrorHandler(name string, err error) {
        // Send the received error somewhere
       ...
    }

    ...

    for _, input := range config.RunningInputs {
        if filter.Match(input.LogName()) {
            input.RegisterErrorHandler(myErrorHandler)
        }
    }
    // Same for all other plugin types...

Is this correct?

Yeah something of this sort.

batistado commented 2 years ago

@batistado one more comment. I think this can be done already. If you use the tail plugin to monitor your Telegraf log file, you can easily parse errors (or warnings etc) and then use an output plugin (e.g. http) to push that error/warning/whatsoever somewhere. Would that work for you?

Parsing log error messages is a bad developer experience. Also, the code will have to unnecessarily process non error string messages too and figure out if it's error or other kind of log. And this will also heavily depend on log string message format which could break things if you change your log messages. I would really only want error messages to be notified.

reimda commented 2 years ago

Hi @batistado, I'm not sure I completely understand your use case, but it sounds like you want a way to report errors in an input plugin, like when the sqlserver input fails to connect. You need those errors to be reported downstream through telegraf's output plugin so an external process can take action to fix the errors.

It seems like it should be possible to do this with internal statistics.

https://github.com/influxdata/telegraf/blob/master/plugins/inputs/internal/README.md

see the section about "internal_"

If the sqlserver doesn't currently report connection errors through internal stats, that would be easy to add. Then if you enable the internal input, telegraf will report the errors through a metric like "internal_sqlserver connection_errors=1" You could watch for that field to increase and take action on it.

srebhan commented 2 years ago

@reimda I think the internal plugin is too coarse. From what I understand @batistado wants to know the kind of error (not only the count over all types) to e.g. restart some services on connection errors etc.

Hipska commented 2 years ago

During meeting and on Slack we discussed the possibility to have errors generated by plugins become a new metric like for example:

internal_error,plugin=inputs.http,alias=some_alias,method=Gather message="<actual error message>"

This opens the possibility to run processors and aggregators on it, and send it to the various output plugins as well. The use case of @batistado can be covered by this and many other options are possible as well..

reimda commented 2 years ago

Hi @batistado, do you still need this?

I still don't think I have a clear understanding of what you want. I just reread the slack thread. I understand you're writing an output plugin and you want it to be notified when an error is passed to the accumulator. The output will pass the error on downstream so an external process can fix whatever is causing the error.

What do you think about @Hipska's suggestion to improve the internal plugin so it includes the information you need? I would prefer to extend an existing mechanism than add a new interface that only one user needs.

Do you want to put a draft PR together to do this?

batistado commented 2 years ago

Hey @reimda,

You are right. We want an interface on the output plugin that can emit errors which can be used by our app to attempt an action to fix or mitigate the error. I am not super familiar with all the telegraf features so I don't know if we can programmatically access the emitted metric. Please let me know.

Hipska commented 2 years ago

Yes, the metrics can be sent to a http or socket (tcp/UDP) port or via an external tool/script with exec(d) plugin.

goswamisandeep commented 1 year ago

We do have a scenario where we do report to user whether metric collection fails because of :-

  1. TCP connection issue
  2. Username / password is wrong
  3. PERMISSION issue

I do see telegraf do report them in log files ,

Credentials Error

2022-11-12T09:42:09Z E! [inputs.sqlserver] Error in plugin: query SQLServerAvailabilityReplicaStates failed for server: 10.53.117.51 and database: <empty-database-name> with Msg 18456, Level 14, State 1:, Line 1, Error: mssql: login error: Login failed for user 'sa'.

Connection Error

2022-11-12T09:38:57Z E! [inputs.sqlserver] Error in plugin: query SQLServerDatabaseReplicaStates failed for server: 10.53.117.51 and database: <empty-database-name> with Error: unable to open tcp connection with host '10.53.117.51:1439': dial tcp 10.53.117.51:1439: connect: connection refused

Permission Issue

2022-11-12T09:53:31Z E! [inputs.sqlserver] Error in plugin: query SQLServerPerformanceCounters failed for server: 10.53.117.51 and database: <empty-database-name> with Msg 297, Level 16, State 1:, Line 23, Error: mssql: The user does not have permission to perform this action

Is there a way to propagate these errors/let the user know whether metric collection fails because of connection/credentials/permissions issue.

Hipska commented 1 year ago

This is also a valid use-case that would be solved with this feature request. Currently it is not possible, but we are looking for help for someone to implement this.