This PR adds a feature to automatically disable failing source collectors, with each collector having its own individual configuration for this behavior.
By default, the error tolerance feature is disabled so that the program's default behavior does not change. It can be enabled for specific source collectors using the new configuration options in the [source_collector.<name>] table of the configuration file:
error_tolerance (default is 0) to set the maximum number of tolerated errors before a collector is disabled. If set to 0, disablement occurs on first error.
error_duration (default is 360) to set the number of seconds an error is stored before it is discarded. This value must be greater than or equal to error_tolerance * update_interval for errors to be detected properly; otherwise the tolerance will never be exceeded. For example, if error_tolerance=5 and update_interval=60, the minimum valid value is error_duration=300 (5 * 60).
exit_on_error (default is false) to decide whether the application should terminate or disable the failing collector when the number of errors exceeds the tolerance.
disable_duration (default is 3600) to set the disablement period, in seconds, for a collector once it has been disabled. After this period, the collector is enabled again. If set to 0, the collector remains disabled until the application is restarted.
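The constraint on error_duration is plain arithmetic and can be checked up front. The following is a minimal sketch, assuming a collector fails at most once per update_interval; the helper names min_error_duration and is_valid_error_duration are hypothetical and not part of this PR.

```python
def min_error_duration(error_tolerance: int, update_interval: int) -> int:
    """Smallest error_duration (seconds) that lets the tolerance be exceeded.

    Hypothetical helper for illustration; it is not part of this PR.
    """
    return error_tolerance * update_interval


def is_valid_error_duration(
    error_duration: int, error_tolerance: int, update_interval: int
) -> bool:
    # Assumption: errors arrive at most once per update_interval, so the
    # storage window must span error_tolerance intervals.
    return error_duration >= min_error_duration(error_tolerance, update_interval)
```

With error_tolerance=5 and update_interval=60, min_error_duration returns 300, matching the example above.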
If more than error_tolerance errors occur within error_duration seconds, the collector is disabled. Errors older than error_duration seconds are discarded, allowing recovery if failures are temporary. Source collectors must opt in to this feature by setting error_tolerance to a non-zero value; otherwise, the first error triggers disablement.
The exit_on_error setting offers an option to terminate the application instead of disabling the collector.
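The interaction between these options can be illustrated with a sliding window of error timestamps. This is a sketch under stated assumptions, not the code in this PR: the ErrorWindow class and on_error function are hypothetical, "exceeding" the tolerance is taken to mean strictly more errors than error_tolerance, and an error is discarded once its age is strictly greater than error_duration.

```python
from collections import deque
from typing import Deque


class ErrorWindow:
    """Tracks recent error timestamps for one collector (illustrative only)."""

    def __init__(self, error_tolerance: int = 0, error_duration: float = 360.0) -> None:
        self.error_tolerance = error_tolerance
        self.error_duration = error_duration
        self._timestamps: Deque[float] = deque()

    def record_error(self, now: float) -> bool:
        """Record an error at time `now`; return True if the tolerance is exceeded."""
        self._timestamps.append(now)
        # Discard errors older than error_duration seconds.
        while self._timestamps and now - self._timestamps[0] > self.error_duration:
            self._timestamps.popleft()
        return len(self._timestamps) > self.error_tolerance


def on_error(window: ErrorWindow, now: float, exit_on_error: bool = False) -> str:
    """Return the action to take: 'continue', 'disable', or 'exit'."""
    if not window.record_error(now):
        return "continue"
    return "exit" if exit_on_error else "disable"
```

Passing the current time explicitly, rather than reading a clock inside the class, keeps the sketch deterministic and easy to test.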
Architectural changes
Configuration management inside source collector processes has also changed. Instead of dictionaries, we now pass SourceCollectorSettings objects as the config argument of SourceCollectorProcess, which improves IDE autocomplete and aids static analysis. The keyword arguments passed to collect are now stripped of the keys that correspond to SourceCollectorSettings fields, so that only extra arguments are passed to the collector function.
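The separation of settings from extra collector arguments could look roughly like the following. This is a sketch only: the dataclass below is a stand-in that mirrors the four options described in this PR and may not match the real SourceCollectorSettings, and split_config and the url key are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Tuple


@dataclass
class SourceCollectorSettings:
    # Stand-in for illustration: field names and defaults follow the options
    # described in this PR; the real class may define more or different fields.
    error_tolerance: int = 0
    error_duration: int = 360
    exit_on_error: bool = False
    disable_duration: int = 3600


def split_config(raw: Dict[str, Any]) -> Tuple[SourceCollectorSettings, Dict[str, Any]]:
    """Split a raw config table into typed settings and extra collector kwargs."""
    known = {f.name for f in fields(SourceCollectorSettings)}
    settings = SourceCollectorSettings(**{k: v for k, v in raw.items() if k in known})
    extra = {k: v for k, v in raw.items() if k not in known}
    return settings, extra


# Only the hypothetical "url" key would be forwarded to the collector function.
settings, extra = split_config({"error_tolerance": 3, "url": "https://example.com"})
```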
Future Work
It would be beneficial to implement exponential backoff for the disablement period, so that we can retry more often when initial disablement occurs, and then exponentially increase the disablement period on repeated errors.
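As a rough illustration of this first option, the disablement period could grow with each consecutive disablement up to a cap. The function name, the factor of 2, and the cap are all assumptions made for the sketch; none of them is proposed in this PR.

```python
def backoff_disable_duration(
    base: float,
    consecutive_disables: int,
    factor: float = 2.0,
    maximum: float = 3600.0,
) -> float:
    """Disablement period in seconds after the n-th consecutive disablement (n >= 1)."""
    if consecutive_disables < 1:
        raise ValueError("consecutive_disables must be >= 1")
    # The period is multiplied by `factor` on each repeat and capped at `maximum`.
    return min(base * factor ** (consecutive_disables - 1), maximum)
```

Starting from a base of 60 seconds, the first three disablements would last 60, 120, and 240 seconds.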
OR
Implement a dynamic adjustment of the source collector's update interval using exponential backoff when errors occur. This means we'll retry frequently at first, then gradually reduce the frequency, until we reach the error tolerance limit. Beyond this point, the source collector will be disabled for a set duration. Although this method mimics traditional retry functionality more closely than option 1, it complicates the update interval configuration since initial retries should probably be faster than this value. Using exponential backoff with a long update interval like 10 minutes could result in extended wait times unless the backoff factor is kept low. This factor would need to be a configurable option or dynamically determined by the program.
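The concern about long wait times can be made concrete by listing the retry delays this second option would produce. The sketch below assumes a hypothetical retry_delays helper and that one retry is made per tolerated error; it is not a proposed implementation.

```python
from typing import List


def retry_delays(initial_delay: float, factor: float, error_tolerance: int) -> List[float]:
    """Delays in seconds before each retry, one per tolerated error."""
    # Each delay is the previous one multiplied by `factor`.
    return [initial_delay * factor ** attempt for attempt in range(error_tolerance)]
```

If the 10-minute update interval (600 seconds) is used as the initial delay with a factor of 2 and error_tolerance=5, the delays are 600, 1200, 2400, 4800, and 9600 seconds, or 18600 seconds (more than five hours) in total. Starting from a separate, shorter initial delay such as 10 seconds gives 10, 20, 40, 80, and 160 seconds instead.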