Closed · j-bai closed this issue 3 years ago
Probable cause of the issue:

If an error occurs while sending the HTTP request (e.g. src/http.rs line 109), a user-defined error of type `CallRequest` is raised, which is a wrapper around `hyper::Error` created with snafu. This error is of type `HttpError`.

The error is caught at src/sinks/util/retries.rs line 124, where `error.downcast_ref::<L::Error>()` tries to determine the error's type. If the type is `hyper::Error` (as in the HTTP sink), a retry happens; but since we wrapped the error into `CallRequest`, the downcast fails, the generic error handling at line 140 runs, and the code gives up instead of retrying.
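The downcast mismatch described above can be demonstrated in isolation. The sketch below uses hypothetical stand-in types (`HyperError` for `hyper::Error`, `CallRequest` for the snafu-generated wrapper; names are illustrative, not Vector's exact definitions) to show that a boxed wrapper error does not downcast to the inner type it wraps:

```rust
use std::error::Error;
use std::fmt;

// Hypothetical stand-in for hyper::Error.
#[derive(Debug)]
struct HyperError;

impl fmt::Display for HyperError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "connection error")
    }
}
impl Error for HyperError {}

// Hypothetical stand-in for the snafu-generated CallRequest wrapper.
#[derive(Debug)]
struct CallRequest {
    source: HyperError,
}

impl fmt::Display for CallRequest {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "error sending request: {}", self.source)
    }
}
impl Error for CallRequest {}

// Mirrors the shape of the check in retries.rs: retry only if the
// error downcasts to the inner hyper error type.
fn is_retriable(err: &(dyn Error + 'static)) -> bool {
    err.downcast_ref::<HyperError>().is_some()
}

fn main() {
    let wrapped: Box<dyn Error + 'static> =
        Box::new(CallRequest { source: HyperError });
    // The box holds the wrapper, so the downcast to the inner type
    // fails and the error is treated as non-retriable.
    assert!(!is_retriable(wrapped.as_ref()));
    // An unwrapped inner error would have been classified as retriable.
    assert!(is_retriable(&HyperError));
}
```

This is why the generic (non-retrying) path is taken: the concrete type inside the `Box<dyn Error>` is the wrapper, not the `hyper::Error` the retry check looks for.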
This behavior also seems to be present in the docker image for 0.11.0 (for the Loki sink, though it fails in the HTTP request section as well). Using 0.10, I get this sort of error if it catches the failure just as Loki is started again:
```
Feb 07 03:09:11.651 WARN sink{name=loki_standard type=loki}:request{request_id=0}: vector::sinks::util::retries2: retrying after error: error trying to connect: tcp connect error: Connection refused (os error 111)
Feb 07 03:09:16.657 ERROR sink{name=loki_standard type=loki}:request{request_id=0}: vector::sinks::util::retries2: encountered non-retriable error. error=connection error: Connection reset by peer (os error 104)
Feb 07 03:09:16.658 ERROR sink{name=loki_standard type=loki}:request{request_id=0}: vector::sinks::util::sink: Request failed. error=connection error: Connection reset by peer (os error 104)
```
I'm hoping that putting this behind an authenticating web proxy will reduce the chance of this causing errors outside of explicit retry testing, but it would be nice for total failures to be retried. I'll definitely need to tune the configuration so that, in case of an outage, I can find out which events did not make it into storage; at which point I'll have to figure out an efficient mechanism to re-inject the missed log messages.
Created a PR to fix the retry issue in v0.11.1: https://github.com/timberio/vector/pull/6454
We just ran into this issue in our staging environment with the Elasticsearch sink. I was in the middle of reproducing it when I saw a fix submitted! 💯 I'd like to see the fix included in the 0.12 release; otherwise I think it would cause a lot of production issues with randomly dropped logs.
Just to add some context from what I've observed while trying to reproduce this bug: it happened for us when an established connection was severed. The next request that tried to use the severed connection would simply fail, and the events were dropped without any retry.

P.S. this doesn't appear to be a regression; the same problem exists in 0.11.0 as well.
Hi @drunkirishcoder, thanks for the comment.

I branched from the v0.11.1 branch and haven't looked into the 0.12 changes yet (planning to do that next, probably this weekend).
I believe the issue was introduced with this commit
And you are right: the retry fails when the error happens on the Vector side itself, for example a failure while establishing the connection.
The retry logic still works if we get a retryable HTTP error code like 500.
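For reference, status-code-based retries take a different path: the response arrives successfully, so the policy can inspect the status directly instead of downcasting an error. A minimal sketch of such a classification (function name and the exact set of codes are assumptions for illustration, not Vector's real `RetryLogic`):

```rust
// Sketch of status-based retry classification; the chosen codes are
// a common convention, not necessarily Vector's exact policy.
fn should_retry_status(status: u16) -> bool {
    match status {
        // Transient server-side conditions are worth retrying.
        429 | 500 | 502 | 503 | 504 => true,
        // Other statuses (success or permanent client errors)
        // won't benefit from a retry.
        _ => false,
    }
}

fn main() {
    assert!(should_retry_status(500));
    assert!(!should_retry_status(400));
}
```

This path is unaffected by the wrapping bug, which only bites when the request fails before any response exists.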
Hi @drunkirishcoder, update: I applied the code changes on the master branch, and the fix still works there. I haven't tested the Elasticsearch and ClickHouse sinks, as I don't have the necessary setup, but they use the same HTTP logic to send logs.
Vector Version
Vector Configuration File
Debug Output
Expected Behavior
The HTTP sink should respect the defined retry policy and retry sending requests when the remote HTTP server goes offline.
Actual Behavior
The HTTP sink didn't retry sending requests at all when the remote HTTP server went offline. Instead, it emitted error messages (see below) and dropped the requests immediately.
Example Data
With Vector v0.11.1, typing anything in the console caused Vector to emit the following error messages:
As a comparison, Vector v0.11.0 would emit the following log messages instead:
Additional Context
References