vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Http sink of Vector v0.11.1 didn't retry sending requests when remote http server got offline #6282

Closed j-bai closed 3 years ago

j-bai commented 3 years ago

Vector Version

vector 0.11.1 (863709d x86_64-unknown-linux-gnu 2021-01-26)

Vector Configuration File

data_dir = "/tmp/vector"
[sources.test_in]
type = "stdin"

[sinks.test_out]
type = "http"
inputs = [
"test_in",
]
healthcheck = false
uri = "http://192.168.57.104:9000/"
encoding = "text"
request.retry_attempts = 5

Debug Output

Expected Behavior

The HTTP sink should respect the configured retry policy and retry sending requests when the remote HTTP server goes offline.

Actual Behavior

The HTTP sink didn't retry sending requests at all when the remote HTTP server went offline. Instead, it emitted the error messages below and dropped the requests immediately.

Example Data

With Vector v0.11.1, typing anything in the console causes Vector to emit the following error messages:

asdfasdf
Jan 28 16:06:15.907 ERROR sink{component_kind="sink" component_name=test_out component_type=http}:request{request_id=0}: vector::sinks::util::retries: Unexpected error type; dropping the request. error=Failed to make HTTP(S) request
Jan 28 16:06:18.739 ERROR sink{component_kind="sink" component_name=test_out component_type=http}:request{request_id=0}: vector::sinks::util::sink: Request failed. error=Failed to make HTTP(S) request

For comparison, Vector v0.11.0 would emit the following log messages instead:

test
Jan 28 21:58:34.969 WARN sink{component_kind="sink" component_name=test_out component_type=http}:request{request_id=0}: vector::sinks::util::retries: Retrying after error: error trying to connect: tcp connect error: Connection refused (os error 111)
Jan 28 21:58:35.971 WARN sink{component_kind="sink" component_name=test_out component_type=http}:request{request_id=0}: vector::sinks::util::retries: Retrying after error: error trying to connect: tcp connect error: Connection refused (os error 111)
Jan 28 21:58:36.972 WARN sink{component_kind="sink" component_name=test_out component_type=http}:request{request_id=0}: vector::sinks::util::retries: Retrying after error: error trying to connect: tcp connect error: Connection refused (os error 111)
Jan 28 21:58:38.974 WARN sink{component_kind="sink" component_name=test_out component_type=http}:request{request_id=0}: vector::sinks::util::retries: Retrying after error: error trying to connect: tcp connect error: Connection refused (os error 111)
Jan 28 21:58:41.976 WARN sink{component_kind="sink" component_name=test_out component_type=http}:request{request_id=0}: vector::sinks::util::retries: Retrying after error: error trying to connect: tcp connect error: Connection refused (os error 111)
Jan 28 21:58:46.977 ERROR sink{component_kind="sink" component_name=test_out component_type=http}:request{request_id=0}: vector::sinks::util::retries: Retries exhausted; dropping the request. error=error trying to connect: tcp connect error: Connection refused (os error 111)
Jan 28 21:58:46.977 ERROR sink{component_kind="sink" component_name=test_out component_type=http}:request{request_id=0}: vector::sinks::util::sink: Request failed. error=error trying to connect: tcp connect error: Connection refused (os error 111)

Additional Context

References

nishantmalhotra13 commented 3 years ago

Probable cause of the issue:

If an error occurs while sending the HTTP request (e.g. src/http.rs line 109), a user-defined error of type CallRequest is raised, which is a wrapper around hyper::Error built with snafu. This error is of type HttpError. The error is caught in src/sinks/util/retries.rs at line 124, where error.downcast_ref::<L::Error>() tries to recover the concrete error type. If the type were hyper::Error (as it is for the HTTP sink's retry logic), the retry would happen, but because the error was wrapped as CallRequest, the generic error handling at line 140 runs instead and the request is dropped without retrying.
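
A minimal, self-contained sketch of that mismatch. InnerConnectError and CallRequest below are illustrative stand-ins, not the actual hyper or Vector types:

use std::error::Error as StdError;
use std::fmt;

// Stands in for hyper::Error: the concrete type the retry logic knows
// how to recognize and retry.
#[derive(Debug)]
struct InnerConnectError;

impl fmt::Display for InnerConnectError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "tcp connect error: Connection refused")
    }
}
impl StdError for InnerConnectError {}

// Stands in for the snafu-generated CallRequest wrapper around hyper::Error.
#[derive(Debug)]
struct CallRequest(InnerConnectError);

impl fmt::Display for CallRequest {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Failed to make HTTP(S) request: {}", self.0)
    }
}
impl StdError for CallRequest {}

fn main() {
    // The sink surfaces the failure as an opaque boxed error.
    let err: Box<dyn StdError + 'static> = Box::new(CallRequest(InnerConnectError));

    // The retry logic downcasts to the inner type it expects...
    if err.downcast_ref::<InnerConnectError>().is_some() {
        println!("recognized error type: retrying the request");
    } else {
        // ...but the wrapper hides it, so the generic path drops the request.
        println!("Unexpected error type; dropping the request.");
    }
}

Running this takes the "Unexpected error type" branch, mirroring the v0.11.1 log above.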

harningt commented 3 years ago

This behavior also seems to be the case for the 0.11.0 Docker image (for the loki sink, though it fails in the HTTP request section as well). Using 0.10, I get this sort of error if it catches it just as Loki is started again:

Feb 07 03:09:11.651  WARN sink{name=loki_standard type=loki}:request{request_id=0}: vector::sinks::util::retries2: retrying after error: error trying to connect: tcp connect error: Connection refused (os error 111)
Feb 07 03:09:16.657 ERROR sink{name=loki_standard type=loki}:request{request_id=0}: vector::sinks::util::retries2: encountered non-retriable error. error=connection error: Connection reset by peer (os error 104)
Feb 07 03:09:16.658 ERROR sink{name=loki_standard type=loki}:request{request_id=0}: vector::sinks::util::sink: Request failed. error=connection error: Connection reset by peer (os error 104)

I'm hoping that putting this behind an authenticating web proxy will reduce the chance of this causing an error outside of explicit retry testing, but it would be nice for total failures to be retried. I will definitely need to tune the configuration file so that, in case of an outage, I can find out which events did not make it into storage, at which point I'll have to figure out an efficient mechanism to re-inject the missed log messages.

nishantmalhotra13 commented 3 years ago

Created a PR to fix the retry issue on the v0.11.1 branch: https://github.com/timberio/vector/pull/6454

drunkirishcoder commented 3 years ago

We just ran into this issue in our staging environment with the Elasticsearch sink. I was in the middle of reproducing it when I saw a fix submitted! 💯 I would like to see this fix included in the 0.12 release; otherwise I think it would create a lot of production issues with randomly dropped logs.

Just to add some context from what I've observed while trying to reproduce this bug: it happened for us when an established connection was severed. The next request that tries to use the severed connection simply fails and causes the events to be dropped without any retry.

P.S. This doesn't appear to be a regression; the same problem exists in 0.11.0 as well.

nishantmalhotra13 commented 3 years ago

Hi @drunkirishcoder
Thanks for the comment. I branched from the v0.11.1 branch and haven't looked into the 0.12 changes yet (planning to do that next, probably this weekend). I believe the issue was introduced with this commit. And you are right, the retry fails when the error happens on the Vector side itself; it can be a failure while establishing the connection. The retry logic still works if we get a retryable HTTP error code like 500.
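
A rough sketch of that asymmetry. SinkOutcome and should_retry below are illustrative, not Vector's actual RetryLogic: a 5xx status on a completed response is recognized as retryable, while a transport-level failure is retried only if the error downcasts to the concrete type the logic expects.

use std::error::Error;

// Illustrative model of how the retry decision differs between an HTTP
// response with an error status and a request that failed before any
// response was received.
enum SinkOutcome {
    // The server answered; the status code decides retryability.
    Response { status: u16 },
    // The request never completed (e.g. the TCP connect failed) and the
    // failure reaches the retry logic as an opaque boxed error.
    Failed(Box<dyn Error>),
}

fn should_retry(outcome: &SinkOutcome) -> bool {
    match outcome {
        // 5xx responses are treated as retryable, which is why the
        // "retryable HTTP error code like 500" path still works.
        SinkOutcome::Response { status } => *status >= 500 && *status < 600,
        // A connection-level failure is retried only when it downcasts to
        // the concrete type the logic expects; a wrapped error (like the
        // CallRequest case above) returns None here and gets dropped.
        SinkOutcome::Failed(err) => err.downcast_ref::<std::io::Error>().is_some(),
    }
}

fn main() {
    assert!(should_retry(&SinkOutcome::Response { status: 500 }));
    assert!(!should_retry(&SinkOutcome::Response { status: 400 }));

    let connect_refused: Box<dyn Error> = Box::new(std::io::Error::new(
        std::io::ErrorKind::ConnectionRefused,
        "Connection refused",
    ));
    assert!(should_retry(&SinkOutcome::Failed(connect_refused)));
    println!("5xx retried; unrecognized transport errors would be dropped");
}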

nishantmalhotra13 commented 3 years ago

Hi @drunkirishcoder, an update: I applied the code changes to the master branch, and the fix still works. I haven't tested the Elasticsearch and ClickHouse sinks, as I don't have the necessary setup, but they use the same HTTP logic to send logs.