Consider expanding the cases where Vector retries requests

jszwedko commented 2 years ago

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Current behavior

Vector's retry behavior varies by sink, but typically we retry requests that are viewed to be temporary failures that we expect to recover. This includes things like HTTP 503s and 429s. This does not include failures that are viewed as non-temporary like HTTP 403s and 404s. Events in these requests are dropped and Vector continues processing.

Possible issue

The above means that Vector drops events under circumstances like:

Misconfiguration in Vector like using an invalid Datadog API key
Misconfiguration in upstreams such as a deploy of an internal Elasticsearch instance causing it to start returning 403s where previously the credentials were valid or provider outages like when a GCP outage caused customer load balancers to start returning 404s

Under these circumstances, I think it is reasonable for users to not want Vector to drop events, but just retry and block processing until a human can intervene and fix the issue.

Idea

One idea would be to retry all failures up until a maximum number of retries after which failed batches would be routed to a dead-letter queue (unimplemented).

I'm curious to hear other thoughts here though. This is a nuanced issue.

Refs

jerome-kleinen-kbc-be commented 2 years ago

Next to the retry behavior (number of attempts etc.) perhaps the http status codes to retry could be configurable, of course with sensible defaults. I am struggling to picture how this could look, I guess the configuration will quickly become difficult to read/understand.

I do like your suggestion to route events to a new queue after having failed x retry attempts.

akutta commented 2 years ago

I could see scenarios where dead-letters are useful; however, I'd like to see it configurable.

When using the Elastic sink ( #10839 ) we recently ran into an issue where the response was:

ERROR sink{component_id=out_aws component_kind="sink" component_type=elasticsearch component_name=out_aws}:request{request_id=196897}: vector::sinks::elasticsearch::retry: Response contained errors. response=Response { status: 200, version: HTTP/1.1, headers: {"date": "Mon, 07 Feb 2022 03:11:13 GMT", "content-type": "application/json; charset=UTF-8", "content-length": "1245855", "connection": "keep-alive", "access-control-allow-origin": "*"}, body: b"{\"errors\":true,\"items\":[{\"index\":{\"_id\":null,\"status\":403,\"error\":{\"type\":\"index_create_block_exception\",\"reason\":\"blocked by: [FORBIDDEN/10/cluster create-index blocked (api)];\"}}}

In this case, rebalancing shards across our Elastic Domain resolved the underlying issue. We are indexing at a rate of ~150M/min so a dead-letter queue becomes less useful. If instead we applied back-pressure and stopped acking entirely, our messages would have been queued upstream.

Note: This was the internal status. The Request status was a 200.

jszwedko commented 2 years ago

More context in https://github.com/vectordotdev/vector/issues/655

jszwedko commented 2 years ago

Per https://github.com/vectordotdev/vector/pull/13130#issuecomment-1155201793 we should also see if we can encapsulate this logic to share it for HTTP clients.

jszwedko commented 2 years ago

kevinpark1217 commented 2 years ago

@jszwedko On #13414, you mentioned this work might be picked up on Q3. Is there any plan for this to be implemented?

jszwedko commented 2 years ago

Hi @kevinpark1217 ! This is still on our nearterm roadmap, but may not make it in Q3.

unkempthenry commented 1 year ago

@jszwedko is this still on the roadmap? Also interested in this for the Splunk sink like in #13414. Splunk cloud will return 404s very intermittently from their load balancers.

jszwedko commented 1 year ago

@jszwedko is this still on the roadmap? Also interested in this for the Splunk sink like in #13414. Splunk cloud will return 404s very intermittently from their load balancers.

Hey! We plan to write an RFC around this this quarter.

yoelk commented 1 year ago

@jszwedko our configuration is Kafka->Filter->S3 with acknowledgements, and like others have mentioned, events are dropped in certain cases. The one case we checked (which was the easiest) is setting the wrong role name in AWS.

I've seen in the description that events would be dropped depending on the HTTP response the sink gets, and whether it's seen as temporary or not. Until this issue is resolved, I'd like to evaluate the risk for our clients' data, depending on each scenario (wrong password being one of them).

Is there a place where I can see the different cases? and which ones will result un dropped events? Thanks a lot!

jszwedko commented 1 year ago

I responded in Discord, but unfortunately doing this sort of survey would mean spelunking through the source code.

cameronbraid commented 10 months ago

Is this still on the roadmap?

jszwedko commented 10 months ago

Is this still on the roadmap?

It's definitely still on our radar as high impact area to improve, but probably nothing happening on this before the end of the year given competing priorities. We are open to seeing PRs addressing this for individual sinks though. We've seen some already 🙂

yoelk commented 10 months ago

@jszwedko Can you please point out the PRs you mentioned? If the effort is not big, we might be interested in creating such PRs for aws_s3 and azure_blob sinks. Would be great to have a reference for such an effort 🙏

jszwedko commented 10 months ago

Hi @yoelk ,

That would great! Thanks for the interest! Here's a couple of examples:

For the two sinks you mentioned it would mean updating:

Hopefully this helps get you going in the right directions. Just let us know if you need some more pointers!

jiaozi07 commented 7 months ago

elasticsearch Is this still on the roadmap?

suikast42 commented 7 months ago

The same is if you use journal_d source and nats sink. Failed logs to send nats are acked by vector.

danielcollishaw commented 3 months ago

Hello @jszwedko! I hope I am not bothering or reviving a dead thread, but is this still on the road map?

I am using Vector to push to S3 in a storage intensive environment with an in memory buffer. We are rotating AWS STS credentials with a 2 hour buffer period to avoid writing with invalid credentials and losing data within the buffer. We would like to bring the credential duration down for security concerns and do away with the buffer period.

We were wondering if retrying a 403 until new credentials are provided, or a retry limit is hit, would help us work towards consistency without committing to a disk buffer? I appreciate any insight and would be happy to help in any way.

jszwedko commented 3 months ago

Hello @jszwedko! I hope I am not bothering or reviving a dead thread, but is this still on the road map?

I am using Vector to push to S3 in a storage intensive environment with an in memory buffer. We are rotating AWS STS credentials with a 2 hour buffer period to avoid writing with invalid credentials and losing data within the buffer. We would like to bring the credential duration down for security concerns and do away with the buffer period.

We were wondering if retrying a 403 until new credentials are provided, or a retry limit is hit, would help us work towards consistency without committing to a disk buffer? I appreciate any insight and would be happy to help in any way.

No worries! We are still chipping away at this as we go, but we haven't been able to make a concerted effort.

For the AWS S3 sink, specifically, when using STS the AWS SDK should refresh the credentials prior to expiration so you shouldn't see any 403s.

danielcollishaw commented 3 months ago

For the AWS S3 sink, specifically, when using STS the AWS SDK should refresh the credentials prior to expiration so you shouldn't see any 403s.

Sorry I should have been a bit more descriptive onto why they can expire, we assume the STS role elsewhere and push the returned credentials to the constrictive hosts. Sometimes this push can lag out of sync, which is why we are using a buffer period on the duration of the role.

Thank you for the ongoing progress and the update on the roadmap, appreciate the speedy response!

jszwedko commented 3 months ago

Aha I see. It looks like the AWS components don't currently retry failed credentials requests, but I think it'd be a relatively straightforward change to https://github.com/vectordotdev/vector/blob/master/src/aws/mod.rs if you (or others) are interested.

danielcollishaw commented 3 months ago

I actually have a period of time next week that I may be able to allocate into this. Will look into contribution docs!

coutug commented 2 months ago

Hey @jszwedko, just hit the issue with 404 response for the http sink. Is there any update with that specific sink?

We currently run a k8s cluster and a other nginx servers outside the cluster. I have an vector instance on each nginx server to send its access logs to the k8s cluster's vector instance via the cluster's ingress. That way I get a centralized view of all the logs (nginx and k8s). While testing for points of failure, I realized that everything works except for when the k8s vector crashes since the ingress returns a 404 reponse to the nginx vector. Any advice for that situation?

jszwedko commented 2 months ago

Hey @jszwedko, just hit the issue with 404 response for the http sink. Is there any update with that specific sink?

We currently run a k8s cluster and a other nginx servers outside the cluster. I have an vector instance on each nginx server to send its access logs to the k8s cluster's vector instance via the cluster's ingress. That way I get a centralized view of all the logs (nginx and k8s). While testing for points of failure, I realized that everything works except for when the k8s vector crashes since the ingress returns a 404 reponse to the nginx vector. Any advice for that situation?

Would it be possible to have the ingress return a 503 instead (no upstream available?). Otherwise, I think we'd accept a contribution to the http sink to retry 404s.

coutug commented 2 months ago

Ah yes, found a way to return the proper status code from the ingress controller. Thanks a lot for the help!

vectordotdev / vector