vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Add Retry Behaviour to AWS Kinesis Data Firehose Sink #12835

Open mattaltberg opened 2 years ago

mattaltberg commented 2 years ago

Use Cases

Currently, if any records are throttled by AWS, Vector does not retry sending them to Firehose; the data is simply dropped. We should add retry behaviour to ensure throttled records are eventually delivered.

Attempted Solutions

No response

Proposal

No response

References

No response

Version

No response

jszwedko commented 2 years ago

Hi @mattaltberg! What version of Vector were you using when you observed this behavior?

mattaltberg commented 2 years ago

Hey @jszwedko, I'm using 0.16.1. I'll ping you again after I try using the latest version.

jszwedko commented 2 years ago

Thanks @mattaltberg! I am curious about the newest version (0.21.2), since 0.21.0 switched to the new AWS SDK and started handling throttling responses consistently.

mattaltberg commented 2 years ago

@jszwedko I tested with 0.21.2 and still hit the same throttling issue with my Firehose stream. Is it possible the new throttling handling isn't used for the Firehose sink?

jszwedko commented 2 years ago

Interesting. Are you seeing throttle messages in Vector's logs? Can you share the log output?

The aws_kinesis_firehose sink appears to use the same retry logic, which includes checking for throttling responses. It is wired up here:

https://github.com/vectordotdev/vector/blob/498a4c823282e772221911864772ece7b4df258f/src/sinks/aws_kinesis_firehose/config.rs#L258-L270

which calls:

https://github.com/vectordotdev/vector/blob/498a4c823282e772221911864772ece7b4df258f/src/aws/mod.rs#L26-L63
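
To make the discussion above concrete, here is a minimal sketch of the kind of response classification such retry logic performs. This is illustrative only and does not reproduce the linked code; the `SdkError` type and the error code string are hypothetical stand-ins.

```rust
// Illustrative sketch only; not the code linked above. `SdkError` and the
// error code string are hypothetical stand-ins for what the retry logic
// actually inspects.
struct SdkError {
    code: Option<String>,
    message: String,
}

enum RetryAction {
    // Transient failure: the request is re-enqueued and sent again.
    Retry(String),
    // Permanent failure: the request is dropped and an error is emitted.
    DontRetry(String),
}

fn classify(error: &SdkError) -> RetryAction {
    match error.code.as_deref() {
        // Throttling-style codes are treated as transient and retried.
        Some("ThrottlingException") => RetryAction::Retry(error.message.clone()),
        // Everything else is treated as permanent.
        _ => RetryAction::DontRetry(error.message.clone()),
    }
}
```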

jszwedko commented 2 years ago

If you are able to re-run with the environment variable VECTOR_LOG=aws_smithy_http=trace, Vector should output trace messages from the AWS SDK showing the responses coming back from AWS.

mattaltberg commented 2 years ago

@jszwedko The oddness continues. I'm not seeing any throttling exceptions, but I am seeing plenty of ServiceUnavailableException errors:

2022-05-24T16:36:43.633946Z TRACE send_operation{operation="PutRecordBatch" service="firehose"}:load_response: aws_smithy_http::middleware: http_response=Response { status: 200, version: HTTP/1.1, headers: {"x-amzn-requestid": "c31f7dea-c015-1430-9818-763094685e86", "x-amz-id-2": "uWdgQME8lqQP+rR4lAa2AFXG4eTXIRgbYzNmTSZJmE8kiZAiJ00qDEgI+PKYE/MsJjpFtqlAg79oz2fC47BN28eYMPQEKAqf", "content-type": "application/x-amz-json-1.1", "content-length": "2124", "date": "Tue, 24 May 2022 16:36:43 GMT"}, body: b"{\"Encrypted\":false,\"FailedPutCount\":12,\"RequestResponses\":[{\"ErrorCode\":\"ServiceUnavailableException\",\"ErrorMessage\":\"Slow down.\"},{\"ErrorCode\":\"ServiceUnavailableException\",\"ErrorMessage\":\"Slow down.\"},{\"ErrorCode\":\"ServiceUnavailableException\",\"ErrorMessage\":\"Slow down.\"},{\"RecordId\":\"O9udtn6NoXmtLlk1cDIlGtceB7ZpAOhv2f+sBNg50TsSJMUTUHOz85GYSkSjnCvq05dr1OsEwwSzUPm9lGUmxtXPp50tzAPp5rUwaA3DV+3fBmxZjklJtrJf/kDCJJuRWdoTrkmIUjD9UZcxxANnTWZSaI3MyJr47ZJD9zyKI3190KmoCP8P8eXVB4Aor82/GAgCIqevqqU4lPb8r3dQvh292Us1BgDG\"},{\"RecordId\":\"pzfSBJvZEMAumH1LGhEw/Z29V5S8SfbVwQAUkfzkIehEA4EJeRD78kd0uEJRElbK5JQ0S3Xfc4WUanXNbWne8hCSNXik1DNeWhGQlFl9UiWcPsPHQrO8ccqLpPItB8ZhlHi8yf1iFT6795qF+a0rUROHUnxjuBKd+UmNGAfYeb0P5QqITX0jrB2B4gM9qRCpR1ZITJgKTUrGNqRuyFIfCmGTaHDa0haA\"},{\"ErrorCode\":\"ServiceUnavailableException\",\"ErrorMessage\":\"Slow down.\"},{\"ErrorCode\":\"ServiceUnavailableException\",\"ErrorMessage\":\"Slow down.\"},{\"ErrorCode\":\"ServiceUnavailableException\",\"ErrorMessage\":\"Slow down.\"},{\"RecordId\":\"NhvJnZSom65Mf3izcf65SujsuJRSjJ+PgUQRFBI1iddKqd3lx4SlR9J1SQVmoYj0G4op6ZrDbB3MqROoMWHIsUPB5CXFA713Ubg9Yo4KAkyQaUUiVWaXji/noADLL0US7/TR8f87SWlqc7eujli2rrM50GMGkeoP1txB6+MGxIH74MPWGNhaDd6dM1yDThBaUOVHt/excooPGSKq/M6gLeQrmen2l+z0\"},{\"RecordId\":\"2anxfuWOKQ2TMcILhDIWRIxwqjIc2f0kBULRuxUMZE0KWc4k9w9jET9/yn0i8XPd85hhF/d74oPw4r6eigYShOChieRVJ+b9eu7Eb8hgpomOxMlGPF/ytDWn4ktQgpCwO950kLN3hGfcVjsvvqfBTJEnZfy/bZB7Iv8x9/LeImfqb69Q7Vf5MkN9/ZqyuvMHfEIK9/BPah2++mR+BWQaYUIjHhqGGb/P\"},{\"ErrorCode\":\"ServiceUnavailableException\",\"ErrorMessage\":\"Slow down.\"},{\"ErrorCode\":\"ServiceUnavailableException\",\"ErrorMessage\":\"Slow down.\"},{\"ErrorCode\":\"ServiceUnavailableException\",\"ErrorMessage\":\"Slow down.\"},{\"ErrorCode\":\"ServiceUnavailableException\",\"ErrorMessage\":\"Slow down.\"},{\"RecordId\":\"P19NKA8A4/tnfqkEPV0phkFa40dDpsl2cSwiSXWVOHMYhoLVgmpqz4E+yh9LnDIYA7GQDzTjpHp0UQQI+TB7d4u8vZpk7Zg05JOTPhOlnPLmJVNbEnOUnOE/Gg5c60wxLkhsSlO8C24AsEzokNL0B2DMyLnIzW3SFzUFY8VcNCgOWsNG2KymDZ0REIwfRdYmJLbIjjwR4fG8xTjt4THCNoJg2KNg7W76\"},{\"ErrorCode\":\"ServiceUnavailableException\",\"ErrorMessage\":\"Slow down.\"},{\"ErrorCode\":\"ServiceUnavailableException\",\"ErrorMessage\":\"Slow down.\"}]}" }

The thing is, when I check the AWS Console, I can see a bunch of records in the Throttled Records chart, which surprisingly does not show up in the SDK errors.

jszwedko commented 2 years ago

Interesting, thanks for sharing the response @mattaltberg! It seems like maybe we should be looking for ServiceUnavailableException in that code block I linked, too. cc/ @fuchsnj for thoughts.
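
For illustration, the extra check could look roughly like the sketch below. This is not Vector's actual code; `BatchEntry` is a hypothetical stand-in for the per-record entries in the PutRecordBatch response body.

```rust
// Illustrative sketch, not Vector's code. `BatchEntry` is a hypothetical
// stand-in for the per-record entries in a PutRecordBatch response body.
struct BatchEntry {
    error_code: Option<String>,
}

// A PutRecordBatch call can return HTTP 200 and still fail individual
// records, so the per-record error codes have to be inspected in addition
// to the HTTP-level/SDK error.
fn has_retryable_record_errors(failed_put_count: i64, entries: &[BatchEntry]) -> bool {
    failed_put_count > 0
        && entries
            .iter()
            .any(|entry| entry.error_code.as_deref() == Some("ServiceUnavailableException"))
}
```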

jszwedko commented 2 years ago

Also noticing that it returns one response per record; in your log, some of the records went in. It seems like we'll need to handle retrying only a subset of the records (similar to the Elasticsearch sink). At the least, for an initial implementation, we could retry the full request.
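
As a rough illustration of the subset approach (hypothetical types, not the actual sink implementation), the retry could keep only the entries that failed:

```rust
// Hypothetical sketch of partial-batch retry; not the actual sink code.
struct BatchEntry {
    error_code: Option<String>,
}

// Returns the indices of records that failed and should be resent, so that
// successfully delivered records are left out of the next attempt.
fn failed_indices(entries: &[BatchEntry]) -> Vec<usize> {
    entries
        .iter()
        .enumerate()
        .filter(|(_, entry)| entry.error_code.is_some())
        .map(|(index, _)| index)
        .collect()
}

fn main() {
    let responses = vec![
        BatchEntry { error_code: Some("ServiceUnavailableException".to_string()) },
        BatchEntry { error_code: None },
        BatchEntry { error_code: Some("ServiceUnavailableException".to_string()) },
    ];
    // Only records 0 and 2 would be queued for the retry.
    assert_eq!(failed_indices(&responses), vec![0, 2]);
}
```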

mattaltberg commented 2 years ago

Yeah, if you want more information: my setup uses a unix_stream socket source reading from /var/log/output.sock. Events then get transformed (a newline is appended to the end of the message) before the Firehose sink receives them.

mattaltberg commented 2 years ago

Hey @jszwedko, any updates? I'm wondering whether I'll need to keep upping my Firehose quota.

jszwedko commented 2 years ago

Hi @mattaltberg! Unfortunately nothing yet, but this is in the backlog.

dengmingtong commented 1 year ago

Has this issue been fixed yet? Or is there a plan to fix it?

jszwedko commented 1 year ago

It's not fixed yet, but a contributor has a PR open for it: https://github.com/vectordotdev/vector/pull/16771