terricain / aioboto3

Wrapper to use boto3 resources with the aiobotocore async backend
Apache License 2.0

Kinesis Firehose InvalidSignatureException on EC2 instances after prolonged running #220

Closed surculus12 closed 3 years ago

surculus12 commented 3 years ago

Description

I am trying to consistently put data into a Kinesis Firehose pipe. Whenever I receive some data through a websocket, I create a task using asyncio.create_task that takes that object and performs a put on the Firehose client (which is created once via a contextlib context manager, because it's too expensive to create it every single time). This works amazingly for a while, until eventually it begins to throw errors:

botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the PutRecord operation: Signature expired: 20210119T033327Z is now earlier than 20210119T033329Z (20210119T033829Z - 5 min.)
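For reference, the timestamps in that message can be compared directly. The snippet below is only a reading aid: it assumes (from the wording of the error) that the first timestamp is when the request was signed and that the one in parentheses is the service's clock when it checked the request. The name `parse` is a helper made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y%m%dT%H%M%SZ"  # compact UTC timestamp format used in the error text


def parse(ts: str) -> datetime:
    """Parse a timestamp such as 20210119T033327Z into an aware UTC datetime."""
    return datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc)


signed_at = parse("20210119T033327Z")   # assumed: time the request was signed
server_now = parse("20210119T033829Z")  # assumed: service clock at validation
window = timedelta(minutes=5)           # the "5 min." allowance in the message

age = server_now - signed_at
print(age)           # 0:05:02
print(age > window)  # True
```

Under that reading, the request reached the service 5 minutes and 2 seconds after it was signed, which is 2 seconds past the allowance.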

As far as I can tell, this cannot be due to time drift on the EC2 instance, because if I restart the program it will start chugging data out like nothing happened. I am concerned that the nature of the asyncio context here is causing a delay on messages, so they don't arrive soon enough to be accepted by Firehose or something like that, but I'm not familiar enough with the signature mechanics to say.

NOTE: I am sharing a single Firehose client for the whole program, and creating multiple put tasks at the same time. I was unable to create multiple Firehose clients, as this would cause botocore to throw "no credentials found" errors once enough clients had been created, which seems bizarre. Perhaps my approach is wrong, and I should only be awaiting a single put on each client and keep a pool of clients, but then we might need to look at the credential error issue.

terricain commented 3 years ago

Using a single client should be fine, so I think this is coming from aiobotocore. Can you run a test to see if this happens with aiobotocore without using aioboto3?

Are you just running this on a plain old EC2 instance, with no Kubernetes or ECS involved?

surculus12 commented 3 years ago

Hey! I have two theories for why this was occurring. I've since restructured things using aiobotocore directly for more control, and I no longer have the problem.

  1. Spikes in data coming through the socket I was passing along to Firehose caused tasks to have to wait for CPU time. Assuming they yielded to the event loop after creating a signature, the signature could expire before the request had finished sending. I'm not familiar enough with the underlying logic to know if that makes sense.
  2. Re-use of clients caused some unintended consequences. It's not clear if that's possible.
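One way to check the first theory would be to measure how late the event loop wakes up sleeping tasks while the program is under load. This is only a rough sketch using the standard library: `measure_loop_lag` and `demo` are hypothetical names, and the blocking `time.sleep` stands in for a burst of CPU-bound work. It does not show that lag was the actual cause here.

```python
import asyncio
import time


async def measure_loop_lag(interval: float = 0.05, samples: int = 10) -> float:
    """Return the worst observed delay (seconds) beyond the requested sleep."""
    lags = []
    for _ in range(samples):
        start = time.monotonic()
        await asyncio.sleep(interval)
        # Anything beyond `interval` is time the loop was busy elsewhere.
        lags.append(time.monotonic() - start - interval)
    return max(lags)


async def demo() -> float:
    monitor = asyncio.create_task(measure_loop_lag())
    await asyncio.sleep(0.01)  # let the monitor begin its first sleep
    time.sleep(0.3)            # simulated burst that blocks the event loop
    return await monitor


print(f"worst lag: {asyncio.run(demo()):.3f}s")
```

If a monitor like this reported lag anywhere near the five-minute window in the real service, that would support the theory; small lags would point elsewhere.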

The current structure feeds everything into a central queue, with a pool of Firehose clients that batch-put records. This reduces the load significantly and runs without a hitch. It'll be a big hassle to run the old version again to debug, but I think I was just using the libraries in a dumb way.
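A minimal sketch of that queue-and-batch structure, not the actual code from my service. `batch_worker`, `send_batch`, `STOP` and `demo` are names invented for the illustration. In a real program `send_batch` would wrap the Firehose client's `put_record_batch` call, which accepts at most 500 records per request. Here a fake sender is used so the example runs without AWS.

```python
import asyncio
from typing import Awaitable, Callable, List

MAX_BATCH = 500   # Firehose PutRecordBatch limit on records per call
STOP = object()   # hypothetical sentinel telling a worker to exit


async def batch_worker(
    queue: asyncio.Queue,
    send_batch: Callable[[List[bytes]], Awaitable[None]],
    max_batch: int = MAX_BATCH,
) -> None:
    """Wait for one record, then drain whatever else is queued into a batch."""
    while True:
        item = await queue.get()
        if item is STOP:
            return
        batch = [item]
        stop = False
        while len(batch) < max_batch:
            try:
                item = queue.get_nowait()
            except asyncio.QueueEmpty:
                break
            if item is STOP:
                stop = True
                break
            batch.append(item)
        await send_batch(batch)  # one request per batch instead of per record
        if stop:
            return


async def demo() -> List[int]:
    sent: List[List[bytes]] = []

    async def fake_send(batch: List[bytes]) -> None:
        # Stand-in for a call such as client.put_record_batch(...).
        sent.append(list(batch))

    queue: asyncio.Queue = asyncio.Queue()
    for i in range(1200):
        queue.put_nowait(b"record-%d" % i)
    queue.put_nowait(STOP)
    await batch_worker(queue, fake_send)
    return [len(b) for b in sent]


print(asyncio.run(demo()))  # [500, 500, 200]
```

To run a pool, start several `batch_worker` tasks on the same queue, each with its own client, and enqueue one `STOP` per worker at shutdown.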

terricain commented 3 years ago

Yeah, it's entirely possible if it's busy enough to delay the renewal of credentials.