Kinesis Firehose InvalidSignatureException on EC2 instances after prolonged running

surculus12 commented 3 years ago

Async AWS SDK for Python version: aioboto3==8.2.0
Python version: Python 3.7.9
Operating System: Amazon Linux 2, arm64, Kernel Linux 4.14.209-160.339.amzn2.aarch64

Description

I am trying to consistently put data into a Kinesis Firehose pipe. Whenever I receive some data through a websocket, I create a task using asyncio.create_task that takes that object and performs a put on the Firehose client (which is created once using a contextlib context manager, because it's too expensive to create it every single time). This works amazingly for a while, until eventually it begins to throw errors:

botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the PutRecord operation: Signature expired: 20210119T033327Z is now earlier than 20210119T033329Z (20210119T033829Z - 5 min.)
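For readers unfamiliar with this message, here is a small sketch that works through the three timestamps it contains. It assumes the usual reading of the error: the first timestamp is when the request was signed, the one in parentheses is the service's clock when the request was checked, and the middle one is that clock minus the 5 minute allowance. The variable names are made up for illustration.

```python
from datetime import datetime

# Timestamp format used in the error message (UTC, compact ISO 8601).
FMT = "%Y%m%dT%H%M%SZ"

# Values copied from the error message above.
signed_at = datetime.strptime("20210119T033327Z", FMT)   # when the request was signed
cutoff = datetime.strptime("20210119T033329Z", FMT)      # oldest signature still accepted
server_now = datetime.strptime("20210119T033829Z", FMT)  # service clock at check time

window = server_now - cutoff   # allowance the service applies
age = server_now - signed_at   # how old the signature was on arrival

print(window)  # 0:05:00
print(age)     # 0:05:02
```

Under that reading, the request reached Firehose about 5 minutes and 2 seconds after it was signed, so it missed the window by 2 seconds.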

As far as I can tell, this cannot be due to time drift on the EC2 instance, because if I restart the program it starts chugging data out again like nothing happened. I am concerned that the nature of the asyncio context here is delaying messages, so they don't arrive soon enough to be accepted by Firehose or something like that, but I'm not familiar enough with the signature mechanics to say.

NOTE: I am sharing one Firehose client across the whole program, and creating multiple put tasks at the same time. I was unable to create multiple Firehose clients, as this would cause botocore to throw "no credentials found" errors (only once enough clients had been created, which seems bizarre). Perhaps my approach is wrong, and I should only be awaiting a single put on each client and keep a pool of clients, but then we might need to look at the credential error issue.

terricain commented 3 years ago

Using a single client should be fine, so I think this is coming from aiobotocore. Can you run a test to see if this happens with aiobotocore alone, without using aioboto3?

Are you just running this on a plain old EC2 instance, no Kubernetes or ECS involved?

surculus12 commented 3 years ago

Hey! I have two theories for why this was occurring. I've since restructured things using aiobotocore for more control and no longer have the problem.

Spikes in data coming through the socket I was passing along to Firehose caused tasks to have to wait for CPU time. Assuming they yielded to the event loop after creating a signature, the signature could expire before the request was actually sent. I'm not familiar enough with the underlying logic to know if that makes sense.
Re-use of clients caused some unintended consequences. It's not clear to me whether that's possible.
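The first theory can be illustrated with plain asyncio, no AWS involved. This is only a sketch of the mechanism: `sign_then_send` and `cpu_hog` are hypothetical stand-ins, and whether botocore really yields between signing and sending is an assumption, not something confirmed in this thread.

```python
import asyncio
import time


async def sign_then_send():
    # Stand-in for computing a signature: just remember when it happened.
    signed_at = time.monotonic()
    # Yield to the event loop before "sending" (assumed, see lead-in).
    await asyncio.sleep(0)
    # By the time we resume, other tasks may have held the loop for a while.
    return time.monotonic() - signed_at


async def cpu_hog(seconds):
    # Blocking call: nothing else on the loop can run until it returns.
    time.sleep(seconds)


async def main():
    put = asyncio.create_task(sign_then_send())
    hog = asyncio.create_task(cpu_hog(0.2))
    delay = await put
    await hog
    return delay


delay = asyncio.run(main())
print(f"gap between signing and sending: {delay:.3f}s")
```

Here a single 0.2 second blocker is enough to open a gap of that size between signing and sending. With thousands of queued tasks each doing a little work, the same effect could in principle add up to minutes.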

The current structure feeds everything into a central queue, with a pool of Firehose clients that batch put records. This reduces the load significantly and runs without a hitch. It'll be a big hassle to run the old version again to debug, but I think I was just using the libraries in a dumb way.
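A rough sketch of what such a queue-plus-pool structure might look like, not the actual code from this issue. `FakeFirehose` is a hypothetical stand-in so the example runs without AWS; a real client would be an aiobotocore Firehose client, whose `put_record_batch` takes the same `DeliveryStreamName` and `Records` arguments. The stream name and pool size are made up.

```python
import asyncio

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


class FakeFirehose:
    """Hypothetical stand-in for a Firehose client; remembers batch sizes."""

    def __init__(self):
        self.batches = []

    async def put_record_batch(self, DeliveryStreamName, Records):
        await asyncio.sleep(0)  # pretend to do network I/O
        self.batches.append(len(Records))
        return {"FailedPutCount": 0}


async def worker(queue, client, stream_name):
    while True:
        # Wait for at least one record, then drain whatever else is ready.
        batch = [await queue.get()]
        while len(batch) < MAX_BATCH and not queue.empty():
            batch.append(queue.get_nowait())
        try:
            await client.put_record_batch(
                DeliveryStreamName=stream_name,
                Records=[{"Data": item} for item in batch],
            )
        finally:
            for _ in batch:
                queue.task_done()


async def main():
    queue = asyncio.Queue()
    clients = [FakeFirehose() for _ in range(2)]  # small pool of clients
    workers = [
        asyncio.create_task(worker(queue, c, "example-stream")) for c in clients
    ]
    # Stand-in for messages arriving from the websocket.
    for i in range(1200):
        queue.put_nowait(f"message {i}\n".encode())
    await queue.join()  # wait until every record has been handed off
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return [c.batches for c in clients]


batches = asyncio.run(main())
print(batches)
```

The point of the design is that the number of in-flight requests is bounded by the pool size rather than by the incoming message rate, so a spike in traffic makes batches larger instead of piling up tasks. A real version would also need to check `FailedPutCount` in the response and retry the failed records.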

terricain commented 3 years ago

Yeah, it's entirely possible, if it's busy enough to delay the renewal of credentials.

terricain / aioboto3

Kinesis Firehose InvalidSignatureException on EC2 instances after prolonged running #220
