rehanvdm / serverless-website-analytics

A CDK construct that consists of a serverless backend, frontend and client side code to track website analytics
GNU General Public License v2.0

Explore using AWS Timestream instead of Athena #64

Closed by rehanvdm 8 months ago

rehanvdm commented 8 months ago

Exploration

Architecture

The https://github.com/rehanvdm/serverless-website-analytics/tree/experiment-timestream branch contains the experiment/poc with Amazon Timestream. The PR showing the changes can be seen here.


The ingest API stays the same and writes to a Kinesis Firehose with a buffer window of 1 minute, which stores the data as JSONL (JSON Lines) in an S3 bucket. The S3 bucket sends object-created notifications to an SQS queue, and the queue triggers a Lambda function that writes the records into Timestream in batches of 100.
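The batching step in the Lambda function could look roughly like the sketch below. It is not the code from the branch: the function names and the `write_batch` callback are hypothetical, and the callback stands in for whatever actually performs the Timestream write. It only shows splitting one JSONL object into batches of 100.

```python
import json
from typing import Callable, Iterator

# Batch size taken from the description above (writes happen in batches of 100).
MAX_RECORDS_PER_WRITE = 100


def parse_jsonl(body: str) -> list[dict]:
    """Parse a JSON Lines document, skipping blank lines."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def chunked(items: list[dict], size: int = MAX_RECORDS_PER_WRITE) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_in_batches(body: str, write_batch: Callable[[list[dict]], None]) -> int:
    """Parse one S3 object's JSONL body and pass it to `write_batch` in batches.

    `write_batch` is a hypothetical callback; in a real Lambda it would wrap
    the Timestream write call. Returns the number of batches written.
    """
    batches = 0
    for batch in chunked(parse_jsonl(body)):
        write_batch(batch)
        batches += 1
    return batches
```

Keeping the writer as a callback means the batching logic can be tested without touching AWS, and error handling for a failed batch can live in one place.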

Performance

Timestream does not like to return big result sets. Athena is 3 to 4 times faster on small result sets and up to 40 times faster on big result sets.

https://twitter.com/der_rehan/status/1723782017462993121


FAQ

Why use Firehose before the SQS?

With SQS you can only specify the maximum wait period; it can trigger your function at any time before that. Firehose waits exactly that time, or until the buffer is full. We want to wait exactly that time, in this case 1 minute, because each page view produces multiple events (the first when the page is opened, the second for the time spent on the page). The Lambda function can then dedupe them a little and write only one record instead of two, reducing the number of writes to Timestream.
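A minimal sketch of that dedupe step, assuming each event carries a page view identifier and the time spent on the page. The field names `page_id` and `time_on_page` are assumptions for illustration and may not match the project's schema.

```python
def dedupe_page_views(events: list[dict]) -> list[dict]:
    """Keep one event per page view: the one with the largest time on page.

    `page_id` and `time_on_page` are assumed field names for this sketch.
    """
    latest: dict[str, dict] = {}
    for event in events:
        key = event["page_id"]
        current = latest.get(key)
        # Prefer the later or longer event so the final time on page survives.
        if current is None or event["time_on_page"] >= current["time_on_page"]:
            latest[key] = event
    return list(latest.values())
```

This only merges events that arrive in the same batch. Events for the same page view that land in different buffer windows would still be written separately, which is presumably why the dedupe is only partial.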

rehanvdm commented 8 months ago

Final verdict

Against Timestream

Benefits of Timestream

Decision

We will not be going with Timestream; the Athena-based solution we have is faster. By rough calculations, it works out to about the same amount per month. I have a hunch that it might be cheaper to query when querying thousands and thousands of times, but I am not sure. The Timestream solution is nice because I would not have to define the schema in IaC (on the Firehose and the Athena table), but it also introduces more error handling, with the SQS queue and the Lambda function writing into Timestream.

In the end, the benefits do not outweigh the cons. We will be sticking with Athena, for now.