gravitational / teleport


Optimize Athena queries to address cumulative 1GB read limitation #41544

Open tigrato opened 3 months ago

tigrato commented 3 months ago

Description:

Issue:

Currently, Teleport relies on Athena queries for retrieving logs within specified date ranges. While this approach has been effective, it runs into a critical limitation: a cumulative 1GB read limit per query. When the combined size of the files containing logs for a given date range exceeds 1GB, the query fails silently, disrupting log retrieval without proper error handling or any notification to the client.

Problem Statement:

The cumulative 1GB read limit poses a significant challenge for Teleport, particularly when querying logs spanning extensive date ranges or dealing with high-volume log data. When the limit is exceeded, the query abruptly terminates and the system fails to deliver the expected results to the client. Even with row limits on the queries, they still fail, because Athena must read all of the underlying files.

Proposed Solution:

To address this limitation and ensure the reliability of log retrieval operations, we need to optimize Athena queries not only for cost but also to avoid hitting the cumulative data read limit.

https://github.com/gravitational/teleport-plugins/pull/1068 implements a mitigation by requesting smaller time windows, but the problem should be handled by the Athena backend itself if possible.
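For illustration only (a minimal sketch of the windowing idea, not the plugin's actual code; all names are hypothetical), the mitigation boils down to splitting the requested range into bounded windows and issuing one query per window:

package main

import (
	"fmt"
	"time"
)

// splitRange breaks [from, to) into windows of at most step, so that each
// Athena query only covers a bounded slice of the requested date range.
func splitRange(from, to time.Time, step time.Duration) [][2]time.Time {
	var windows [][2]time.Time
	for start := from; start.Before(to); start = start.Add(step) {
		end := start.Add(step)
		if end.After(to) {
			end = to
		}
		windows = append(windows, [2]time.Time{start, end})
	}
	return windows
}

func main() {
	from := time.Date(2024, 5, 1, 0, 0, 0, 0, time.UTC)
	to := time.Date(2024, 5, 8, 0, 0, 0, 0, time.UTC)
	// One query per day instead of a single query over the whole week;
	// results are merged and truncated client-side.
	for _, w := range splitRange(from, to, 24*time.Hour) {
		fmt.Printf("query window %s - %s\n", w[0], w[1])
	}
}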

rosstimothy commented 3 months ago

Short term: we should update the limit set on the workgroup to something higher than 1GB.
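For reference, a rough sketch of what raising that cutoff could look like with the AWS SDK for Go v2; the workgroup name and the 20GB value below are placeholders, not what we actually run:

package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/athena"
	"github.com/aws/aws-sdk-go-v2/service/athena/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := athena.NewFromConfig(cfg)

	// Raise the per-query bytes-scanned cutoff on the audit workgroup.
	// "teleport-audit" and 20GB are illustrative values only.
	_, err = client.UpdateWorkGroup(ctx, &athena.UpdateWorkGroupInput{
		WorkGroup: aws.String("teleport-audit"),
		ConfigurationUpdates: &types.WorkGroupConfigurationUpdates{
			BytesScannedCutoffPerQuery: aws.Int64(20 * 1024 * 1024 * 1024),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}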

Long term: we either need to better track how much data exists for a given time period and proactively limit the timeframe we query for, or we need to alter our data model to allow for better partitioning of the data.
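As a sketch of the first option (the bucket name and prefix layout below are assumptions, not the real schema), the backend could estimate how much data a window covers by summing S3 object sizes under the relevant date prefixes before building the query:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// bytesForDate sums the sizes of all objects stored under a single
// event_date prefix. Bucket name and prefix layout are hypothetical.
func bytesForDate(ctx context.Context, client *s3.Client, bucket, date string) (int64, error) {
	var total int64
	p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String("events/event_date=" + date + "/"),
	})
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			return 0, err
		}
		for _, obj := range page.Contents {
			total += aws.ToInt64(obj.Size) // Size is *int64 in recent SDK versions
		}
	}
	return total, nil
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	size, err := bytesForDate(ctx, s3.NewFromConfig(cfg), "example-audit-bucket", "2024-05-01")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("data stored for 2024-05-01: %d bytes\n", size)
}

If the estimated total for the requested range exceeds the workgroup cutoff, the backend could shrink the window before issuing the query.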

From running a few test queries manually, the biggest factor in the amount of data scanned was the ORDER BY clause.

SELECT DISTINCT uid, event_time, event_data FROM auditevents WHERE event_date BETWEEN date(?) AND date(?) AND event_time BETWEEN ? and ? ORDER BY event_time DESC, uid DESC LIMIT 5000;

The event_time column is the timestamp at which the event occurred and cannot be used as a partition key due to its high cardinality. When the ORDER BY conditions are removed, the amount of data scanned drops from ~5GB to ~18MB.
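For comparison, this is the same query with the ordering removed (the ~18MB case above):

SELECT DISTINCT uid, event_time, event_data FROM auditevents WHERE event_date BETWEEN date(?) AND date(?) AND event_time BETWEEN ? and ? LIMIT 5000;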

Another area with room for improvement is the size of the parquet files stored in S3. Auth currently buffers messages received from SQS for a very short while and then writes them to S3, to reduce the time it takes for audit events to become visible to users. However, that results in lots of very small files (<5MB) in S3. All of the best-practice documentation recommends storing files in S3 that are around 128MB in size. The AWS Glue documentation specifically mentions the following:

Each file stored inside a partition should be at least 128 MB to a maximum of one GB to ensure that AWS Glue (Spark) can read and process the data efficiently. If the file sizes are too small (KBs to few MBs), AWS Glue will spend more time in I/O and can lead to degraded performance.
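As a rough illustration of the trade-off (this is not Auth's actual buffering logic, and the thresholds are made up), the flush decision is essentially size-or-age based: a larger size threshold yields bigger parquet files, closer to the 128MB guidance, at the cost of a longer delay before events become visible.

package main

import (
	"fmt"
	"time"
)

// flushPolicy decides when a buffered batch of events should be written
// to S3 as one parquet file. The values used below are illustrative only.
type flushPolicy struct {
	maxBytes int64         // flush once the buffer reaches this size
	maxAge   time.Duration // ...or once the oldest buffered event is this old
}

func (p flushPolicy) shouldFlush(bufferedBytes int64, oldest time.Time) bool {
	return bufferedBytes >= p.maxBytes || time.Since(oldest) >= p.maxAge
}

func main() {
	p := flushPolicy{maxBytes: 128 << 20, maxAge: 5 * time.Minute}
	fmt.Println(p.shouldFlush(4<<20, time.Now().Add(-30*time.Second))) // false: small and recent
	fmt.Println(p.shouldFlush(130<<20, time.Now()))                    // true: size threshold reached
	fmt.Println(p.shouldFlush(4<<20, time.Now().Add(-10*time.Minute))) // true: buffer too old
}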

tigrato commented 3 months ago

From running a few test queries manually, the biggest factor in the amount of data scanned was the ORDER BY clause. ... The event_time column is the timestamp at which the event occurred and cannot be used as a partition key due to its high cardinality. When the ORDER BY conditions are removed, the amount of data scanned drops from ~5GB to ~18MB.

This only holds when no search filter is specified. Once some kind of search is involved, the query will scan more data whenever it doesn't find enough matching events early on.

We can attempt to optimize by ensuring that events are written in order before being saved to Parquet. This would allow us to eliminate the ORDER BY clause. As a result, non-filtered queries will be more efficient, as Athena will read only the necessary data rather than all data within the period. This approach will significantly reduce costs for default queries and will let us optimize other queries that require filtering by event type.
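A minimal sketch of the write-in-order idea (the event type and fields here are stand-ins, not the backend's real types): sort each batch deterministically by timestamp, breaking ties on uid, before handing it to the parquet writer, so the stored files are already ordered and queries no longer need Athena to sort the result set.

package main

import (
	"fmt"
	"sort"
	"time"
)

// auditEvent is a stand-in for the real event type; only the fields that
// define the ordering are shown.
type auditEvent struct {
	UID  string
	Time time.Time
}

// sortForParquet puts a batch into a deterministic (time, uid) order before
// it is written, so readers can rely on file-level ordering instead of an
// ORDER BY over the whole result set.
func sortForParquet(batch []auditEvent) {
	sort.Slice(batch, func(i, j int) bool {
		if !batch[i].Time.Equal(batch[j].Time) {
			return batch[i].Time.Before(batch[j].Time)
		}
		return batch[i].UID < batch[j].UID
	})
}

func main() {
	batch := []auditEvent{
		{UID: "b", Time: time.Unix(1700000100, 0)},
		{UID: "a", Time: time.Unix(1700000100, 0)},
		{UID: "c", Time: time.Unix(1700000000, 0)},
	}
	sortForParquet(batch)
	for _, e := range batch {
		fmt.Println(e.Time.Unix(), e.UID)
	}
}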