We can get a 2x to 3x improvement by specifying the partitions explicitly and not letting Athena infer them in queries. This also scans less data and possibly fewer S3 records.
Experiment
Doing a range query is not as efficient as specifying the partitions directly. Compare these:
Using the exact partition field (page_opened_at_date):
AND (
page_opened_at_date = '2023-08-27' OR
page_opened_at_date = '2023-08-28' OR
page_opened_at_date = '2023-08-29' OR
page_opened_at_date = '2023-08-30' OR
page_opened_at_date = '2023-08-31' OR
page_opened_at_date = '2023-09-01' OR
page_opened_at_date = '2023-09-02' OR
page_opened_at_date = '2023-09-03' OR
page_opened_at_date = '2023-09-04')
Results (71) | Time in queue: 143 ms | Run time: 1.308 sec | Data scanned: 1.12 MB
Letting Athena extrapolate the partition from the timestamp field(page_opened_at):
page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS')
Results (71) | Time in queue: 129 ms | Run time: 4.663 sec | Data scanned: 1.49 MB
Results
Query Type
Specify partitions
1.3s
Infer partitions from range query
4.7s
Conclusion
Possible reasons for the increase:
Simplified logic, less planning
Athena does not have to identify partitions, we explicitly specify them
Less data to be scanned, the range query scans more than it needs to
The initial assumption that Athena will infer the partitions if we are using automatic partition projection is still correct. But there seems to be quite a significant performance loss if it infers the partitions automatically on range queries.
We will specify the exact partition instead of the range query. The Athena query maximum length is ±260k, if 1 date condition (page_opened_at_date = '2023-08-24' OR) is 37 chars, assuming 31 days and 12 months then the extra length this adds is 31*37*12=13764 bytes or about 5.3% the maximum allowed length. Meaning it will even support 10 year queries as the queries we have are not as complex.
Exact queries used in the test:
WITH
cte_data AS (
SELECT user_id, country_name, page_opened_at,
ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
FROM page_views
WHERE (site = 'rehanvdm.com' OR site = 'cloudglance.dev' OR site = 'blog.cloudglance.dev' OR site = 'docs.cloudglance.dev' OR site = 'tests')
AND (
page_opened_at_date = '2023-08-27' OR
page_opened_at_date = '2023-08-28' OR
page_opened_at_date = '2023-08-29' OR
page_opened_at_date = '2023-08-30' OR
page_opened_at_date = '2023-08-31' OR
page_opened_at_date = '2023-09-01' OR
page_opened_at_date = '2023-09-02' OR
page_opened_at_date = '2023-09-03' OR
page_opened_at_date = '2023-09-04')
),
cte_data_filtered AS (
SELECT *
FROM cte_data
WHERE rn = 1 AND page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS')
),
user_distinct_stat AS (
SELECT
user_id, country_name,
COUNT(*) as "visitors"
FROM cte_data_filtered
WHERE country_name IS NOT NULL
GROUP BY 1, 2
ORDER BY 3 DESC
)
SELECT
country_name as "group",
COUNT(*) as "visitors"
FROM user_distinct_stat
GROUP BY country_name
ORDER BY visitors DESC
WITH
cte_data AS (
SELECT user_id, country_name, page_opened_at,
ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
FROM page_views
WHERE (site = 'rehanvdm.com' OR site = 'cloudglance.dev' OR site = 'blog.cloudglance.dev' OR site = 'docs.cloudglance.dev' OR site = 'tests') AND page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS')
),
cte_data_filtered AS (
SELECT *
FROM cte_data
WHERE rn = 1
),
user_distinct_stat AS (
SELECT
user_id, country_name,
COUNT(*) as "visitors"
FROM cte_data_filtered
WHERE country_name IS NOT NULL
GROUP BY 1, 2
ORDER BY 3 DESC
)
SELECT
country_name as "group",
COUNT(*) as "visitors"
FROM user_distinct_stat
GROUP BY country_name
ORDER BY visitors DESC
We can get a 2x to 3x improvement by specifying the partitions explicitly and not letting Athena infer them in queries. This also scans less data and possibly fewer S3 records.
Experiment
Doing a range query is not as efficient as specifying the partitions directly. Compare these:
Using the exact partition field (
page_opened_at_date
):Letting Athena extrapolate the partition from the timestamp field(
page_opened_at
):Results
Conclusion
Possible reasons for the increase:
The initial assumption that Athena will infer the partitions if we are using automatic partition projection is still correct. But there seems to be quite a significant performance loss if it infers the partitions automatically on range queries.
We will specify the exact partition instead of the range query. The Athena query maximum length is ±260k, if 1 date condition (
page_opened_at_date = '2023-08-24' OR
) is 37 chars, assuming 31 days and 12 months then the extra length this adds is31*37*12=13764 bytes
or about 5.3% the maximum allowed length. Meaning it will even support 10 year queries as the queries we have are not as complex.Exact queries used in the test: