Queries run roughly 2x to 3x faster when they name the partitions explicitly instead of leaving Athena to infer them from a timestamp range. Explicit partitions also scan less data and possibly read fewer S3 objects.

Experiment

Doing a range query is not as efficient as specifying the partitions directly. Compare these:

Using the exact partition field (page_opened_at_date):

AND (
  page_opened_at_date = '2023-08-27' OR
  page_opened_at_date = '2023-08-28' OR
  page_opened_at_date = '2023-08-29' OR
  page_opened_at_date = '2023-08-30' OR
  page_opened_at_date = '2023-08-31' OR
  page_opened_at_date = '2023-09-01' OR
  page_opened_at_date = '2023-09-02' OR
  page_opened_at_date = '2023-09-03' OR
  page_opened_at_date = '2023-09-04')

Results (71) | Time in queue: 143 ms | Run time: 1.308 sec | Data scanned: 1.12 MB

Letting Athena infer the partitions from the timestamp field (page_opened_at):

 page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
                    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS') 

Results (71) | Time in queue: 129 ms | Run time: 4.663 sec | Data scanned: 1.49 MB
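
The explicit partition list in the first variant does not have to be written by hand. As a rough sketch, it could be generated from the same timestamp range used in the second variant. The function name `partition_predicate` is hypothetical, and the sketch assumes the timestamps are in the same time zone as the page_opened_at_date partition values.

```python
from datetime import datetime, timedelta


def partition_predicate(start: datetime, end: datetime,
                        column: str = "page_opened_at_date") -> str:
    """Build an OR-chain with one equality test per day partition touched by [start, end].

    Assumption: start and end use the same time zone as the partition dates.
    """
    if end < start:
        raise ValueError("end must not be before start")
    day_count = (end.date() - start.date()).days + 1  # inclusive of both end dates
    clauses = [
        f"{column} = '{(start.date() + timedelta(days=offset)).isoformat()}'"
        for offset in range(day_count)
    ]
    return "(" + " OR\n  ".join(clauses) + ")"


# Same range as the BETWEEN variant above.
start = datetime(2023, 8, 27, 22, 0, 0)
end = datetime(2023, 9, 4, 21, 59, 59, 999000)
predicate = partition_predicate(start, end)
print("AND " + predicate)
```

The timestamp filter on page_opened_at is still needed alongside this predicate, because the first and last partitions are only partly inside the range.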

Results

Query Type	Run time
Specify partitions	1.3 s
Infer partitions from range query	4.7 s
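
For reference, the ratios behind these numbers can be worked out from the two measured runs. This is only arithmetic on the single pair of runs reported above, so it says nothing about variance between runs. The variable names are illustrative.

```python
# Measurements copied from the two runs reported above.
range_query = {"run_time_s": 4.663, "scanned_mb": 1.49}
explicit_partitions = {"run_time_s": 1.308, "scanned_mb": 1.12}

# How many times faster the explicit-partition query ran.
speedup = range_query["run_time_s"] / explicit_partitions["run_time_s"]

# Fraction of scanned data saved by naming the partitions.
scan_reduction = 1 - explicit_partitions["scanned_mb"] / range_query["scanned_mb"]

print(f"speedup: {speedup:.2f}x")
print(f"data scanned reduced by: {scan_reduction:.1%}")
```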

Conclusion

Possible reasons for the speedup:

Simpler predicate logic, so less query planning
Athena does not have to work out which partitions apply because we name them explicitly
Less data is scanned; the range query scans more than it needs to (1.49 MB versus 1.12 MB)

The initial assumption that Athena will infer the partitions when we use automatic partition projection still holds. However, there appears to be a significant performance penalty when it has to infer the partitions from a range query.

We will therefore specify the exact partitions instead of using the range query. The maximum Athena query length is about 260k bytes. One date condition (page_opened_at_date = '2023-08-24' OR) is 37 characters, so assuming 31 days in each of 12 months, a full year of conditions adds 31*37*12=13764 bytes, or about 5.3% of the maximum allowed length. Even a 10 year query (137640 bytes, roughly 53% of the limit) would fit, because the rest of our queries is not very long.
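
The length estimate above can be reproduced with a few lines. This is a sketch that uses the approximate 260k limit quoted in the text rather than an exact figure, and the same deliberately pessimistic 31 days per month. The helper name `predicate_overhead` is hypothetical.

```python
# Approximate limit quoted in the text (assumption: not an exact figure).
MAX_QUERY_LENGTH = 260_000
# One date condition, exactly as it appears in the query.
CONDITION = "page_opened_at_date = '2023-08-24' OR"


def predicate_overhead(years: int, days_per_month: int = 31,
                       months_per_year: int = 12) -> tuple[int, float]:
    """Return (extra characters, share of the maximum query length)."""
    length = len(CONDITION) * days_per_month * months_per_year * years
    return length, length / MAX_QUERY_LENGTH


for years in (1, 10):
    length, share = predicate_overhead(years)
    print(f"{years} year(s): {length} characters, {share:.1%} of the limit")
```

The estimate ignores the newline or space between conditions, so the real overhead is slightly higher, but it still leaves plenty of headroom.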

Exact queries used in the test, first with explicit partitions and then with the range query:

WITH 
          cte_data AS (
              SELECT user_id, country_name, page_opened_at,
                     ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
              FROM page_views
              WHERE (site = 'rehanvdm.com' OR site = 'cloudglance.dev' OR site = 'blog.cloudglance.dev' OR site = 'docs.cloudglance.dev' OR site = 'tests') 
              AND (
              page_opened_at_date = '2023-08-27' OR
              page_opened_at_date = '2023-08-28' OR
              page_opened_at_date = '2023-08-29' OR
              page_opened_at_date = '2023-08-30' OR
              page_opened_at_date = '2023-08-31' OR
              page_opened_at_date = '2023-09-01' OR
              page_opened_at_date = '2023-09-02' OR
              page_opened_at_date = '2023-09-03' OR
              page_opened_at_date = '2023-09-04')
          ),
          cte_data_filtered AS (
              SELECT *
              FROM cte_data
              WHERE rn = 1 AND page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
                    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS') 
          ),
          user_distinct_stat AS (
            SELECT
              user_id, country_name,
              COUNT(*) as "visitors"
            FROM cte_data_filtered
            WHERE country_name IS NOT NULL
            GROUP BY 1, 2
            ORDER BY 3 DESC
          )
          SELECT
            country_name  as "group",
            COUNT(*) as "visitors"
          FROM user_distinct_stat
          GROUP BY country_name
          ORDER BY visitors DESC

WITH 
          cte_data AS (
              SELECT user_id, country_name, page_opened_at,
                     ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
              FROM page_views
              WHERE (site = 'rehanvdm.com' OR site = 'cloudglance.dev' OR site = 'blog.cloudglance.dev' OR site = 'docs.cloudglance.dev' OR site = 'tests') AND page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
                    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS') 
          ),
          cte_data_filtered AS (
              SELECT *
              FROM cte_data
              WHERE rn = 1
          ),
          user_distinct_stat AS (
            SELECT
              user_id, country_name,
              COUNT(*) as "visitors"
            FROM cte_data_filtered
            WHERE country_name IS NOT NULL
            GROUP BY 1, 2
            ORDER BY 3 DESC
          )
          SELECT
            country_name  as "group",
            COUNT(*) as "visitors"
          FROM user_distinct_stat
          GROUP BY country_name
          ORDER BY visitors DESC

rehanvdm / serverless-website-analytics

Increase queries performance and efficiency #41
