rehanvdm / serverless-website-analytics

A CDK construct that consists of a serverless backend, frontend and client side code to track website analytics
GNU General Public License v2.0
167 stars 13 forks source link

Increase queries performance and efficency #41

Closed rehanvdm closed 1 year ago

rehanvdm commented 1 year ago

We can get a 2x to 3x improvement by specifying the partitions explicitly and not letting Athena infer them in queries. This also scans less data and possibly fewer S3 records.

Experiment

Doing a range query is not as efficient as specifying the partitions directly. Compare these:

Using the exact partition field (page_opened_at_date):

AND (
  page_opened_at_date = '2023-08-27' OR
  page_opened_at_date = '2023-08-28' OR
  page_opened_at_date = '2023-08-29' OR
  page_opened_at_date = '2023-08-30' OR
  page_opened_at_date = '2023-08-31' OR
  page_opened_at_date = '2023-09-01' OR
  page_opened_at_date = '2023-09-02' OR
  page_opened_at_date = '2023-09-03' OR
  page_opened_at_date = '2023-09-04')

Results (71) | Time in queue: 143 ms | Run time: 1.308 sec | Data scanned: 1.12 MB

image

Letting Athena extrapolate the partition from the timestamp field(page_opened_at):

 page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
                    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS') 

Results (71) | Time in queue: 129 ms | Run time: 4.663 sec | Data scanned: 1.49 MB

image

Results

Query Type
Specify partitions 1.3s
Infer partitions from range query 4.7s

Conclusion

Possible reasons for the increase:

The initial assumption that Athena will infer the partitions if we are using automatic partition projection is still correct. But there seems to be quite a significant performance loss if it infers the partitions automatically on range queries.

We will specify the exact partition instead of the range query. The Athena query maximum length is ±260k, if 1 date condition (page_opened_at_date = '2023-08-24' OR) is 37 chars, assuming 31 days and 12 months then the extra length this adds is 31*37*12=13764 bytes or about 5.3% the maximum allowed length. Meaning it will even support 10 year queries as the queries we have are not as complex.


Exact queries used in the test:

WITH 
          cte_data AS (
              SELECT user_id, country_name, page_opened_at,
                     ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
              FROM page_views
              WHERE (site = 'rehanvdm.com' OR site = 'cloudglance.dev' OR site = 'blog.cloudglance.dev' OR site = 'docs.cloudglance.dev' OR site = 'tests') 
              AND (
              page_opened_at_date = '2023-08-27' OR
              page_opened_at_date = '2023-08-28' OR
              page_opened_at_date = '2023-08-29' OR
              page_opened_at_date = '2023-08-30' OR
              page_opened_at_date = '2023-08-31' OR
              page_opened_at_date = '2023-09-01' OR
              page_opened_at_date = '2023-09-02' OR
              page_opened_at_date = '2023-09-03' OR
              page_opened_at_date = '2023-09-04')
          ),
          cte_data_filtered AS (
              SELECT *
              FROM cte_data
              WHERE rn = 1 AND page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
                    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS') 
          ),
          user_distinct_stat AS (
            SELECT
              user_id, country_name,
              COUNT(*) as "visitors"
            FROM cte_data_filtered
            WHERE country_name IS NOT NULL
            GROUP BY 1, 2
            ORDER BY 3 DESC
          )
          SELECT
            country_name  as "group",
            COUNT(*) as "visitors"
          FROM user_distinct_stat
          GROUP BY country_name
          ORDER BY visitors DESC
WITH 
          cte_data AS (
              SELECT user_id, country_name, page_opened_at,
                     ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
              FROM page_views
              WHERE (site = 'rehanvdm.com' OR site = 'cloudglance.dev' OR site = 'blog.cloudglance.dev' OR site = 'docs.cloudglance.dev' OR site = 'tests') AND page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
                    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS') 
          ),
          cte_data_filtered AS (
              SELECT *
              FROM cte_data
              WHERE rn = 1
          ),
          user_distinct_stat AS (
            SELECT
              user_id, country_name,
              COUNT(*) as "visitors"
            FROM cte_data_filtered
            WHERE country_name IS NOT NULL
            GROUP BY 1, 2
            ORDER BY 3 DESC
          )
          SELECT
            country_name  as "group",
            COUNT(*) as "visitors"
          FROM user_distinct_stat
          GROUP BY country_name
          ORDER BY visitors DESC
rehanvdm commented 1 year ago

Released in v1.2.0 #45