Presto Workload Analyzer

Presto Workload Analyzer Logo

The Workload Analyzer collects Presto® and Trino workload statistics, and analyzes them. The analysis provides improved visibility into your analytical workloads, and enables query optimization - to enhance cluster performance.

The Presto® Workload Analyzer collects, and stores, QueryInfo JSONs for queries executed while it is running, and any historical queries held in the Presto® Coordinator memory.

The collection process has negligible compute-costs, and does not impact cluster query execution in any way. Ensure that sufficient disk space is available in your working directory. Typically, a compressed JSON file size will be 50kb - 200kb.

Features
Supported Versions of Presto
Installation
Usage
Screencasts
Advanced Features
Notes

Features

Continuously collects and stores QueryInfo JSONs, in the background without impacting query performance.
Summarizes key query metrics to a summary.jsonl file.
Generates an analysis report:
- Query detail- query peak memory, input data read by query, and joins distribution.
- Table activity- wall time utilization, and input bytes read, by table scans.
- Presto® Operators- wall time usage, and input bytes read, by operator.

Supported Versions of Presto

The Workload Analyzer supports the following versions:

Trino (FKA PrestoSQL)- 402 and older.
PrestoDB- 0.245.1 and older.
Starburst Enterprise- 402e and older.
Dataproc- 1.5.x and older.

Although the Workload Analyzer may run with newer versions of Presto®, these scenarios have not been tested.

Installation

For installation, see here.

Usage

Local machine/ Remote machine

First, go to the analyzer directory, where the Workload Analyzer Python code can be found.

cd analyzer/

To collect statistics from your cluster, run the following script for a period that will provide a representative sample of your workload.

./collect.py -c http://<presto-coordinator>:8080 --username-request-header "X-Trino-User" -o ./JSONs/ --loop

Notes:

In most cases, this period will be between 5 and 15 days, with longer durations providing more significant analysis.
The above command will continue running until stopped by the user (Ctrl+C).

To analyze the downloaded JSONs directory (e.g. ./JSONs/) and generate a zipped HTML report, execute the following command:

./extract.py -i ./JSONs/ && ./analyze.py -i ./JSONs/summary.jsonl.gz -o ./output.zip

Docker

To collect statistics from your cluster, run the following script for a period that will provide a representative sample of your workload.

$ mkdir JSONs/
$ docker run -v $PWD/JSONs/:/app/JSONs analyzer ./analyzer/collect.py -c http://$PRESTO_COORDINATOR:8080 --username-request-header "X-Trino-User" -o JSONs/ --loop

To analyze the downloaded JSONs directory (e.g. ./JSONs/), and generate a zipped HTML report, execute the following commands:

$ docker run -v $PWD/JSONs/:/app/JSONs analyzer ./analyzer/extract.py -i JSONs/
$ docker run -v $PWD/JSONs/:/app/JSONs analyzer ./analyzer/analyze.py -i JSONs/summary.jsonl.gz -o JSONs/output.zip

Notes:

In most cases, this period will be between 5 and 15 days, with longer durations providing more significant analysis.
The above command will continue running until stopped by the user (Ctrl+C).

Screencasts

See the following screencasts for usage examples:

Collection

Analysis

Advanced Features

In exceptional circumstances, it may be desirable to do one or more of the following:
1. Obfuscate the schema names
2. Remove the SQL queries from the summary file
3. Analyze queries for a specific schema (joins with other schemas are included)

To enable these requirements, the ./jsonl_process.py script may be executed, after the ./extract.py script, but before the ./analyze.py script.

In the example below, only queries from the transactions schema are kept, and the SQL queries are removed from the new summary file:

./jsonl_process.py -i ./JSONs/summary.jsonl.gz -o ./processed_summary.jsonl.gz --filter-schema transactions --remove-query

In the following example, all the schema names are obfuscated:

./jsonl_process.py -i ./JSONs/summary.jsonl.gz -o ./processed_summary.jsonl.gz --rename-schemas

In the following example, all the partition and user names are obfuscated:

./jsonl_process.py -i ./JSONs/summary.jsonl.gz -o ./processed_summary.jsonl.gz --rename-partitions --rename-user

After the ./jsonl_process.py script has been executed, to generate a report based on the new summary file, run:

./analyze.py -i ./processed_summary.jsonl.gz -o ./output.zip

To create a high-contrast report, use the --high-contrast-mode parameter, for example:
```
./analyze.py --high-contrast-mode -i ./JSONs/summary.jsonl.gz -o ./output.zip
```

Notes

Presto® is a trademark of The Linux Foundation.

varadaio / presto-workload-analyzer

readme