varadaio / presto-workload-analyzer

The Workload Analyzer collects Presto® and Trino workload statistics, and analyzes them
GNU General Public License v3.0
135 stars 38 forks source link
analysis analyzer cluster presto prestodb prestosql trino trinodb visualization workloads

Presto Workload Analyzer

Presto Workload Analyzer Logo

The Workload Analyzer collects Presto® and Trino workload statistics, and analyzes them. The analysis provides improved visibility into your analytical workloads, and enables query optimization - to enhance cluster performance.

The Presto® Workload Analyzer collects, and stores, QueryInfo JSONs for queries executed while it is running, and any historical queries held in the Presto® Coordinator memory.

The collection process has negligible compute-costs, and does not impact cluster query execution in any way. Ensure that sufficient disk space is available in your working directory. Typically, a compressed JSON file size will be 50kb - 200kb.

Table of contents

Features

Supported Versions of Presto

The Workload Analyzer supports the following versions:

  1. Trino (FKA PrestoSQL)- 402 and older.
  2. PrestoDB- 0.245.1 and older.
  3. Starburst Enterprise- 402e and older.
  4. Dataproc- 1.5.x and older.

Although the Workload Analyzer may run with newer versions of Presto®, these scenarios have not been tested.

Installation

For installation, see here.

Usage

Local machine/ Remote machine

First, go to the analyzer directory, where the Workload Analyzer Python code can be found.

cd analyzer/

To collect statistics from your cluster, run the following script for a period that will provide a representative sample of your workload.

./collect.py -c http://<presto-coordinator>:8080 --username-request-header "X-Trino-User" -o ./JSONs/ --loop

Notes:

  1. In most cases, this period will be between 5 and 15 days, with longer durations providing more significant analysis.
  2. The above command will continue running until stopped by the user (Ctrl+C).

To analyze the downloaded JSONs directory (e.g. ./JSONs/) and generate a zipped HTML report, execute the following command:

./extract.py -i ./JSONs/ && ./analyze.py -i ./JSONs/summary.jsonl.gz -o ./output.zip

Docker

To collect statistics from your cluster, run the following script for a period that will provide a representative sample of your workload.

$ mkdir JSONs/
$ docker run -v $PWD/JSONs/:/app/JSONs analyzer ./analyzer/collect.py -c http://$PRESTO_COORDINATOR:8080 --username-request-header "X-Trino-User" -o JSONs/ --loop

To analyze the downloaded JSONs directory (e.g. ./JSONs/), and generate a zipped HTML report, execute the following commands:

$ docker run -v $PWD/JSONs/:/app/JSONs analyzer ./analyzer/extract.py -i JSONs/
$ docker run -v $PWD/JSONs/:/app/JSONs analyzer ./analyzer/analyze.py -i JSONs/summary.jsonl.gz -o JSONs/output.zip

Notes:

  1. In most cases, this period will be between 5 and 15 days, with longer durations providing more significant analysis.
  2. The above command will continue running until stopped by the user (Ctrl+C).

Screencasts

See the following screencasts for usage examples:

Collection

asciicast

Analysis

asciicast

Advanced Features

To enable these requirements, the ./jsonl_process.py script may be executed, after the ./extract.py script, but before the ./analyze.py script.

In the example below, only queries from the transactions schema are kept, and the SQL queries are removed from the new summary file:

./jsonl_process.py -i ./JSONs/summary.jsonl.gz -o ./processed_summary.jsonl.gz --filter-schema transactions --remove-query 

In the following example, all the schema names are obfuscated:

./jsonl_process.py -i ./JSONs/summary.jsonl.gz -o ./processed_summary.jsonl.gz --rename-schemas 

In the following example, all the partition and user names are obfuscated:

./jsonl_process.py -i ./JSONs/summary.jsonl.gz -o ./processed_summary.jsonl.gz --rename-partitions --rename-user 

After the ./jsonl_process.py script has been executed, to generate a report based on the new summary file, run:

./analyze.py -i ./processed_summary.jsonl.gz -o ./output.zip

Notes

Presto® is a trademark of The Linux Foundation.