SparkScope is a monitoring and profiling tool for Spark applications. It lets you review resource allocation, utilization, and demand as they evolved during Spark application execution. SparkScope presents this information in visual charts that make it easy to spot over- and under-provisioning.
It is implemented as a SparkListener, which means it runs inside the driver and listens for Spark events. SparkScope consumes the CSV metrics produced by the custom SparkScopeCsvSink and supports multiple storage types.
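For context, this is the standard Spark listener mechanism: a class extending org.apache.spark.scheduler.SparkListener is registered via spark.extraListeners and receives event callbacks on the driver. A minimal sketch of the mechanism (the listener below is illustrative, not SparkScope's actual implementation):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Illustrative listener: once registered via spark.extraListeners it runs
// inside the driver, and onApplicationEnd fires when the application
// finishes -- the point at which a tool like SparkScope can collect the
// CSV metrics and build its report.
class ReportOnEndListener extends SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit =
    println(s"Application ended at ${end.time}; analyze metrics here.")
}
```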
SparkScope produces reports in two formats: HTML and JSON.
SparkScope HTML reports contain the following features:
Stats:
Charts:
- total % of utilized CPU/heap charts
- total utilization vs allocation CPU/heap charts
- number of tasks vs CPU capacity and number of executors
- heap and non-heap charts for executors
- heap and non-heap charts for driver
Warnings:
| | Spark 2 (spark2 branch) | Spark 3 (main branch) |
|---|---|---|
| Scala version | 2.11.12 | 2.12.18 |
| compatible JDK versions | 7, 8 | 8, 11, 17 |
| compatible Spark versions | 2.3, 2.4 | 3.2, 3.3, 3.4, 3.5 |
| parameter | required | sample values | description |
|---|---|---|---|
| spark.extraListeners | mandatory | com.ucesys.sparkscope.SparkScopeJobListener | Spark listener class |
| spark.metrics.conf.driver.source.jvm.class | mandatory | org.apache.spark.metrics.source.JvmSource | JVM metrics source for the driver |
| spark.metrics.conf.executor.source.jvm.class | mandatory | org.apache.spark.metrics.source.JvmSource | JVM metrics source for executors |
| spark.metrics.conf.*.sink.csv.class | mandatory | org.apache.spark.metrics.sink.SparkScopeCsvSink | CSV sink class |
| spark.metrics.conf.*.sink.csv.period | mandatory | 5 | period of the metrics spill |
| spark.metrics.conf.*.sink.csv.unit | mandatory | seconds | unit of the metrics spill period |
| spark.metrics.conf.*.sink.csv.directory | mandatory | s3://my-bucket/path/to/metrics | path to the metrics directory; can be s3, hdfs, maprfs, or local |
| spark.metrics.conf.*.sink.csv.region | optional | us-east-1 | AWS region, required for s3 storage |
| spark.metrics.conf.*.sink.csv.appName | optional | MyApp | application name, also used for grouping metrics |
| spark.sparkscope.report.html.path | optional | s3://my-bucket/path/to/html/report/dir | path to which the SparkScope HTML report will be saved |
| spark.sparkscope.report.json.path | optional | s3://my-bucket/path/to/json/report/dir | path to which the SparkScope JSON report will be saved |
| spark.sparkscope.log.path | optional | s3://my-bucket/path/to/log/dir | path to which SparkScope logs will be saved |
| spark.sparkscope.log.level | optional | DEBUG, INFO, WARN, ERROR | logging level for SparkScope logs |
| spark.sparkscope.diagnostics.enabled | optional | true/false | set to false to disable submitting diagnostics; default is true |
| spark.sparkscope.metrics.dir.driver | optional | s3://my-bucket/path/to/metrics | path to driver CSV metrics, relative to the driver; defaults to the spark.metrics.conf.driver.sink.csv.directory property value |
| spark.sparkscope.metrics.dir.executor | optional | s3://my-bucket/path/to/metrics | path to executor CSV metrics, relative to the driver; defaults to the spark.metrics.conf.executor.sink.csv.directory property value |
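The same parameters can also be set programmatically before the session is created. A minimal sketch, assuming the SparkScope jar is already on the driver and executor classpaths; paths and the app name below are placeholders, not SparkScope defaults:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: wiring the mandatory SparkScope parameters programmatically.
// The SparkScope jar must still be distributed to the driver and executors
// (e.g. via --driver-class-path and spark.executor.extraClassPath).
val spark = SparkSession.builder()
  .appName("MyApp")
  .config("spark.extraListeners", "com.ucesys.sparkscope.SparkScopeJobListener")
  .config("spark.metrics.conf.driver.source.jvm.class", "org.apache.spark.metrics.source.JvmSource")
  .config("spark.metrics.conf.executor.source.jvm.class", "org.apache.spark.metrics.source.JvmSource")
  .config("spark.metrics.conf.*.sink.csv.class", "org.apache.spark.metrics.sink.SparkScopeCsvSink")
  .config("spark.metrics.conf.*.sink.csv.period", "5")
  .config("spark.metrics.conf.*.sink.csv.unit", "seconds")
  .config("spark.metrics.conf.*.sink.csv.directory", "/tmp/csv-metrics")      // placeholder path
  .config("spark.sparkscope.report.html.path", "/tmp/sparkscope-html")        // placeholder path
  .getOrCreate()
```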
Notes:
Example: metrics and HTML report written to S3:
```
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.extraListeners=com.ucesys.sparkscope.SparkScopeJobListener \
--conf spark.metrics.conf.driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink \
--conf spark.metrics.conf.*.sink.csv.period=5 \
--conf spark.metrics.conf.*.sink.csv.unit=seconds \
--conf spark.metrics.conf.*.sink.csv.directory=s3://<bucket-name>/<path-to-metrics-dir> \
--conf spark.metrics.conf.*.sink.csv.region=<region> \
--conf spark.metrics.conf.*.sink.csv.appName=My-App \
--conf spark.sparkscope.report.html.path=s3://<bucket-name>/<path-to-html-report-dir> \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
```
Example: metrics and HTML report written to HDFS:
```
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.extraListeners=com.ucesys.sparkscope.SparkScopeJobListener \
--conf spark.metrics.conf.driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink \
--conf spark.metrics.conf.*.sink.csv.period=5 \
--conf spark.metrics.conf.*.sink.csv.unit=seconds \
--conf spark.metrics.conf.*.sink.csv.directory=hdfs://<path-to-metrics-dir> \
--conf spark.metrics.conf.*.sink.csv.appName=My-App \
--conf spark.sparkscope.report.html.path=hdfs://<path-to-html-report-dir> \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
```
Example: metrics and HTML report written to local or default filesystem paths:
```
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.extraListeners=com.ucesys.sparkscope.SparkScopeJobListener \
--conf spark.metrics.conf.driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink \
--conf spark.metrics.conf.*.sink.csv.period=5 \
--conf spark.metrics.conf.*.sink.csv.unit=seconds \
--conf spark.metrics.conf.*.sink.csv.directory=<path-to-metrics-dir> \
--conf spark.metrics.conf.*.sink.csv.appName=My-App \
--conf spark.sparkscope.report.html.path=<path-to-html-report-dir> \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
```
Instead of specifying the spark.metrics.conf.* settings as separate properties, we can also specify them in a metrics.properties file:
```
# Enable CsvSink for all instances by class name
*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink
# Polling period for the CsvSink
*.sink.csv.period=5
# Unit of the polling period for the CsvSink
*.sink.csv.unit=seconds
# Polling directory for CsvSink
*.sink.csv.directory=hdfs:///tmp/csv-metrics
# JVM SOURCE
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```
The path to the metrics.properties file is then passed in the spark-submit command:
```
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar,./metrics.properties \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.extraListeners=com.ucesys.sparkscope.SparkScopeJobListener \
--conf spark.metrics.conf=./metrics.properties \
--conf spark.sparkscope.report.html.path=hdfs://<path-to-html-report-dir> \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
```
SparkScope can also analyze a finished application offline. Your application needs to have the event log and metrics configured (but not the listener):
```
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=<path-to-event-log-dir> \
--conf spark.metrics.conf.driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink \
--conf spark.metrics.conf.*.sink.csv.period=5 \
--conf spark.metrics.conf.*.sink.csv.unit=seconds \
--conf spark.metrics.conf.*.sink.csv.directory=<path-to-metrics-dir> \
--conf spark.metrics.conf.*.sink.csv.appName=My-App \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
```
SparkScope is then run as a standalone java app against the recorded event log:
```
java \
-cp ./sparkscope-spark3-0.1.9-SNAPSHOT.jar:$(hadoop classpath) \
com.ucesys.sparkscope.SparkScopeApp \
--event-log <path-to-event-log> \
--html-path <path-to-html-report-dir> \
--json-path <path-to-json-report-dir> \
--log-path <path-to-log-dir> \
--log-level <logging level> \
--diagnostics <true/false> \
--region <aws-region>
```
The SparkScope analysis summary is printed to the console:
```
28/09/2023 01:20:22 INFO [SparkScope] SparkScope analysis took 0.052s
28/09/2023 01:20:22 INFO [SparkScope]
____ __ ____
/ __/__ ___ _____/ /__ / __/_ ___ ___ ___
_\ \/ _ \/ _ `/ __/ '_/_\ \/_ / _ \/ _ \/__/
/___/ .__/\_,_/_/ /_/\_\/___/\__\_,_/ .__/\___/
/_/ /_/ v0.1.1
28/09/2023 01:20:22 INFO [SparkScope] Executor stats:
Executor heap size: 800MB
Max heap memory utilization by executor: 286MB(35.80%)
Average heap memory utilization by executor: 156MB(19.56%)
Max non-heap memory utilization by executor: 49MB
Average non-heap memory utilization by executor: 35MB
28/09/2023 01:20:22 INFO [SparkScope] Driver stats:
Driver heap size: 910MB
Max heap memory utilization by driver: 262MB(28.87%)
Average heap memory utilization by driver: 207MB(22.78%)
Max non-heap memory utilization by driver: 67MB
Average non-heap memory utilization by driver: 65MB
28/09/2023 01:20:22 INFO [SparkScope] Cluster Memory stats:
Average Cluster heap memory utilization: 19.56% / 156MB
Max Cluster heap memory utilization: 35.80% / 286MB
heapGbHoursAllocated: 0.0033
heapGbHoursAllocated=(executorHeapSizeInGb(0.78125)*combinedExecutorUptimeInSec(15s))/3600
heapGbHoursWasted: 0.0006
heapGbHoursWasted=heapGbHoursAllocated(0.0033)*heapUtilization(0.1956)
28/09/2023 01:20:22 INFO [SparkScope] Cluster CPU stats:
Total CPU utilization: 68.35%
coreHoursAllocated: 0.0042
coreHoursAllocated=(executorCores(1)*combinedExecutorUptimeInSec(15s))/3600
coreHoursWasted: 0.0029
coreHoursWasted=coreHoursAllocated(0.0042)*cpuUtilization(0.6835)
28/09/2023 01:20:22 INFO [SparkScope] Wrote HTML report file to /tmp/app-20230928132004-0012.html
```
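For reference, the GB-hour and core-hour figures in the summary follow directly from the formulas printed alongside them. A quick sanity check using the sample numbers above (values are taken from the sample run, not computed fresh):

```scala
// Reproducing the formulas printed in the sample summary above.
val executorHeapGb = 800.0 / 1024          // 800MB executor heap => 0.78125 GB
val uptimeSec      = 15.0                  // combined executor uptime in seconds
val heapUtil       = 0.1956                // average heap utilization (19.56%)
val cpuUtil        = 0.6835                // total CPU utilization (68.35%)
val cores          = 1                     // executor cores

val heapGbHoursAllocated = executorHeapGb * uptimeSec / 3600  // ~0.0033
val heapGbHoursWasted    = heapGbHoursAllocated * heapUtil    // ~0.0006
val coreHoursAllocated   = cores * uptimeSec / 3600           // ~0.0042
val coreHoursWasted      = coreHoursAllocated * cpuUtil       // ~0.0029
```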