SparkScope is a monitoring and profiling tool for Spark applications. It lets you review resource allocation, utilization, and demand as they evolved during Spark application execution. SparkScope presents this information in visual charts that make it easy to spot over- and under-provisioning.
It is implemented as a SparkListener, which means it runs inside the driver and listens for Spark events. SparkScope consumes the CSV metrics produced by the custom SparkScopeCsvSink and supports multiple storage types.
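For context, this is the standard Spark listener mechanism: a class extending org.apache.spark.scheduler.SparkListener is registered via spark.extraListeners and receives event callbacks on the driver. A minimal sketch of the mechanism (the listener below is illustrative, not SparkScope's actual implementation):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Illustrative listener: once registered via spark.extraListeners it runs
// inside the driver, and onApplicationEnd fires when the application
// finishes -- the point at which a tool like SparkScope can collect the
// CSV metrics and build its report.
class ReportOnEndListener extends SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit =
    println(s"Application ended at ${end.time}; analyze metrics here.")
}
```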
SparkScope produces reports in two formats: HTML and JSON.
SparkScope HTML reports contain the following features:
Stats:
Charts:
- total % of utilized CPU/heap charts
- total utilization vs allocation CPU/heap charts
- number of tasks vs CPU capacity and number of executors
- heap and non-heap charts for executors
- heap and non-heap charts for driver
Warnings:
| | Spark 2 (spark2 branch) | Spark 3 (main branch) |
|---|---|---|
| Scala version | 2.11.12 | 2.12.18 |
| compatible JDK versions | 7, 8 | 8, 11, 17 |
| compatible Spark versions | 2.3, 2.4 | 3.2, 3.3, 3.4, 3.5 |
| parameter | required | sample values | description |
|---|---|---|---|
| spark.extraListeners | mandatory | com.ucesys.sparkscope.SparkScopeJobListener | Spark listener class |
| spark.metrics.conf.driver.source.jvm.class | mandatory | org.apache.spark.metrics.source.JvmSource | JVM metrics source for the driver |
| spark.metrics.conf.executor.source.jvm.class | mandatory | org.apache.spark.metrics.source.JvmSource | JVM metrics source for executors |
| spark.metrics.conf.*.sink.csv.class | mandatory | org.apache.spark.metrics.sink.SparkScopeCsvSink | CSV sink class |
| spark.metrics.conf.*.sink.csv.period | mandatory | 5 | period of the metrics spill |
| spark.metrics.conf.*.sink.csv.unit | mandatory | seconds | unit of the metrics spill period |
| spark.metrics.conf.*.sink.csv.directory | mandatory | s3://my-bucket/path/to/metrics | path to the metrics directory; can be s3, hdfs, maprfs, or local |
| spark.metrics.conf.*.sink.csv.region | optional | us-east-1 | AWS region, required for s3 storage |
| spark.metrics.conf.*.sink.csv.appName | optional | MyApp | application name, also used for grouping metrics |
| spark.sparkscope.report.html.path | optional | s3://my-bucket/path/to/html/report/dir | path to which the SparkScope HTML report will be saved |
| spark.sparkscope.report.json.path | optional | s3://my-bucket/path/to/json/report/dir | path to which the SparkScope JSON report will be saved |
| spark.sparkscope.log.path | optional | s3://my-bucket/path/to/log/dir | path to which SparkScope logs will be saved |
| spark.sparkscope.log.level | optional | DEBUG, INFO, WARN, ERROR | logging level for SparkScope logs |
| spark.sparkscope.diagnostics.enabled | optional | true/false | set to false to disable submitting diagnostics; default is true |
| spark.sparkscope.metrics.dir.driver | optional | s3://my-bucket/path/to/metrics | path to driver CSV metrics, relative to the driver; defaults to the spark.metrics.conf.driver.sink.csv.directory property value |
| spark.sparkscope.metrics.dir.executor | optional | s3://my-bucket/path/to/metrics | path to executor CSV metrics, relative to the driver; defaults to the spark.metrics.conf.executor.sink.csv.directory property value |
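The same parameters can also be set programmatically before the session is created. A minimal sketch, assuming the SparkScope jar is already on the driver and executor classpaths; paths and the app name below are placeholders, not SparkScope defaults:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: wiring the mandatory SparkScope parameters programmatically.
// The SparkScope jar must still be distributed to the driver and executors
// (e.g. via --driver-class-path and spark.executor.extraClassPath).
val spark = SparkSession.builder()
  .appName("MyApp")
  .config("spark.extraListeners", "com.ucesys.sparkscope.SparkScopeJobListener")
  .config("spark.metrics.conf.driver.source.jvm.class", "org.apache.spark.metrics.source.JvmSource")
  .config("spark.metrics.conf.executor.source.jvm.class", "org.apache.spark.metrics.source.JvmSource")
  .config("spark.metrics.conf.*.sink.csv.class", "org.apache.spark.metrics.sink.SparkScopeCsvSink")
  .config("spark.metrics.conf.*.sink.csv.period", "5")
  .config("spark.metrics.conf.*.sink.csv.unit", "seconds")
  .config("spark.metrics.conf.*.sink.csv.directory", "/tmp/csv-metrics")      // placeholder path
  .config("spark.sparkscope.report.html.path", "/tmp/sparkscope-html")        // placeholder path
  .getOrCreate()
```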
Notes:
Example: metrics and HTML report written to S3:
```
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.extraListeners=com.ucesys.sparkscope.SparkScopeJobListener \
--conf spark.metrics.conf.driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink \
--conf spark.metrics.conf.*.sink.csv.period=5 \
--conf spark.metrics.conf.*.sink.csv.unit=seconds \
--conf spark.metrics.conf.*.sink.csv.directory=s3://<bucket-name>/<path-to-metrics-dir> \
--conf spark.metrics.conf.*.sink.csv.region=<region> \
--conf spark.metrics.conf.*.sink.csv.appName=My-App \
--conf spark.sparkscope.report.html.path=s3://<bucket-name>/<path-to-html-report-dir> \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
```
Example: metrics and HTML report written to HDFS:
```
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.extraListeners=com.ucesys.sparkscope.SparkScopeJobListener \
--conf spark.metrics.conf.driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink \
--conf spark.metrics.conf.*.sink.csv.period=5 \
--conf spark.metrics.conf.*.sink.csv.unit=seconds \
--conf spark.metrics.conf.*.sink.csv.directory=hdfs://<path-to-metrics-dir> \
--conf spark.metrics.conf.*.sink.csv.appName=My-App \
--conf spark.sparkscope.report.html.path=hdfs://<path-to-html-report-dir> \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
```
Example: metrics and HTML report written to local or default filesystem paths:
```
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.extraListeners=com.ucesys.sparkscope.SparkScopeJobListener \
--conf spark.metrics.conf.driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink \
--conf spark.metrics.conf.*.sink.csv.period=5 \
--conf spark.metrics.conf.*.sink.csv.unit=seconds \
--conf spark.metrics.conf.*.sink.csv.directory=<path-to-metrics-dir> \
--conf spark.metrics.conf.*.sink.csv.appName=My-App \
--conf spark.sparkscope.report.html.path=<path-to-html-report-dir> \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
```
Instead of specifying the spark.metrics.conf.* settings as separate properties, we can also specify them in a metrics.properties file:
```
# Enable CsvSink for all instances by class name
*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink
# Polling period for the CsvSink
*.sink.csv.period=5
# Unit of the polling period for the CsvSink
*.sink.csv.unit=seconds
# Polling directory for CsvSink
*.sink.csv.directory=hdfs:///tmp/csv-metrics
# JVM SOURCE
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```
The path to the metrics.properties file is then passed in the spark-submit command:
```
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar,./metrics.properties \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.extraListeners=com.ucesys.sparkscope.SparkScopeJobListener \
--conf spark.metrics.conf=./metrics.properties \
--conf spark.sparkscope.report.html.path=hdfs://<path-to-html-report-dir> \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
```
SparkScope can also analyze a finished application offline. Your application needs to have the event log and metrics configured (but not the listener):
```
spark-submit \
--master yarn \
--files ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--driver-class-path ./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.executor.extraClassPath=./sparkscope-spark3-0.1.9-SNAPSHOT.jar \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=<path-to-event-log-dir> \
--conf spark.metrics.conf.driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource \
--conf spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.SparkScopeCsvSink \
--conf spark.metrics.conf.*.sink.csv.period=5 \
--conf spark.metrics.conf.*.sink.csv.unit=seconds \
--conf spark.metrics.conf.*.sink.csv.directory=<path-to-metrics-dir> \
--conf spark.metrics.conf.*.sink.csv.appName=My-App \
--class org.apache.spark.examples.SparkPi \
./spark-examples_2.10-1.1.1.jar 5000
```
SparkScope is then run as a standalone java app against the recorded event log:
```
java \
-cp ./sparkscope-spark3-0.1.9-SNAPSHOT.jar:$(hadoop classpath) \
com.ucesys.sparkscope.SparkScopeApp \
--event-log <path-to-event-log> \
--html-path <path-to-html-report-dir> \
--json-path <path-to-json-report-dir> \
--log-path <path-to-log-dir> \
--log-level <logging level> \
--diagnostics <true/false> \
--region <aws-region>
```
The SparkScope analysis summary is printed to the console:
```
28/09/2023 01:20:22 INFO [SparkScope] SparkScope analysis took 0.052s
28/09/2023 01:20:22 INFO [SparkScope]
____ __ ____
/ __/__ ___ _____/ /__ / __/_ ___ ___ ___
_\ \/ _ \/ _ `/ __/ '_/_\ \/_ / _ \/ _ \/__/
/___/ .__/\_,_/_/ /_/\_\/___/\__\_,_/ .__/\___/
/_/ /_/ v0.1.1
28/09/2023 01:20:22 INFO [SparkScope] Executor stats:
Executor heap size: 800MB
Max heap memory utilization by executor: 286MB(35.80%)
Average heap memory utilization by executor: 156MB(19.56%)
Max non-heap memory utilization by executor: 49MB
Average non-heap memory utilization by executor: 35MB
28/09/2023 01:20:22 INFO [SparkScope] Driver stats:
Driver heap size: 910MB
Max heap memory utilization by driver: 262MB(28.87%)
Average heap memory utilization by driver: 207MB(22.78%)
Max non-heap memory utilization by driver: 67MB
Average non-heap memory utilization by driver: 65MB
28/09/2023 01:20:22 INFO [SparkScope] Cluster Memory stats:
Average Cluster heap memory utilization: 19.56% / 156MB
Max Cluster heap memory utilization: 35.80% / 286MB
heapGbHoursAllocated: 0.0033
heapGbHoursAllocated=(executorHeapSizeInGb(0.78125)*combinedExecutorUptimeInSec(15s))/3600
heapGbHoursWasted: 0.0006
heapGbHoursWasted=heapGbHoursAllocated(0.0033)*heapUtilization(0.1956)
28/09/2023 01:20:22 INFO [SparkScope] Cluster CPU stats:
Total CPU utilization: 68.35%
coreHoursAllocated: 0.0042
coreHoursAllocated=(executorCores(1)*combinedExecutorUptimeInSec(15s))/3600
coreHoursWasted: 0.0029
coreHoursWasted=coreHoursAllocated(0.0042)*cpuUtilization(0.6835)
28/09/2023 01:20:22 INFO [SparkScope] Wrote HTML report file to /tmp/app-20230928132004-0012.html
```
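For reference, the GB-hour and core-hour figures in the summary follow directly from the formulas printed alongside them. A quick sanity check using the sample numbers above (values are taken from the sample run, not computed fresh):

```scala
// Reproducing the formulas printed in the sample summary above.
val executorHeapGb = 800.0 / 1024          // 800MB executor heap => 0.78125 GB
val uptimeSec      = 15.0                  // combined executor uptime in seconds
val heapUtil       = 0.1956                // average heap utilization (19.56%)
val cpuUtil        = 0.6835                // total CPU utilization (68.35%)
val cores          = 1                     // executor cores

val heapGbHoursAllocated = executorHeapGb * uptimeSec / 3600  // ~0.0033
val heapGbHoursWasted    = heapGbHoursAllocated * heapUtil    // ~0.0006
val coreHoursAllocated   = cores * uptimeSec / 3600           // ~0.0042
val coreHoursWasted      = coreHoursAllocated * cpuUtil       // ~0.0029
```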