yzhan-te opened this issue 5 years ago
Hi,
The script should collect metrics regardless of exit code. Please try the following:
--conf spark.streaming.stopGracefullyOnShutdown=true
ps -ef | grep spark | grep java | grep -v grep | awk '{print $2}' | xargs kill -SIGTERM
See if metrics are collected this way. If the above doesn't help, please attach the output from the script here.
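As a side note, a slightly more targeted variant of the kill (a sketch, assuming the Spark client is launched via SparkSubmit on the same machine):
pgrep -f org.apache.spark.deploy.SparkSubmit | xargs -r kill -TERM    # -r: do nothing if no process matches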
Thanks!
Hi,
Thanks for replying! This is the command I'm using:
/usr/local/bin/spark-submit-flamegraph --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.2,com.thesamet.scalapb:scalapb-runtime_2.11:0.9.0,redis.clients:jedis:3.1.0,com.amazonaws:aws-java-sdk:1.7.4,com.typesafe:config:1.3.4,com.etsy:statsd-jvm-profiler:2.0.0 --conf spark.streaming.stopGracefullyOnShutdown=true --master yarn --deploy-mode cluster --class com.package.SparkApp spark-scala_2.11-1.0-SNAPSHOT.jar cluster
And I am getting the following when stopping the app:
19/08/09 20:20:06 INFO ShutdownHookManager: Shutdown hook called
19/08/09 20:20:06 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-dca35798-2712-4f86-b376-8463cb5d38fa
19/08/09 20:20:07 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-5e901536-9a80-4d27-98bf-6892d0ad6429
[2019-08-09T20:20:07,136864792+0000] Spark has exited with bad exit code (130)
[2019-08-09T20:20:07,154308000+0000] Collecting profiling metrics
[2019-08-09T20:20:07,524389370+0000] No profiling metrics were recorded!
Also if I kill it with this command:
ps -ef | grep spark | grep java | grep -v grep | grep -v zeppelin | awk '{print $2}' | sudo xargs kill -SIGTERM
Then the script doesn't print anything at all; the output ends with just:
19/08/09 20:20:06 INFO ShutdownHookManager: Shutdown hook called
19/08/09 20:20:06 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-dca35798-2712-4f86-b376-8463cb5d38fa
19/08/09 20:20:07 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-5e901536-9a80-4d27-98bf-6892d0ad6429
Can you please try the latest version from master? (f66976e70)
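For example (a sketch, assuming the script was installed from a git checkout of this repository):
git pull origin master    # or pin the exact commit:
git checkout f66976e70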
Should be fixed with latest master.
Hi there,
I tried the latest version and it's still not working. When I exit the job, it stays in the "Collecting profiling metrics" state for about 5 seconds and then ends with "No profiling metrics were recorded!". Are there any specific commands that I need to put in my code to get it to collect metrics? Also, I am running this on an EMR cluster in yarn-cluster mode. Do I need to make any changes to the worker machines?
Thanks!
Hi @yzhan-te,
No special configuration is needed; it's added automatically by the script. What can be checked:
1. The Spark application is launched with the profiler agent attached:
-javaagent:statsd-jvm-profiler.jar=server=<server>,port=<port>,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark
You can validate that by looking at the Java command line parameters: ps -ef | grep java
2. server= and port= in the above configuration contain a reachable hostname and port. The IP address is detected by the spark-submit-flamegraph script, and the port number is chosen randomly. This host and port are where the script is running, and all Spark components report their metrics through it.
Please report back if something is not as described above. Thanks!
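P.S. A quick way to verify that the host and port are reachable from one of the worker nodes (a sketch; netcat has to be available there):
nc -vz <server> <port>    # prints a success message and exits 0 if the port is open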
Yep, the settings are there:
-arg --driver-java-options --arg -javaagent:/home/hadoop/.spark-flamegraph/statsd-jvm-profiler.jar=server=<my_ip>,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark --properties-file
The port is reachable too. Are there any other things you want me to look at specifically?
Can you look at the Java process command line (ps -ef | grep java) and see whether these Java properties are present there?
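For instance, to print just the agent flags (a sketch relying on GNU grep's -o):
ps -ef | grep -o '[j]avaagent[^ ]*'    # the [j] keeps the grep process itself out of the matches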
Looks like it's also there:
hadoop 7290 7160 51 21:12 pts/0 00:00:27 /etc/alternatives/jre/bin/java -cp /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf/:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/usr/lib/spark/conf/:/usr/lib/spark/jars/*:/etc/hadoop/conf/ org.apache.spark.deploy.SparkSubmit --master yarn --deploy-mode cluster --conf spark.memory.storageFraction=0.1 --conf spark.memory.fraction=0.9 --conf spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=<myip>,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark --conf spark.streaming.stopGracefullyOnShutdown=true --class com.package.SparkApp --jars /home/hadoop/.spark-flamegraph/statsd-jvm-profiler.jar --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.2,com.thesamet.scalapb:scalapb-runtime_2.11:0.9.0,redis.clients:jedis:3.1.0,com.typesafe:config:1.3.4 --num-executors 4 --executor-cores 4 --executor-memory 3GB spark-scala_2.11-1.0-SNAPSHOT.jar cluster --driver-java-options -javaagent:/home/hadoop/.spark-flamegraph/statsd-jvm-profiler.jar=server=<myip>,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark
This is the Spark client application command line. What's interesting is the driver and executor processes themselves. Could you get them as well? Thanks!
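For reference, on the YARN nodes the driver and executor JVMs can usually be spotted with something like this (a sketch; the exact class names vary between Spark versions):
ps -ef | grep -E 'ApplicationMaster|CoarseGrainedExecutorBackend' | grep -v grep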
Hello,
Is there a way to set up this tool to collect metrics for Spark Structured Streaming? Currently we have a Spark job that pulls from Kafka constantly, and we have to shut it down manually every time. When I do the manual shutdown, the shell script complains that the return code is bad and no metrics have been collected.
Thanks!
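For reference, one workaround we might try (a hypothetical sketch, not something we currently run) is a wrapper that forwards termination signals to the spark-submit-flamegraph client, so its metric-collection step still gets a chance to run:
#!/bin/bash
# Hypothetical wrapper: forward INT/TERM to the child so the
# script's own shutdown/metric-collection logic can execute.
/usr/local/bin/spark-submit-flamegraph "$@" &
child=$!
trap 'kill -TERM "$child" 2>/dev/null' INT TERM
wait "$child"    # interrupted when the signal arrives
wait "$child"    # wait again for the child to finish cleanup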