uber-common / jvm-profiler

JVM Profiler Sending Metrics to Kafka, Console Output or Custom Reporter

Spark Executor extraJavaOptions not working #42

Closed: ramshankar1990 closed this issue 5 years ago

ramshankar1990 commented 5 years ago

Hello,

I have been using the JVM profiler on a supercomputing cluster for a bit now. The spark.driver.extraJavaOptions string seems to work fine, and the Console Reporter writes profiling information to the output log file. When I use the spark.executor.extraJavaOptions string, the profiler doesn't seem to work. Spark on the cluster is set up as standalone and runs in client mode. I was hoping you could help me with this.

The command I use to run the profiler is:

spark-submit --executor-cores 8 --driver-memory 20G --executor-memory 20G \
  --conf spark.driver.extraJavaOptions=-javaagent:/home/rvenka21/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter \
  --conf spark.executor.extraJavaOptions=-javaagent:/home/rvenka21/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter \
  --class Main "tool to be profiled"

Log information:

Java Agent 1.0.0 premain args: reporter=com.uber.profiling.reporters.ConsoleOutputReporter
1554139518375 com.uber.profiling.Arguments: Got argument value for reporter: com.uber.profiling.reporters.ConsoleOutputReporter
ConsoleOutputReporter - ProcessInfo: {"jvmInputArguments":"","role":"driver","jvmClassPath":"","epochMillis":1554139518509,"cmdline":"/usr/java/latest/bin/java -cp /home/rvenka21/mycluster.conf/spark/:/share/apps/compute/spark/spark-2.4.0-bin-hadoop2.6/jars/*:/home/rvenka21/mycluster.conf/ -Xmx20G -javaagent:/home/rvenka21/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=20G --conf spark.executor.extraJavaOptions=-javaagent:/home/rvenka21/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter --conf spark.driver.extraJavaOptions=-javaagent:/home/rvenka21/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter --class Main --executor-cores 8 --executor-memory 20G /oasis/projects/nsf/uic367/rvenka21/Genomics_EpiQuant/SPAEML/target/SPAEML-0.0.1-jar-with-dependencies.jar StepwiseModelSelection --epiq /user/input/SNP.200.Sample.300.Genotype.epiq -P /user/input/SNP.200.Sample.300.Phenotype.epiq -o /user/input/cluster_mode ","appId":null,"name":"11612@comet-03-02.sdsc.edu","host":"comet-03-02","processUuid":"bca1b961-e27a-42bd-ad57-bedbb8d4c095","agentVersion":"1.0.0","appClass":"Main","xmxBytes":21474836480,"appJar":null}
1554139519286 com.uber.profiling.AgentImpl: Finished one time profiler: com.uber.profiling.profilers.ProcessInfoProfiler@5427c60c

With Regards, Ram

hiboyang commented 5 years ago

Hi Ram, glad you are trying the JVM profiler! We run it with Spark in cluster mode and haven't tried standalone/client mode. Do you have any error logs we could take a look at? Also, if you figure this out, you're welcome to contribute a code change or fix as well!

ramshankar1990 commented 5 years ago

I have edited my post with the command I am running and the Console Reporter's output to the log file. I hope this gives more information. I can email you the entire log file too if you like.

hiboyang commented 5 years ago

Hi Ram, thanks for those commands! Could you paste some error logs here so we can take a quick look?

ramshankar1990 commented 5 years ago

Hi,

The information from the error logs:

STARTUP_MSG: build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1; compiled by 'jenkins' on 2014-11-13T21:10Z
STARTUP_MSG: java = 1.8.0_60
****/
19/04/01 13:04:02 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
19/04/01 13:04:02 INFO namenode.NameNode: createNameNode [-format]
19/04/01 13:04:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/04/01 13:04:02 WARN common.Util: Path /scratch/rvenka21/22267467/namenode_data should be specified as a URI in configuration files. Please update hdfs configuration.
19/04/01 13:04:02 WARN common.Util: Path /scratch/rvenka21/22267467/namenode_data should be specified as a URI in configuration files. Please update hdfs configuration.
19/04/01 13:04:02 INFO namenode.FSNamesystem: No KeyProvider found.
19/04/01 13:04:02 INFO namenode.FSNamesystem: fsLock is fair:true
19/04/01 13:04:02 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
19/04/01 13:04:02 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
19/04/01 13:04:02 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
19/04/01 13:04:02 INFO blockmanagement.BlockManager: The block deletion will start around 2019 Apr 01 13:04:02
19/04/01 13:04:02 INFO util.GSet: Computing capacity for map BlocksMap
19/04/01 13:04:02 INFO util.GSet: VM type = 64-bit
19/04/01 13:04:02 INFO util.GSet: 2.0% max memory 958.5 MB = 19.2 MB
19/04/01 13:04:02 INFO util.GSet: capacity = 2^21 = 2097152 entries
19/04/01 13:04:02 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
19/04/01 13:04:02 INFO blockmanagement.BlockManager: defaultReplication = 3
19/04/01 13:04:02 INFO blockmanagement.BlockManager: maxReplication = 512
19/04/01 13:04:02 INFO blockmanagement.BlockManager: minReplication = 1
19/04/01 13:04:02 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
19/04/01 13:04:02 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks = false
19/04/01 13:04:02 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
19/04/01 13:04:02 INFO blockmanagement.BlockManager: encryptDataTransfer = false
19/04/01 13:04:02 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
19/04/01 13:04:02 INFO namenode.FSNamesystem: fsOwner = rvenka21 (auth:SIMPLE)
19/04/01 13:04:02 INFO namenode.FSNamesystem: supergroup = supergroup
19/04/01 13:04:02 INFO namenode.FSNamesystem: isPermissionEnabled = true
19/04/01 13:04:02 INFO namenode.FSNamesystem: HA Enabled: false
19/04/01 13:04:02 INFO namenode.FSNamesystem: Append Enabled: true
19/04/01 13:04:03 INFO util.GSet: Computing capacity for map INodeMap
19/04/01 13:04:03 INFO util.GSet: VM type = 64-bit
19/04/01 13:04:03 INFO util.GSet: 1.0% max memory 958.5 MB = 9.6 MB
19/04/01 13:04:03 INFO util.GSet: capacity = 2^20 = 1048576 entries
19/04/01 13:04:03 INFO namenode.NameNode: Caching file names occuring more than 10 times
19/04/01 13:04:03 INFO util.GSet: Computing capacity for map cachedBlocks
19/04/01 13:04:03 INFO util.GSet: VM type = 64-bit
19/04/01 13:04:03 INFO util.GSet: 0.25% max memory 958.5 MB = 2.4 MB
19/04/01 13:04:03 INFO util.GSet: capacity = 2^18 = 262144 entries
19/04/01 13:04:03 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
19/04/01 13:04:03 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
19/04/01 13:04:03 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
19/04/01 13:04:03 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
19/04/01 13:04:03 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
19/04/01 13:04:03 INFO util.GSet: Computing capacity for map NameNodeRetryCache
19/04/01 13:04:03 INFO util.GSet: VM type = 64-bit
19/04/01 13:04:03 INFO util.GSet: 0.029999999329447746% max memory 958.5 MB = 294.5 KB
19/04/01 13:04:03 INFO util.GSet: capacity = 2^15 = 32768 entries
19/04/01 13:04:03 INFO namenode.NNConf: ACLs enabled? false
19/04/01 13:04:03 INFO namenode.NNConf: XAttrs enabled? true
19/04/01 13:04:03 INFO namenode.NNConf: Maximum size of an xattr: 16384
19/04/01 13:04:03 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1621996323-198.202.114.59-1554149043299
19/04/01 13:04:03 INFO common.Storage: Storage directory /scratch/rvenka21/22267467/namenode_data has been successfully formatted.
19/04/01 13:04:03 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
19/04/01 13:04:03 INFO util.ExitUtil: Exiting with status 0
19/04/01 13:04:03 INFO namenode.NameNode: SHUTDOWN_MSG: /**** SHUTDOWN_MSG: Shutting down NameNode at comet-03-34.sdsc.edu/198.202.114.59 ****/
19/04/01 13:04:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/04/01 13:04:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/04/01 13:04:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/04/01 13:04:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/04/01 13:04:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/04/01 13:04:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
put: `.': No such file or directory
Exception in thread "block-manager-slave-async-thread-pool-60" java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:714)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
    at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1018)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
1554149142287 Running java agent shutdown
1554149142287 Running periodic profiler (last run): com.uber.profiling.profilers.CpuAndMemoryProfiler@5e57643e
1554149142289 Ran periodic profiler (last run): com.uber.profiling.profilers.CpuAndMemoryProfiler@5e57643e
1554149142289 Closing reporter com.uber.profiling.reporters.ConsoleOutputReporter@5e1edadf
1554149142289 Closed reporter com.uber.profiling.reporters.ConsoleOutputReporter@5e1edadf
19/04/01 13:07:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/04/01 13:07:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/04/01 13:07:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
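
(A side note on this particular failure: "unable to create new native thread" usually indicates an OS-level limit on processes/threads rather than JVM heap exhaustion, which matters on shared supercomputing nodes. A sketch of the Linux limits one would typically check:)

ulimit -u                          # per-user process/thread limit
cat /proc/sys/kernel/threads-max   # system-wide thread ceiling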

ramshankar1990 commented 5 years ago

I am also attaching some information from the output logs:

mySpark: Starting worker on comet-22-49.sdsc.edu: /share/apps/compute/spark/spark-2.4.0-bin-hadoop2.6/sbin/spark-daemon.sh --config /home/rvenka21/mycluster.conf/spark start org.apache.spark.deploy.worker.Worker 1 spark://comet-03-34.sdsc.edu:7077
starting org.apache.spark.deploy.worker.Worker, logging to /scratch/rvenka21/22267467/logs/spark-rvenka21-org.apache.spark.deploy.worker.Worker-1-comet-22-49.sdsc.edu.out
mySpark: Starting worker on comet-26-55.sdsc.edu: /share/apps/compute/spark/spark-2.4.0-bin-hadoop2.6/sbin/spark-daemon.sh --config /home/rvenka21/mycluster.conf/spark start org.apache.spark.deploy.worker.Worker 1 spark://comet-03-34.sdsc.edu:7077
starting org.apache.spark.deploy.worker.Worker, logging to /scratch/rvenka21/22267467/logs/spark-rvenka21-org.apache.spark.deploy.worker.Worker-1-comet-26-55.sdsc.edu.out
/oasis/projects/nsf/uic367/rvenka21
Java Agent 1.0.0 premain args: reporter=com.uber.profiling.reporters.ConsoleOutputReporter
1554149077196 com.uber.profiling.Arguments: Got argument value for reporter: com.uber.profiling.reporters.ConsoleOutputReporter
ConsoleOutputReporter - ProcessInfo: {"jvmInputArguments":"","role":"driver","jvmClassPath":"","epochMillis":1554149077334,"cmdline":"/usr/java/latest/bin/java -cp /home/rvenka21/mycluster.conf/spark/:/share/apps/compute/spark/spark-2.4.0-bin-hadoop2.6/jars/*:/home/rvenka21/mycluster.conf/ -Xmx20G -javaagent:/home/rvenka21/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=20G --conf spark.executor.extraJavaOptions=-javaagent:/home/rvenka21/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter --conf spark.driver.extraJavaOptions=-javaagent:/home/rvenka21/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter --class Main --executor-cores 8 --executor-memory 20G /oasis/projects/nsf/uic367/rvenka21/Genomics_EpiQuant/SPAEML/target/SPAEML-0.0.1-jar-with-dependencies.jar StepwiseModelSelection --epiq /user/input/SNP.200.Sample.300.Genotype.epiq -P /user/input/SNP.200.Sample.300.Phenotype.epiq -o /user/input/cluster_mode ","appId":null,"name":"6816@comet-03-34.sdsc.edu","host":"comet-03-34","processUuid":"76c9fb63-beff-458b-a6f3-a8a3ac9eff5b","agentVersion":"1.0.0","appClass":"Main","xmxBytes":21474836480,"appJar":null}
1554149078124 com.uber.profiling.AgentImpl: Finished one time profiler: com.uber.profiling.profilers.ProcessInfoProfiler@5427c60c
ConsoleOutputReporter - CpuAndMemory: {"role":"driver","nonHeapMemoryTotalUsed":1.1647592E7,"bufferPools":[{"totalCapacity":8191,"name":"direct","count":1,"memoryUsed":8191},{"totalCapacity":0,"name":"mapped","count":0,"memoryUsed":0}],"heapMemoryTotalUsed":9.5180064E7,"vmRSS":138674176,"epochMillis":1554149078129,"nonHeapMemoryCommitted":1.409024E7,"heapMemoryCommitted":2.024275968E9,"memoryPools":[{"peakUsageMax":251658240,"usageMax":251658240,"peakUsageUsed":506880,"name":"Code Cache","peakUsageCommitted":2555904,"usageUsed":506880,"type":"Non-heap memory","usageCommitted":2555904},{"peakUsageMax":-1,"usageMax":-1,"peakUsageUsed":10024312,"name":"Metaspace","peakUsageCommitted":10354688,"usageUsed":10024312,"type":"Non-heap memory","usageCommitted":10354688},{"peakUsageMax":1073741824,"usageMax":1073741824,"peakUsageUsed":1118112,"name":"Compressed Class Space","peakUsageCommitted":1179648,"usageUsed":1118112,"type":"Non-heap memory","usageCommitted":1179648},{"peakUsageMax":6982991872,"usageMax":6982991872,"peakUsageUsed":95180064,"name":"PS Eden Space","peakUsageCommitted":528482304,"usageUsed":95180064,"type":"Heap memory","usageCommitted":528482304},{"peakUsageMax":87556096,"usageMax":87556096,"peakUsageUsed":0,"name":"PS Survivor Space","peakUsageCommitted":87556096,"usageUsed":0,"type":"Heap memory","usageCommitted":87556096},{"peakUsageMax":14316732416,"usageMax":14316732416,"peakUsageUsed":0,"name":"PS Old Gen","peakUsageCommitted":1408237568,"usageUsed":0,"type":"Heap memory","usageCommitted":1408237568}],"processCpuLoad":2.119160531952671E-8,"systemCpuLoad":0.5732152138477711,"processCpuTime":1110000000,"vmHWM":138674176,"appId":null,"name":"6816@comet-03-34.sdsc.edu","host":"comet-03-34","processUuid":"76c9fb63-beff-458b-a6f3-a8a3ac9eff5b","gc":[{"collectionTime":0,"name":"PS Scavenge","collectionCount":0},{"collectionTime":0,"name":"PS MarkSweep","collectionCount":0}]}
1554149078140 com.uber.profiling.AgentImpl: Ran periodic profiler (first run): com.uber.profiling.profilers.CpuAndMemoryProfiler@5e57643e
1554149078146 com.uber.profiling.AgentImpl: Scheduled profiler com.uber.profiling.profilers.CpuAndMemoryProfiler@5e57643e with interval 60000 millis
ConsoleOutputReporter - CpuAndMemory: {"role":"driver","nonHeapMemoryTotalUsed":1.1881584E7,"bufferPools":[{"totalCapacity":8192,"name":"direct","count":1,"memoryUsed":8192},{"totalCapacity":0,"name":"mapped","count":0,"memoryUsed":0}],"heapMemoryTotalUsed":1.05749736E8,"vmRSS":141107200,"epochMillis":1554149078149,"nonHeapMemoryCommitted":1.4352384E7,"heapMemoryCommitted":2.024275968E9,"memoryPools":[{"peakUsageMax":251658240,"usageMax":251658240,"peakUsageUsed":510592,"name":"Code Cache","peakUsageCommitted":2555904,"usageUsed":510592,"type":"Non-heap memory","usageCommitted":2555904},{"peakUsageMax":-1,"usageMax":-1,"peakUsageUsed":10220920,"name":"Metaspace","peakUsageCommitted":10616832,"usageUsed":10220920,"type":"Non-heap memory","usageCommitted":10616832},{"peakUsageMax":1073741824,"usageMax":1073741824,"peakUsageUsed":1150072,"name":"Compressed Class Space","peakUsageCommitted":1179648,"usageUsed":1150072,"type":"Non-heap memory","usageCommitted":1179648},{"peakUsageMax":6982991872,"usageMax":6982991872,"peakUsageUsed":105749736,"name":"PS Eden Space","peakUsageCommitted":528482304,"usageUsed":105749736,"type":"Heap memory","usageCommitted":528482304},{"peakUsageMax":87556096,"usageMax":87556096,"peakUsageUsed":0,"name":"PS Survivor Space","peakUsageCommitted":87556096,"usageUsed":0,"type":"Heap memory","usageCommitted":87556096},{"peakUsageMax":14316732416,"usageMax":14316732416,"peakUsageUsed":0,"name":"PS Old Gen","peakUsageCommitted":1408237568,"usageUsed":0,"type":"Heap memory","usageCommitted":1408237568}],"processCpuLoad":0.0625,"systemCpuLoad":0.0851063829787234,"processCpuTime":1130000000,"vmHWM":141107200,"appId":null,"name":"6816@comet-03-34.sdsc.edu","host":"comet-03-34","processUuid":"76c9fb63-beff-458b-a6f3-a8a3ac9eff5b","gc":[{"collectionTime":0,"name":"PS Scavenge","collectionCount":0},{"collectionTime":0,"name":"PS MarkSweep","collectionCount":0}]}
2019-04-01 13:04:38 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-04-01 13:04:39 INFO SparkContext:54 - Running Spark version 2.4.0
2019-04-01 13:04:39 INFO SparkContext:54 - Submitted application: Main
.............. .............. .............. .............. ..............
1554149142287 Running java agent shutdown
1554149142287 Running periodic profiler (last run): com.uber.profiling.profilers.CpuAndMemoryProfiler@5e57643e
ConsoleOutputReporter - CpuAndMemory: {"role":"driver","nonHeapMemoryTotalUsed":1.09398352E8,"bufferPools":[{"totalCapacity":1683671,"name":"direct","count":27,"memoryUsed":1683672},{"totalCapacity":0,"name":"mapped","count":0,"memoryUsed":0}],"heapMemoryTotalUsed":1.466470224E9,"vmRSS":1928953856,"epochMillis":1554149142289,"nonHeapMemoryCommitted":1.11607808E8,"heapMemoryCommitted":3.556245504E9,"memoryPools":[{"peakUsageMax":251658240,"usageMax":251658240,"peakUsageUsed":27387968,"name":"Code Cache","peakUsageCommitted":28180480,"usageUsed":27374144,"type":"Non-heap memory","usageCommitted":28180480},{"peakUsageMax":-1,"usageMax":-1,"peakUsageUsed":73057368,"name":"Metaspace","peakUsageCommitted":74211328,"usageUsed":73057368,"type":"Non-heap memory","usageCommitted":74211328},{"peakUsageMax":1073741824,"usageMax":1073741824,"peakUsageUsed":8966840,"name":"Compressed Class Space","peakUsageCommitted":9216000,"usageUsed":8966840,"type":"Non-heap memory","usageCommitted":9216000},{"peakUsageMax":6982991872,"usageMax":6900678656,"peakUsageUsed":1125620104,"name":"PS Eden Space","peakUsageCommitted":1271922688,"usageUsed":1125620104,"type":"Heap memory","usageCommitted":1271922688},{"peakUsageMax":171442176,"usageMax":99614720,"peakUsageUsed":99205336,"name":"PS Survivor Space","peakUsageCommitted":171442176,"usageUsed":99205336,"type":"Heap memory","usageCommitted":99614720},{"peakUsageMax":14316732416,"usageMax":14316732416,"peakUsageUsed":241821552,"name":"PS Old Gen","peakUsageCommitted":2184708096,"usageUsed":241821552,"type":"Heap memory","usageCommitted":2184708096}],"processCpuLoad":0.010472258584231195,"systemCpuLoad":0.011177122142785219,"processCpuTime":40470000000,"vmHWM":1928978432,"appId":null,"name":"6816@comet-03-34.sdsc.edu","host":"comet-03-34","processUuid":"76c9fb63-beff-458b-a6f3-a8a3ac9eff5b","gc":[{"collectionTime":90,"name":"PS Scavenge","collectionCount":8},{"collectionTime":127,"name":"PS MarkSweep","collectionCount":3}]}
1554149142289 Ran periodic profiler (last run): com.uber.profiling.profilers.CpuAndMemoryProfiler@5e57643e
1554149142289 Closing reporter com.uber.profiling.reporters.ConsoleOutputReporter@5e1edadf
1554149142289 Closed reporter com.uber.profiling.reporters.ConsoleOutputReporter@5e1edadf
ConsoleOutputReporter - CpuAndMemory: {"role":"driver","nonHeapMemoryTotalUsed":1.09350728E8,"bufferPools":[{"totalCapacity":1691863,"name":"direct","count":28,"memoryUsed":1691864},{"totalCapacity":0,"name":"mapped","count":0,"memoryUsed":0}],"heapMemoryTotalUsed":1.497689328E9,"vmRSS":1937358848,"epochMillis":1554149198147,"nonHeapMemoryCommitted":1.1173888E8,"heapMemoryCommitted":3.556245504E9,"memoryPools":[{"peakUsageMax":251658240,"usageMax":251658240,"peakUsageUsed":27481280,"name":"Code Cache","peakUsageCommitted":28311552,"usageUsed":27240512,"type":"Non-heap memory","usageCommitted":28311552},{"peakUsageMax":-1,"usageMax":-1,"peakUsageUsed":73129624,"name":"Metaspace","peakUsageCommitted":74211328,"usageUsed":73129624,"type":"Non-heap memory","usageCommitted":74211328},{"peakUsageMax":1073741824,"usageMax":1073741824,"peakUsageUsed":8980592,"name":"Compressed Class Space","peakUsageCommitted":9216000,"usageUsed":8980592,"type":"Non-heap memory","usageCommitted":9216000},{"peakUsageMax":6982991872,"usageMax":6900678656,"peakUsageUsed":1156662440,"name":"PS Eden Space","peakUsageCommitted":1271922688,"usageUsed":1156662440,"type":"Heap memory","usageCommitted":1271922688},{"peakUsageMax":171442176,"usageMax":99614720,"peakUsageUsed":99205336,"name":"PS Survivor Space","peakUsageCommitted":171442176,"usageUsed":99205336,"type":"Heap memory","usageCommitted":99614720},{"peakUsageMax":14316732416,"usageMax":14316732416,"peakUsageUsed":241821552,"name":"PS Old Gen","peakUsageCommitted":2184708096,"usageUsed":241821552,"type":"Heap memory","usageCommitted":2184708096}],"processCpuLoad":2.7606789778026486E-4,"systemCpuLoad":9.550385746049274E-4,"processCpuTime":40840000000,"vmHWM":1954996224,"appId":null,"name":"6816@comet-03-34.sdsc.edu","host":"comet-03-34","processUuid":"76c9fb63-beff-458b-a6f3-a8a3ac9eff5b","gc":[{"collectionTime":90,"name":"PS Scavenge","collectionCount":8},{"collectionTime":127,"name":"PS MarkSweep","collectionCount":3}]}

ramshankar1990 commented 5 years ago

I hope the information above was sufficient and gives you a better idea of the issue that I'm facing.

hiboyang commented 5 years ago

Hi Ram, I see the exception is OutOfMemoryError. Looks like you need to give more memory to your Spark application?
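
(For reference, a hedged sketch of the spark-submit knobs one would typically tune in this situation; the values are illustrative, and --total-executor-cores applies to standalone mode:)

spark-submit \
  --driver-memory 20G \
  --executor-memory 20G \
  --executor-cores 4 \
  --total-executor-cores 16 \
  ...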

ramshankar1990 commented 5 years ago

Hi, that issue was caused by the spark-submit parameters I used for my run. Some executors were being dropped because of the out-of-memory issue. I changed the parameters and no longer get the out-of-memory error.

As you can see in my log file, the Console Reporter has the line "{"role":"driver","nonHeapMemoryTotalUsed":1.09350728E8,"bufferPools":....."

but nothing is being reported for the executors, even though I pass the Console Reporter Java agent via spark.executor.extraJavaOptions.
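
(A note for readers with the same symptom: the ConsoleOutputReporter writes to each JVM's own stdout, so executor metrics appear in the executor's stdout on the worker node, not in the driver's output. A sketch of where to look in Spark standalone mode, assuming the default work directory; <app-id> and <executor-id> are placeholders:)

# On each standalone worker node, executor output lands under the worker's
# work directory (default $SPARK_HOME/work, configurable via SPARK_WORKER_DIR).
ls $SPARK_HOME/work/<app-id>/<executor-id>/
cat $SPARK_HOME/work/<app-id>/<executor-id>/stdout    # profiler lines show up here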

hiboyang commented 5 years ago

Do you have logs or console output from your executors? We could check from there whether the JVM profiler is enabled.

ramshankar1990 commented 5 years ago

All the information that I have in my log files related to the JVM profiler is in the two comments above. From what I can see in the log files, the JVM profiler is not being placed on the executors, but it is being recognized by the driver.

hiboyang commented 5 years ago

Yeah, I understand the issue. I mean, if you could find and paste logs from your Spark executors, we might be able to take a look and see what is going on there.

ramshankar1990 commented 5 years ago

Hi, the admins showed me how to access the executor's stdout and I can see the profiler running there.

hiboyang commented 5 years ago

That is great to hear! Thanks for letting us know!

ramshankar1990 commented 5 years ago

Thank you for your quick responses and help so far. I just have one last question, unrelated to the topic of discussion. I am using the Duration Profiling flag to check how many times a function is called and how long it runs for. The only information I see in the logs related to Duration Profiling is this:

1554397366559 com.uber.profiling.transformers.JavaAgentFileTransformer: Transforming class: spaeml.SPAEML$$anonfun$2

There is no information on the start and end times for Duration Profiling, similar to what the StackTrace option provides. Also, may I know if I can use the profiler to get information on cross-node communication as well?

hiboyang commented 5 years ago

What command line/arguments did you use for Duration Profiling?

ramshankar1990 commented 5 years ago

The following is how I use the Duration Profiling parameter in my spark-submit command.

spark.executor.extraJavaOptions=-javaagent:/home/rvenka21/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter,metricInterval=5000,durationProfiling=spaeml.SPAEML.performSteps,ioProfiling=true,sampleInterval=5000

spaeml.SPAEML.performSteps is the function that I am trying to profile.
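
(For context, a sketch of how that value plugs into a full spark-submit invocation, assembled from the commands earlier in this thread; the quotes are only shell hygiene around the embedded commas and equals signs, and the jar/arguments placeholder is left unfilled:)

spark-submit --executor-cores 8 --driver-memory 20G --executor-memory 20G \
  --conf "spark.executor.extraJavaOptions=-javaagent:/home/rvenka21/jvm-profiler-1.0.0.jar=reporter=com.uber.profiling.reporters.ConsoleOutputReporter,metricInterval=5000,durationProfiling=spaeml.SPAEML.performSteps,ioProfiling=true,sampleInterval=5000" \
  --class Main <application jar and arguments>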

ramshankar1990 commented 5 years ago

May I know if I am using the argument as expected?

ramshankar1990 commented 5 years ago

Hi, I ran my script in local mode and used another Java profiler to verify that the function is being called multiple times. I looked at the logs and found that the function is picked up by the Console Reporter. These are the instances where I see the tool finding the function:

Java Agent 1.0.0 premain args: reporter=com.uber.profiling.reporters.ConsoleOutputReporter,metricInterval=5000,durationProfiling=spaeml.SPAEML.performSteps,ioProfiling=true,sampleInterval=5000
1554397358234 com.uber.profiling.Arguments: Got argument value for reporter: com.uber.profiling.reporters.ConsoleOutputReporter
1554397358234 com.uber.profiling.Arguments: Got argument value for metricInterval: 5000
1554397358234 com.uber.profiling.Arguments: Got argument value for sampleInterval: 5000
1554397358236 com.uber.profiling.Arguments: Got argument value for durationProfiling: spaeml.SPAEML.performSteps
1554397358236 com.uber.profiling.Arguments: Got argument value for ioProfiling: true

1554397366557 com.uber.profiling.transformers.JavaAgentFileTransformer: Transformed class method: spaeml.SPAEML$.performSteps(org.apache.spark.SparkContext,org.apache.spark.rdd.RDD,org.apache.spark.broadcast.Broadcast,java.lang.String,spaeml.StepCollections,double,statistics.OLSRegression,int), durationProfiling: true, argumentProfiling: []

Apart from the above, the only instances of duration profiling I see are like the one below:

1554397366579 com.uber.profiling.transformers.JavaAgentFileTransformer: Transforming class: spaeml.SPAEML$$anonfun$1

There are similar instances of anonfun reported by the tool, but I have included just one line here as an example.

hiboyang commented 5 years ago

Hi Ram, these logs show that the profiler instrumented the method when its class was loaded into the JVM process. Could you check whether these methods are actually called while your application is running? Also, did you check both the driver and executor logs?

ramshankar1990 commented 5 years ago

Hi, I used the other Java profiler to confirm that the method is being called when the application runs. I did check the executor logs; the info above is from one of the executor logs. Another way I was able to confirm that the function is being called multiple times is by performing StackTrace sampling. I can see multiple instances in other executor logs showing profiling information for sub-functions that are called inside the function I am trying to profile. For example:

ConsoleOutputReporter - Stacktrace: {"role":"executor","stacktrace":["com.github.fommil.netlib.F2jBLAS.dgemm","breeze.linalg.operators.DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DMD_eq_DMD$.apply","breeze.linalg.operators.DenseMatrixMultiplyStuff$implOpMulMatrix_DMD_DMD_eq_DMD$.apply","breeze.linalg.ImmutableNumericOps$class.$times","breeze.linalg.DenseMatrix.$times","statistics.OLSRegression.","spaeml.SPAEML$.spaeml$SPAEML$$mapFunction$1","spaeml.SPAEML$$anonfun$11.apply","spaeml.SPAEML$$anonfun$11.apply","scala.collection.Iterator$$anon$11.next","scala.collection.Iterator$$anon$13.hasNext","scala.collection.Iterator$$anon$11.hasNext","scala.collection.Iterator$class.foreach","scala.collection.AbstractIterator.foreach","scala.collection.TraversableOnce$class.reduceLeft","scala.collection.AbstractIterator.reduceLeft","org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply","org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply","org.apache.spark.SparkContext$$anonfun$36.apply","org.apache.spark.SparkContext$$anonfun$36.apply","org.apache.spark.scheduler.ResultTask.runTask","org.apache.spark.scheduler.Task.run","org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply","org.apache.spark.util.Utils$.tryWithSafeFinally","org.apache.spark.executor.Executor$TaskRunner.run","java.util.concurrent.ThreadPoolExecutor.runWorker","java.util.concurrent.ThreadPoolExecutor$Worker.run","java.lang.Thread.run"],"endEpoch":1554397374261,"appId":null,"host":"comet-19-07.sdsc.edu","name":"6381@comet-19-07.sdsc.edu","processUuid":"096ea0b6-777d-424d-9e4b-b13d22740468","threadState":"RUNNABLE","count":1,"startEpoch":1554397369261,"threadName":"Executor task launch worker for task 373"}

mapFunction (the spaeml.SPAEML$.spaeml$SPAEML$$mapFunction$1 entry in the stack trace above) is a sub-function that gets called inside the function I am trying to profile.

ramshankar1990 commented 5 years ago

Please let me know if I can provide you with more information. The admins told me to email you the logs in case you want to look at them for further troubleshooting, as I was told not to post the full logs on a public forum.

hiboyang commented 5 years ago

Hi Ram, your profiler argument is "spaeml.SPAEML.performSteps", but your stacktrace log shows "spaeml.SPAEML$.spaeml$SPAEML$$mapFunction$1". They are different: the latter has a $ at the end of SPAEML. You could try tweaking the profiler argument by adding the $ sign.
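
(Editorial note: in Scala, a top-level object such as "object SPAEML" compiles to a JVM class named spaeml.SPAEML$, which is why the $ suffix matters to a JVM-level agent. One way to confirm the exact class and method names is to inspect the application jar with javap; a sketch using the jar path from this thread:)

# List performSteps as the JVM sees it inside the Scala object's backing class.
javap -classpath SPAEML-0.0.1-jar-with-dependencies.jar 'spaeml.SPAEML$' | grep performSteps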

ramshankar1990 commented 5 years ago

Hi, I am able to profile the method after adding the $ sign. The output I see from the Console Reporter is as follows:

ConsoleOutputReporter - MethodDuration: {"role":"driver","metricName":"duration.count","processName":"12140@comet-17-63.sdsc.edu","appId":null,"host":"comet-17-63","processUuid":"f57d11a0-3310-42f6-b415-1b0872fa5799","metricValue":1.0,"methodName":"performSteps","className":"spaeml.SPAEML$","epochMillis":1555435323752}
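
(For completeness, the corrected durationProfiling value implied by the fix above would presumably read as follows; this is a reconstruction, with only the $ added relative to the earlier command:)

durationProfiling=spaeml.SPAEML$.performSteps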

I greatly appreciate your help. Thank you.

hiboyang commented 5 years ago

That is great! Glad it is working for you now :) Also thanks for letting us know!

felixcheung commented 5 years ago

Should we add this to the documentation?

hiboyang commented 5 years ago

The issue here was that the configured value of durationProfiling did not match the actual method name. The documentation already mentions configuring a method name in durationProfiling, so we should be good here.