spektom / spark-flamegraph

Easy CPU Profiling for Apache Spark applications
Apache License 2.0
45 stars 12 forks

Can this be used with gcp ? #8

Open normalscene opened 4 years ago

normalscene commented 4 years ago

Hi, I was wondering if this could be used with google cloud platform?

spektom commented 4 years ago

Hi. There shouldn't be any issues - the tool is supposed to work with any Spark distribution. Please try it and report back if you run into anything :)

normalscene commented 4 years ago

Hi,

I had to replace the stock 'spark-submit' with your 'spark-submit-flamegraph'. Currently it doesn't work and seems to have a couple of issues.

  1. It complained about $HOME not being set on line 233. I fixed it by defining HOME inside the script.
  2. After that it gets stuck on line 31 (trying to find a free port) and just keeps printing the output shown below.

Do you have any suggestions?

gaurav_arya_figmd_com@deltest-m:~$ ls -lrth /usr/bin/spark-submit
lrwxrwxrwx 1 root root 51 Feb  7 11:43 /usr/bin/spark-submit -> /home/gaurav_arya_figmd_com/spark-submit-flamegraph
gaurav_arya_figmd_com@deltest-m:~$ 

gaurav_arya_figmd_com@deltest-m:~$ gcloud dataproc jobs submit spark --project bda-sandbox --cluster deltest --region us-central1  --properties spark.submit.deployMode=cluster,spark.dynamicAllocation.enabled=false,spark.yarn.maxAppAttempts=1,spark.driver.memory=4G,spark.driver.memoryOverhead=1024m,spark.executor.instances=3,spark.executor.memoryOverhead=1024m,spark.executor.memory=4G,spark.executor.cores=2,spark.driver.cores=1,spark.driver.maxResultSize=2g,spark.extraListeners=com.qubole.sparklens.QuboleJobListener --class com.figmd.janus.deletion.dataCleanerMain --jars=gs://cdrmigration/jars/newDataCleaner.jar,gs://spark-lib/bigquery/spark-bigquery-latest.jar,gs://cdrmigration/jars/jdbc-postgresql.jar,gs://cdrmigration/jars/postgresql-42.2.5.jar,gs://cdrmigration/jars/sparklens_2.11-0.3.1.jar  -- cdr 289 PatientEthnicity,PatientRace bda-sandbox CDRDELTEST 20200121 0001
Job [b28c81b219b54ebbafaf2d15ff7e8549] submitted.
Waiting for job output...
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
/usr/bin/spark-submit: line 31: echo: write error: Broken pipe
.
.
.
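For reference, the broken-pipe loop suggests the port probe pipes echo into an external tool that may be missing on the node. A probe that relies only on bash built-ins could look like the following sketch (illustrative only, not the script's actual code; the port range is an assumption):

```shell
#!/bin/bash
# Sketch of a free-port finder that uses bash's /dev/tcp pseudo-device
# instead of piping into telnet or nc.
find_free_port() {
  local port
  while :; do
    port=$(( (RANDOM % 16384) + 49152 ))   # pick from the ephemeral range
    # A successful connect means something is listening; a failure means free.
    if ! (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
      echo "${port}"
      return 0
    fi
  done
}

find_free_port
```

Because the connect attempt happens in a subshell, no file descriptor cleanup is needed in the parent shell.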
spektom commented 4 years ago

Looks like a bug. I don't have a way to recreate this right now. If you want to help debug this issue, please:

Thanks!

normalscene commented 4 years ago

Alright - so I finally figured out the issue - I had to install a couple of things, like telnet and pip, and I wasn't aware that the system didn't have them. I got a warning for pip but not for telnet. Maybe you could add a check for required binaries, so that if they are not found the user gets a proper indication. Just a suggestion.
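Such a preflight check could be as simple as the following sketch (the exact list of required tools here is my assumption, not taken from the script):

```shell
#!/bin/bash
# Sketch: fail fast with a clear message when a required binary is missing.
require() {
  local cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "ERROR: required command '$cmd' not found in PATH" >&2
      return 1
    fi
  done
}

# Example usage near the top of the script (tool names are illustrative):
require sh sed || exit 1
```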

So after fixing all the minor issues, it errors out with "Couldn't start InfluxDB!".

Question: Is there any additional logging, apart from ~/.spark-flamegraph, that would help tackle the issue below?

[2020-02-07T12:18:20.1581077900] Installing dependencies
[2020-02-07T12:18:22.1581077902] Starting InfluxDB
[2020-02-07T12:18:22.1581077902] InfluxDB starting at :48137
ERROR: Couldn't start InfluxDB!
[2020-02-07T12:18:32.1581077912] Spark has exited with bad exit code (1)
[2020-02-07T12:18:32.1581077912] Collecting profiling metrics
[2020-02-07T12:18:32.1581077912] No profiling metrics were recorded!
[2020-02-07T12:18:32.1581077912] Spark has exited with bad exit code (1)
spektom commented 4 years ago

There's a log file called influxdb.log - can you look there, please?

spektom commented 4 years ago

Also, if you've replaced original spark-submit command with this script, make sure to set SPARK_CMD to the original version, because it's still needed:

mv /usr/bin/spark-submit /usr/bin/spark-submit-orig
cp spark-submit-flamegraph /usr/bin/spark-submit
SPARK_CMD=spark-submit-orig spark-submit ...
normalscene commented 4 years ago

influxdb.log

Unfortunately, there are no logs inside that directory. I have checked thoroughly. :(

gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$ pwd
/home/gaurav_arya_figmd_com/.spark-flamegraph/influxdb
gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$ find -name "influxdb.log"
gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$ 
normalscene commented 4 years ago

Also, if you've replaced original spark-submit command with this script, make sure to set SPARK_CMD to the original version, because it's still needed:

mv /usr/bin/spark-submit /usr/bin/spark-submit-orig
cp spark-submit-flamegraph /usr/bin/spark-submit
SPARK_CMD=spark-submit-orig spark-submit ...

Let me try this. Could you please confirm the third step, i.e. the SPARK_CMD one? It's not entirely clear to me. I will give it a try right now. Do I need to make the change inside the spark-submit-flamegraph script?

spektom commented 4 years ago

influxdb.log is created in the current directory - sorry for misleading you. SPARK_CMD is a variable that points to the original spark-submit script. By default it's set to spark-submit, but it could be spark-shell, or spark-submit-orig if you've moved the original away.
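The fallback behavior described above is the standard shell parameter-expansion idiom; a minimal illustration (variable name as in the script, the rest is a sketch):

```shell
#!/bin/bash
# Use the caller's SPARK_CMD if it is set; otherwise default to "spark-submit".
SPARK_CMD=${SPARK_CMD:-spark-submit}
echo "will invoke: ${SPARK_CMD}"
```

So after renaming the original binary, running `SPARK_CMD=spark-submit-orig spark-submit ...` makes the wrapper call the original spark-submit instead of recursing into itself.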

normalscene commented 4 years ago

Alright, I have gone ahead and made a change inside your script, as shown below:

SPARK_CMD=${SPARK_CMD:-spark-submit-orig}

But the job has failed. Here are some logs.

Hadoop logs

Log Type: prelaunch.err

Log Upload Time: Fri Feb 07 12:38:54 +0000 2020

Log Length: 0

Log Type: prelaunch.out

Log Upload Time: Fri Feb 07 12:38:54 +0000 2020

Log Length: 70

Setting up env variables
Setting up job resources
Launching container

Log Type: stderr

Log Upload Time: Fri Feb 07 12:38:54 +0000 2020

Log Length: 119

Error opening zip file or JAR manifest missing : /home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar

Log Type: stdout

Log Upload Time: Fri Feb 07 12:38:54 +0000 2020

Log Length: 84

Error occurred during initialization of VM
agent library failed to init: instrument

Command line logs

gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$ time { gcloud dataproc jobs submit spark --project bda-sandbox --cluster deltest --region us-central1  --properties spark.submit.deployMode=cluster,spark.dynamicAllocation.enabled=false,spark.yarn.maxAppAttempts=1,spark.driver.memory=4G,spark.driver.memoryOverhead=1024m,spark.executor.instances=3,spark.executor.memoryOverhead=1024m,spark.executor.memory=4G,spark.executor.cores=2,spark.driver.cores=1,spark.driver.maxResultSize=2g,spark.extraListeners=com.qubole.sparklens.QuboleJobListener --class com.figmd.janus.deletion.dataCleanerMain --jars=gs://cdrmigration/jars/newDataCleaner.jar,gs://spark-lib/bigquery/spark-bigquery-latest.jar,gs://cdrmigration/jars/jdbc-postgresql.jar,gs://cdrmigration/jars/postgresql-42.2.5.jar,gs://cdrmigration/jars/sparklens_2.11-0.3.1.jar  -- cdr 289 PatientEthnicity,PatientRace bda-sandbox CDRDELTEST 20200121 0001 2>&1 | tee log ; }
tee: log: Permission denied
Job [47a6046ef73940ee9560d2b56b0a404c] submitted.
Waiting for job output...
[2020-02-07T12:38:42.1581079122] Installing dependencies
[2020-02-07T12:38:44.1581079124] Starting InfluxDB
[2020-02-07T12:38:44.1581079124] InfluxDB starting at :48081
[2020-02-07T12:38:46.1581079126] Executing: spark-submit-orig --jars /home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/newDataCleaner.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/spark-bigquery-latest.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/jdbc-postgresql.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/postgresql-42.2.5.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/sparklens_2.11-0.3.1.jar --driver-java-options -javaagent:/home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar=server=10.128.0.31,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark --conf spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=10.128.0.31,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark --conf spark.driver.cores=1 --conf spark.driver.maxResultSize=2g --conf spark.driver.memory=4G --conf spark.driver.memoryOverhead=1024m --conf spark.dynamicAllocation.enabled=false --conf spark.executor.cores=2 --conf spark.executor.instances=3 --conf spark.executor.memory=4G --conf spark.executor.memoryOverhead=1024m --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener --conf spark.submit.deployMode=cluster --conf spark.yarn.maxAppAttempts=1 --conf spark.yarn.tags=dataproc_hash_55904610-b3ad-3c58-9ab3-638a84e7c4db,dataproc_job_47a6046ef73940ee9560d2b56b0a404c,dataproc_master_index_0,dataproc_uuid_bb5702d6-bbab-36d1-8fc4-c4aa06211b89 --class com.figmd.janus.deletion.dataCleanerMain /tmp/47a6046ef73940ee9560d2b56b0a404c/dataproc-empty-jar-1581079121265.jar cdr 289 PatientEthnicity,PatientRace bda-sandbox CDRDELTEST 20200121 0001
20/02/07 12:38:49 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at deltest-m/10.128.0.31:8032
20/02/07 12:38:49 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at deltest-m/10.128.0.31:10200
20/02/07 12:38:52 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1581075454418_0006
Exception in thread "main" org.apache.spark.SparkException: Application application_1581075454418_0006 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1166)
    at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1521)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[2020-02-07T12:38:54.1581079134] Spark has exited with bad exit code (1)
[2020-02-07T12:38:54.1581079134] Collecting profiling metrics
[2020-02-07T12:38:54.1581079134] No profiling metrics were recorded!
ERROR: (gcloud.dataproc.jobs.submit.spark) Job [47a6046ef73940ee9560d2b56b0a404c] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at 'https://some-gs-bucket-location-for-logs?project=some-project&region=some-region' and in 'gs://some-gs-bucket-location'.

real    0m17.140s
user    0m0.535s
sys 0m0.071s
gaurav_arya_figmd_com@deltest-m:~/.spark-flamegraph/influxdb$ 

gcloud logs

gaurav_arya_figmd_com@deltest-m:~$ cat ./.config/gcloud/logs/2020.02.07/12.38.39.592453.log
2020-02-07 12:38:39,593 DEBUG    root            Loaded Command Group: [u'gcloud', u'dataproc']
2020-02-07 12:38:39,594 DEBUG    root            Loaded Command Group: [u'gcloud', u'dataproc', u'jobs']
2020-02-07 12:38:39,657 DEBUG    root            Loaded Command Group: [u'gcloud', u'dataproc', u'jobs', u'submit']
2020-02-07 12:38:39,660 DEBUG    root            Loaded Command Group: [u'gcloud', u'dataproc', u'jobs', u'submit', u'spark']
2020-02-07 12:38:39,663 DEBUG    root            Running [gcloud.dataproc.jobs.submit.spark] with arguments: [--class: "com.figmd.janus.deletion.dataCleanerMain", --cluster: "deltest", --jars: "[u'gs://cdrmigration/jars/newDataCleaner.jar', u'gs://spark-lib/bigquery/spark-bigquery-latest.jar', u'gs://cdrmigration/jars/jdbc-postgresql.jar', u'gs://cdrmigration/jars/postgresql-42.2.5.jar', u'gs://cdrmigration/jars/sparklens_2.11-0.3.1.jar']", --project: "bda-sandbox", --properties: "OrderedDict([(u'spark.submit.deployMode', u'cluster'), (u'spark.dynamicAllocation.enabled', u'false'), (u'spark.yarn.maxAppAttempts', u'1'), (u'spark.driver.memory', u'4G'), (u'spark.driver.memoryOverhead', u'1024m'), (u'spark.executor.instances', u'3'), (u'spark.executor.memoryOverhead', u'1024m'), (u'spark.executor.memory', u'4G'), (u'spark.executor.cores', u'2'), (u'spark.driver.cores', u'1'), (u'spark.driver.maxResultSize', u'2g'), (u'spark.extraListeners', u'com.qubole.sparklens.QuboleJobListener')])", --region: "us-central1"]
2020-02-07 12:38:39,929 INFO     ___FILE_ONLY___ Job [47a6046ef73940ee9560d2b56b0a404c] submitted.

2020-02-07 12:38:39,929 INFO     ___FILE_ONLY___ Waiting for job output...

2020-02-07 12:38:44,317 INFO     ___FILE_ONLY___ [2020-02-07T12:38:42.1581079122] Installing dependencies

2020-02-07 12:38:45,501 INFO     ___FILE_ONLY___ [2020-02-07T12:38:44.1581079124] Starting InfluxDB
[2020-02-07T12:38:44.1581079124] InfluxDB starting at :48081

2020-02-07 12:38:46,618 INFO     ___FILE_ONLY___ [2020-02-07T12:38:46.1581079126] Executing: spark-submit-orig --jars /home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/newDataCleaner.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/spark-bigquery-latest.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/jdbc-postgresql.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/postgresql-42.2.5.jar,/tmp/47a6046ef73940ee9560d2b56b0a404c/sparklens_2.11-0.3.1.jar --driver-java-options -javaagent:/home/gaurav_arya_figmd_com/.spark-flamegraph/statsd-jvm-profiler.jar=server=10.128.0.31,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark --conf spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=10.128.0.31,port=48081,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=sparkapp,tagMapping=spark --conf spark.driver.cores=1 --conf spark.driver.maxResultSize=2g --conf spark.driver.memory=4G --conf spark.driver.memoryOverhead=1024m --conf spark.dynamicAllocation.enabled=false --conf spark.executor.cores=2 --conf spark.executor.instances=3 --conf spark.executor.memory=4G --conf spark.executor.memoryOverhead=1024m --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener --conf spark.submit.deployMode=cluster --conf spark.yarn.maxAppAttempts=1 --conf spark.yarn.tags=dataproc_hash_55904610-b3ad-3c58-9ab3-638a84e7c4db,dataproc_job_47a6046ef73940ee9560d2b56b0a404c,dataproc_master_index_0,dataproc_uuid_bb5702d6-bbab-36d1-8fc4-c4aa06211b89 --class com.figmd.janus.deletion.dataCleanerMain /tmp/47a6046ef73940ee9560d2b56b0a404c/dataproc-empty-jar-1581079121265.jar cdr 289 PatientEthnicity,PatientRace bda-sandbox CDRDELTEST 20200121 0001

2020-02-07 12:38:49,876 INFO     ___FILE_ONLY___ 20/02/07 12:38:49 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at deltest-m/10.128.0.31:8032

2020-02-07 12:38:50,982 INFO     ___FILE_ONLY___ 20/02/07 12:38:49 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at deltest-m/10.128.0.31:10200

2020-02-07 12:38:54,249 INFO     ___FILE_ONLY___ 20/02/07 12:38:52 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1581075454418_0006

2020-02-07 12:38:55,360 INFO     ___FILE_ONLY___ Exception in thread "main" org.apache.spark.SparkException: Application application_1581075454418_0006 finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:1166)
    at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1521)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:890)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:192)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:217)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[2020-02-07T12:38:54.1581079134] Spark has exited with bad exit code (1)
[2020-02-07T12:38:54.1581079134] Collecting profiling metrics
[2020-02-07T12:38:54.1581079134] No profiling metrics were recorded!

2020-02-07 12:38:56,441 DEBUG    root            (gcloud.dataproc.jobs.submit.spark) Job [47a6046ef73940ee9560d2b56b0a404c] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at 'https://console.cloud.google.com/dataproc/jobs/47a6046ef73940ee9560d2b56b0a404c?project=bda-sandbox&region=us-central1' and in 'gs://dataproc-ded4155e-8ecc-4627-aab5-15befb5c5e37-us-central1/google-cloud-dataproc-metainfo/dec63309-39e1-4c03-84a4-ccecd8b6a54b/jobs/47a6046ef73940ee9560d2b56b0a404c/driveroutput'.
Traceback (most recent call last):
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 981, in Execute
    resources = calliope_command.Run(cli=self, args=args)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 807, in Run
    resources = command_instance.Run(args)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/command_lib/dataproc/jobs/submitter.py", line 102, in Run
    stream_driver_log=True)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/api_lib/dataproc/util.py", line 441, in WaitForJobTermination
    job_ref.jobId, job.status.details))
JobError: Job [47a6046ef73940ee9560d2b56b0a404c] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at 'https://console.cloud.google.com/dataproc/jobs/47a6046ef73940ee9560d2b56b0a404c?project=bda-sandbox&region=us-central1' and in 'gs://dataproc-ded4155e-8ecc-4627-aab5-15befb5c5e37-us-central1/google-cloud-dataproc-metainfo/dec63309-39e1-4c03-84a4-ccecd8b6a54b/jobs/47a6046ef73940ee9560d2b56b0a404c/driveroutput'.
2020-02-07 12:38:56,442 ERROR    root            (gcloud.dataproc.jobs.submit.spark) Job [47a6046ef73940ee9560d2b56b0a404c] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at 'https://console.cloud.google.com/dataproc/jobs/47a6046ef73940ee9560d2b56b0a404c?project=bda-sandbox&region=us-central1' and in 'gs://dataproc-ded4155e-8ecc-4627-aab5-15befb5c5e37-us-central1/google-cloud-dataproc-metainfo/dec63309-39e1-4c03-84a4-ccecd8b6a54b/jobs/47a6046ef73940ee9560d2b56b0a404c/driveroutput'.
gaurav_arya_figmd_com@deltest-m:~$  
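A note on the "Error opening zip file or JAR manifest missing" line in the stderr above: in cluster deploy mode the driver runs inside a YARN container, so a -javaagent path under the submitting user's home directory will not exist there. One possible direction (a sketch, not a confirmed fix; the app jar name is a placeholder) is to ship the agent jar with --files so that a relative -javaagent path resolves in the container's working directory:

```shell
# Hypothetical sketch: distribute the agent jar to every container with --files,
# then reference it by its bare file name in both driver and executor options.
spark-submit \
  --deploy-mode cluster \
  --files "$HOME/.spark-flamegraph/statsd-jvm-profiler.jar" \
  --conf spark.driver.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=10.128.0.31,port=48081,reporter=InfluxDBReporter \
  --conf spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler.jar=server=10.128.0.31,port=48081,reporter=InfluxDBReporter \
  --class com.figmd.janus.deletion.dataCleanerMain app.jar
```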
normalscene commented 4 years ago

influxdb.log

That's alright, Michael. No issues. :)

Unfortunately, there is no log with that name. I have pasted additional logs (whatever I could find and access at the moment). If something comes up, please let me know; if something is missing, please also let me know and I will try to get it as soon as possible.

I am willing to help debug this issue, as I really want that flamegraph.

normalscene commented 4 years ago

@spektom

Hello Michael. I am just following up with you on this. Do you have any suggestions to troubleshoot this any further? Thank you in advance.

Cheers, Gaurav

spektom commented 4 years ago

I haven't had any ideas yet. I'll try to create an evaluation account on GCP and debug it there.
