linkedin / dr-elephant

Dr. Elephant is a job and flow-level performance monitoring and tuning tool for Apache Hadoop and Apache Spark
Apache License 2.0
1.35k stars · 855 forks

Spark jobs not showing up on Dr Elephant UI #456

Open kartiknooli opened 5 years ago

kartiknooli commented 5 years ago

Hello, I am having a similar issue to the ones others have mentioned, but none of those tickets helped me resolve it. My Spark jobs won't show up on the Dr. Elephant UI; I can only see MapReduce jobs. I went through this thread but could not figure out where to find the Dr. Elephant logs for the Spark jobs. I am on EMR with Hadoop 2.7.3 and Spark 2.1.1. All the configs mentioned above exist in my cluster. I can see the running Spark job on the ResourceManager UI, as well as on the Spark history server once it completes.

```
spark.yarn.historyServer.address  ip-10-XX-XX-X.ec2.internal:18080
spark.eventLog.dir                hdfs:///var/log/spark/apps
```

Here is how my dr-elephant folder looks:

```
drwxr-xr-x 2 ec2-user ec2-user  4096 Oct 24 16:29 app-conf
drwxr-xr-x 2 ec2-user ec2-user  4096 Oct 17 22:29 bin
drwxr-xr-x 3 ec2-user ec2-user  4096 Oct 17 22:29 conf
-rwxr-xr-x 1 ec2-user ec2-user  1199 Oct 24 16:30 dr.log
drwxr-xr-x 2 ec2-user ec2-user 16384 Oct 17 22:29 lib
drwxr-xr-x 2 ec2-user ec2-user  4096 Oct 24 16:31 logs
-rwxr-xr-x 1 ec2-user ec2-user  2925 Oct 17 22:26 README.md
-rw-r--r-- 1 root     root         5 Oct 24 16:30 RUNNING_PID
drwxr-xr-x 3 ec2-user ec2-user  4096 Oct 17 22:29 scripts
drwxr-xr-x 3 ec2-user ec2-user  4096 Oct 17 22:29 share
```

```
$ echo $SPARK_HOME
/usr/lib/spark
$ echo $SPARK_CONF_DIR
/usr/lib/spark/conf
```

Am I missing something here? Please help.

thanks, Kartik.

ColinArmstrong commented 5 years ago

There is a logs directory one level above your dr-elephant folder that I didn't see you list:

$DR_ELEPHANT_DIR/../logs/elephant/dr_elephant.log

kartiknooli commented 5 years ago

Thanks @ColinArmstrong for the response. I did check the log. This time I reran another Spark job on the cluster and noticed that the Dr. Elephant UI labels it a Hadoop job and doesn't identify it as a Spark job. The dr_elephant.log file does not show any error messages. Is my understanding of how Dr. Elephant displays Spark jobs on the UI wrong?

When I filter the jobs on the UI by Job Type = Spark, it returns no results.

thanks, Kartik.

shahrukhkhan489 commented 5 years ago

Is HTTPS enabled on YARN? If HTTPS is not enabled, then use the steps below to get it working:

  1. Inject exports of SPARK_HOME and SPARK_CONF_DIR in ./bin/start.sh file.

  2. Make sure you have the Spark client installed as a component if you are using a vendor-specific distribution.

  3. Update the Spark fetcher configuration to com.linkedin.drelephant.spark.fetchers.SparkFetcher in the conf file app-conf/FetcherConf.xml. By default it is commented out.

This should get Dr. Elephant working against Spark Jobs.
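A minimal sketch of steps 1 and 2 above, assuming EMR-style paths (`/usr/lib/spark` and `/etc/spark/conf` are examples, not universal; adjust to your distribution):

```shell
# Step 1 (sketch): exports to inject near the top of ./bin/start.sh.
# These paths are EMR-style examples -- adjust to your install.
export SPARK_HOME=/usr/lib/spark
export SPARK_CONF_DIR=/etc/spark/conf

# Step 2 (sketch): quick check that a Spark client is on the PATH.
command -v spark-submit || echo "no spark client found"
```

Step 3 is an edit to app-conf/FetcherConf.xml: uncomment the fetcher block whose classname is com.linkedin.drelephant.spark.fetchers.SparkFetcher.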

lubomir-angelov commented 5 years ago

@kartiknooli

To find dr_elephant.log, use `locate dr_elephant.log`.

In my case to start getting Spark jobs I had to add the following in app-conf/FetcherConf.xml

```xml
<fetcher>
  <applicationtype>spark</applicationtype>
  <classname>com.linkedin.drelephant.spark.fetchers.SparkFetcher</classname>
  <params>
    <use_rest_for_eventlogs>true</use_rest_for_eventlogs>
    <should_process_logs_locally>true</should_process_logs_locally>
    <event_log_dir>webhdfs:///spark-history</event_log_dir>
  </params>
</fetcher>
```

Our Spark event log dir is configured as hdfs:///spark-history, so we added `<event_log_dir>webhdfs:///spark-history</event_log_dir>`.

And comment out the FSFetcher block:

```xml
<!--
<fetcher>
  <applicationtype>spark</applicationtype>
  <classname>com.linkedin.drelephant.spark.fetchers.FSFetcher</classname>
</fetcher>
-->
```

More info at #206
kartiknooli commented 5 years ago

@shahrukhkhan489 and @lubomir-angelov thanks for the response.

I tried making the suggested changes.

  1. Inject exports of SPARK_HOME and SPARK_CONF_DIR in ./bin/start.sh file. I hope you meant the following:

    export SPARK_HOME=/usr/lib/spark
    export SPARK_CONF_DIR=/etc/spark/conf

    Please correct me if I am wrong.

  2. Make sure you have the Spark client installed as a component if you are using a vendor-specific distribution. We have the Spark client bootstrapped with EMR:

    
    Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
    /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Python version 2.7.12 (default, Sep 1 2016 22:14:00) SparkSession available as 'spark'.

  3. Updated the Spark fetcher configuration to the following:
<fetcher>
    <applicationtype>spark</applicationtype>
    <classname>com.linkedin.drelephant.spark.fetchers.SparkFetcher</classname>
    <params>
      <use_rest_for_eventlogs>true</use_rest_for_eventlogs>
      <should_process_logs_locally>true</should_process_logs_locally>
    </params>
  </fetcher>

I tried with and without adding the HDFS path for the event logs; neither worked.

Here is the error message i got from the logs:

11-26-2018 19:24:35 INFO  [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Analyzing SPARK application_1520505558307_35023
11-26-2018 19:24:35 INFO  [ForkJoinPool-1-worker-9] com.linkedin.drelephant.spark.fetchers.SparkRestClient : calling REST API at http://hostname:18080/api/v1/applications/application_1520505558307_35027
11-26-2018 19:24:35 INFO  [dr-el-executor-thread-2] com.linkedin.drelephant.spark.fetchers.SparkFetcher : Fetching data for application_1520505558307_35023
11-26-2018 19:24:35 INFO  [ForkJoinPool-1-worker-5] com.linkedin.drelephant.spark.fetchers.SparkRestClient : calling REST API at http://hostname:18080/api/v1/applications/application_1520505558307_35023
11-26-2018 19:24:35 ERROR [ForkJoinPool-1-worker-9] com.linkedin.drelephant.spark.fetchers.SparkRestClient : error reading applicationInfo http:hostname:18080/api/v1/applications/application_1520505558307_35027. Exception Message = HTTP 404 Not Found
11-26-2018 19:24:35 WARN  [dr-el-executor-thread-1] com.linkedin.drelephant.spark.fetchers.SparkFetcher : Failed fetching data for application_1520505558307_35027. I will retry after some time! Exception Message is: HTTP 404 Not Found

Appreciate your help with this.

lubomir-angelov commented 5 years ago

It looks like your Spark history server is not responding.

I think you need a patched version of SHS to get Spark2 jobs registered. https://github.com/linkedin/dr-elephant/issues/327


shahrukhkhan489 commented 5 years ago

@kartiknooli The 404 error indicates that your event logs have been rolled out (cleaned up by the history server). This might not be the case for all Spark applications.

error reading applicationInfo http:hostname:18080/api/v1/applications/application_1520505558307_35027. Exception Message = HTTP 404 Not Found

Try opening the same link in a browser; you should see the same 404 there: http://hostname:18080/api/v1/applications/application_1520505558307_35027
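For reference, the URL the fetcher (and a browser check) hits is just the history server REST endpoint plus the application id; a small sketch, where `hostname` is a placeholder from the redacted logs above:

```shell
# Assemble the Spark History Server REST URL that SparkRestClient calls.
# SHS_HOST is a placeholder -- the logs above redact the real hostname.
SHS_HOST=hostname
SHS_PORT=18080
APP_ID=application_1520505558307_35027
URL="http://${SHS_HOST}:${SHS_PORT}/api/v1/applications/${APP_ID}"
echo "$URL"
# A 404 from this URL means the history server has no record of the
# application, e.g. its event log was never written or was cleaned up.
```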

fusonghe commented 5 years ago

Spark jobs don't show up on the Dr. Elephant web UI. I am on dr-elephant 2.1.7, Hadoop 3.0.0, Spark 1.6. In app-conf/FetcherConf.xml the Spark fetcher is set to org.apache.spark.deploy.history.SparkFSFetcher, with an event log size limit of 100, event log dir /spark2-history, and log extension .snappy. @shahrukhkhan489