linkedin / dr-elephant

Dr. Elephant is a job and flow-level performance monitoring and tuning tool for Apache Hadoop and Apache Spark
Apache License 2.0

Dr.Elephant not analysing jobs in EMR core node #431

Open kannan-zeotap opened 6 years ago

kannan-zeotap commented 6 years ago

Hello,

I'm trying to install and configure Dr. Elephant on one of our EMR core nodes. The core node doesn't have Spark or Oozie installed, as they are installed on the master nodes. On our platform, we run Spark jobs scheduled via Oozie coordinators every day. We initially configured Dr. Elephant on the master node, where it worked fine: each day's jobs were captured and analysed correctly.

However, when configured on a core node, the Dr. Elephant service runs but doesn't analyse any jobs.

I copied all the confs and jars from the master to the core node and set HADOOP_HOME and SPARK_HOME accordingly.

Below is the application log.

09-04-2018 08:31:59 WARN [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Add analytic job id [application_1535464521333_3324] into the retry list.
09-04-2018 08:31:59 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Retry queue size is 7
09-04-2018 08:31:59 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Analyzing MAPREDUCE application_1535464521333_3313
09-04-2018 08:31:59 ERROR [dr-el-executor-thread-0] com.linkedin.drelephant.ElephantRunner : Could not invoke class com.linkedin.drelephant.schedulers.OozieScheduler
09-04-2018 08:31:59 ERROR [dr-el-executor-thread-0] com.linkedin.drelephant.ElephantRunner : java.lang.RuntimeException: Could not invoke class com.linkedin.drelephant.schedulers.OozieScheduler
    at com.linkedin.drelephant.util.InfoExtractor.getSchedulerInstance(InfoExtractor.java:101)
    at com.linkedin.drelephant.util.InfoExtractor.loadInfo(InfoExtractor.java:126)
    at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:327)
    at com.linkedin.drelephant.ElephantRunner$ExecutorJob.run(ElephantRunner.java:175)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at com.linkedin.drelephant.util.InfoExtractor.getSchedulerInstance(InfoExtractor.java:84)
    ... 8 more
Caused by: java.lang.RuntimeException: Failed fetching Oozie workflow 0000965-180827104704635-oozie-oozi-W info
    at com.linkedin.drelephant.schedulers.OozieScheduler.loadInfo(OozieScheduler.java:113)
    at com.linkedin.drelephant.schedulers.OozieScheduler.<init>(OozieScheduler.java:79)
    at com.linkedin.drelephant.schedulers.OozieScheduler.<init>(OozieScheduler.java:64)
    ... 13 more
Caused by: IO_ERROR : java.io.IOException: Error while connecting Oozie server. No of retries = 4. Exception = Connection refused (Connection refused)
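In a chained Java stack trace like the one above, the last "Caused by:" entry is usually the most informative part; here it points at a refused connection to the Oozie server. As a rough sketch (not part of Dr. Elephant; the function name `root_cause` is made up for illustration), the innermost cause can be pulled out of a pasted trace like this:

```python
import re

def root_cause(trace):
    """Return the message of the last 'Caused by:' entry in a Java stack trace.

    Works on both line-broken and flattened (single-line) traces.
    Returns None when the trace has no chained cause.
    """
    # Everything after the final 'Caused by:' belongs to the innermost exception.
    pieces = re.split(r"Caused by:\s*", trace)
    if len(pieces) == 1:
        return None
    last = pieces[-1]
    # Cut off stack frames, which look like ' at pkg.Class.method(File.java:NN)'.
    return re.split(r"\s+at\s+[\w.$]+\(", last)[0].strip()
```

Passing the application log above to `root_cause` should return the `IO_ERROR : java.io.IOException: Error while connecting Oozie server...` message, which is the line worth searching for.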

And the dr.log:

[hadoop@ip-10-40-12-181 dr-elephant-2.1.7]$ tailf dr.log
SLF4J: Found binding in [jar:file:/opt/dr-elephant-master/dist/dr-elephant-2.1.7/lib/ch.qos.logback.logback-classic-1.0.13.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/dr-elephant-master/dist/dr-elephant-2.1.7/lib/org.slf4j.slf4j-simple-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/dr-elephant-master/dist/dr-elephant-2.1.7/lib/org.slf4j.slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
[info] play - database [default] connected at jdbc:mysql://ireland-all-eu-daap-drelephant-zt.ctw8vnsdkjzr.eu-west-1.rds.amazonaws.com/drelephant?characterEncoding=UTF-8
[info] application - Starting Application...
[info] play - Application started (Prod)
[info] play - Listening for HTTP on /0:0:0:0:0:0:0:0:9000
Connection exception has occurred [ java.net.ConnectException Connection refused (Connection refused) ]. Trying after 1 sec. Retry count = 1
Connection exception has occurred [ java.net.ConnectException Connection refused (Connection refused) ]. Trying after 2 sec. Retry count = 2
Connection exception has occurred [ java.net.ConnectException Connection refused (Connection refused) ]. Trying after 4 sec. Retry count = 3

kannan-zeotap commented 6 years ago

Update:

I updated the URL of the Oozie host in app-conf/SchedulerConf.xml, replacing localhost with the master node's IP.

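<scheduler>
  <name>oozie</name>
  <classname>com.linkedin.drelephant.schedulers.OozieScheduler</classname>
  <params>
    <oozie_api_url>http://localhost:11000/oozie</oozie_api_url>
  </params>
</scheduler>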
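For anyone hitting the same "Connection refused" error: on a core node nothing listens on localhost:11000, because Oozie runs on the master. A quick way to check whether a given Oozie URL is reachable from the node running Dr. Elephant is a plain TCP probe. This is only a minimal sketch; the helper names `endpoint_from_url` and `can_connect` are made up for illustration, and a successful connection only shows that the port is open, not that the Oozie API itself is healthy.

```python
import socket
from urllib.parse import urlsplit

def endpoint_from_url(url):
    """Return (host, port) for an http(s) URL, filling in the scheme's default port."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False

if __name__ == "__main__":
    # The URL from the config above; swap localhost for the master node's address.
    url = "http://localhost:11000/oozie"
    host, port = endpoint_from_url(url)
    print(f"{url} -> {host}:{port} reachable: {can_connect(host, port)}")
```

Running this on the core node against the localhost URL and then against the master node's address should show the difference directly.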

Now the Oozie workflows are listed in the Dr. Elephant dashboard, but the Spark jobs are still not captured.

20100507 commented 6 years ago

Which Spark version are you using?