Open kannan-zeotap opened 6 years ago
Update:
I updated the URL of the Oozie host in app-conf/SchedulerConf.xml, replacing localhost with the master node IP.
Now the Oozie workflows are listed in the Dr. Elephant dashboard, but the Spark jobs are not captured.
I don't know if you use the spark version?
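For anyone hitting the same localhost problem from the update above, here is a minimal Python sketch that flags any value in a scheduler config that still points at a loopback host. The sample XML, including the `oozie_api_url` element name and port 11000, is a hypothetical illustration and not a copy of a real SchedulerConf.xml; `find_loopback_values` is a helper name made up for this example.

```python
import xml.etree.ElementTree as ET

# Host strings that only resolve to the machine the service runs on.
LOOPBACK_HOSTS = ("localhost", "127.0.0.1")


def find_loopback_values(xml_text):
    """Return (tag, value) pairs for elements whose text mentions a loopback host."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        text = (elem.text or "").strip()
        if any(host in text for host in LOOPBACK_HOSTS):
            hits.append((elem.tag, text))
    return hits


# Hypothetical sample for illustration only (assumed layout and element names).
SAMPLE_CONF = """
<schedulers>
  <scheduler>
    <name>oozie</name>
    <params>
      <oozie_api_url>http://localhost:11000/oozie</oozie_api_url>
    </params>
  </scheduler>
</schedulers>
"""

if __name__ == "__main__":
    for tag, value in find_loopback_values(SAMPLE_CONF):
        print(f"{tag}: {value}")
```

To check a real file, read app-conf/SchedulerConf.xml into a string and pass it to `find_loopback_values`; an empty result means no element text mentions localhost.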
Hello,
I'm trying to install and configure Dr. Elephant on one of the EMR core nodes. The core node doesn't have Spark or Oozie installed, as they're installed on the master nodes. On our platform, we run Spark jobs scheduled via Oozie coordinators every day. Initially we configured Dr. Elephant on the master node; it worked fine, and the daily jobs were captured and analysed by Dr. Elephant perfectly.
But when configured on the core node, the drelephant service is running but it's not analysing any jobs.
I copied all the confs and jars from the master to the core nodes and set HADOOP_HOME and SPARK_HOME accordingly.
Below is the application log.
09-04-2018 08:31:59 WARN [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Add analytic job id [application_1535464521333_3324] into the retry list.
09-04-2018 08:31:59 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Retry queue size is 7
09-04-2018 08:31:59 INFO [dr-el-executor-thread-2] com.linkedin.drelephant.ElephantRunner : Analyzing MAPREDUCE application_1535464521333_3313
09-04-2018 08:31:59 ERROR [dr-el-executor-thread-0] com.linkedin.drelephant.ElephantRunner : Could not invoke class com.linkedin.drelephant.schedulers.OozieScheduler
09-04-2018 08:31:59 ERROR [dr-el-executor-thread-0] com.linkedin.drelephant.ElephantRunner : java.lang.RuntimeException: Could not invoke class com.linkedin.drelephant.schedulers.OozieScheduler
    at com.linkedin.drelephant.util.InfoExtractor.getSchedulerInstance(InfoExtractor.java:101)
    at com.linkedin.drelephant.util.InfoExtractor.loadInfo(InfoExtractor.java:126)
    at com.linkedin.drelephant.analysis.AnalyticJob.getAnalysis(AnalyticJob.java:327)
    at com.linkedin.drelephant.ElephantRunner$ExecutorJob.run(ElephantRunner.java:175)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at com.linkedin.drelephant.util.InfoExtractor.getSchedulerInstance(InfoExtractor.java:84)
    ... 8 more
Caused by: java.lang.RuntimeException: Failed fetching Oozie workflow 0000965-180827104704635-oozie-oozi-W info
    at com.linkedin.drelephant.schedulers.OozieScheduler.loadInfo(OozieScheduler.java:113)
    at com.linkedin.drelephant.schedulers.OozieScheduler.<init>(OozieScheduler.java:79)
    at com.linkedin.drelephant.schedulers.OozieScheduler.<init>(OozieScheduler.java:64)
    ... 13 more
Caused by: IO_ERROR : java.io.IOException: Error while connecting Oozie server. No of retries = 4. Exception = Connection refused (Connection refused)
And the dr.log:
[hadoop@ip-10-40-12-181 dr-elephant-2.1.7]$ tailf dr.log
SLF4J: Found binding in [jar:file:/opt/dr-elephant-master/dist/dr-elephant-2.1.7/lib/ch.qos.logback.logback-classic-1.0.13.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/dr-elephant-master/dist/dr-elephant-2.1.7/lib/org.slf4j.slf4j-simple-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/dr-elephant-master/dist/dr-elephant-2.1.7/lib/org.slf4j.slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
[info] play - database [default] connected at jdbc:mysql://ireland-all-eu-daap-drelephant-zt.ctw8vnsdkjzr.eu-west-1.rds.amazonaws.com/drelephant?characterEncoding=UTF-8
[info] application - Starting Application...
[info] play - Application started (Prod)
[info] play - Listening for HTTP on /0:0:0:0:0:0:0:0:9000
Connection exception has occurred [ java.net.ConnectException Connection refused (Connection refused) ]. Trying after 1 sec. Retry count = 1
Connection exception has occurred [ java.net.ConnectException Connection refused (Connection refused) ]. Trying after 2 sec. Retry count = 2
Connection exception has occurred [ java.net.ConnectException Connection refused (Connection refused) ]. Trying after 4 sec. Retry count = 3
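
Both logs end in "Connection refused" when Dr. Elephant tries to reach the Oozie server, so a quick way to narrow this down is to test from the core node whether the Oozie host and port accept TCP connections at all. Below is a minimal Python sketch of such a check. The default host and port in it are assumptions: 11000 is Oozie's usual default port, but the values should be taken from whatever URL is set in app-conf/SchedulerConf.xml. `can_connect` is a helper name made up for this example.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Assumed defaults; pass the real Oozie host and port as arguments.
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    port = int(sys.argv[2]) if len(sys.argv) > 2 else 11000
    state = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {state}")
```

Run on the core node with the master node IP as the first argument. If it reports the port as not reachable, the problem is likely the configured host or network access between the nodes rather than Dr. Elephant itself.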