linkedin / dr-elephant

Dr. Elephant is a job and flow-level performance monitoring and tuning tool for Apache Hadoop and Apache Spark
Apache License 2.0

Unable to get correct metrics for spark #389

Open Parth59 opened 6 years ago

Parth59 commented 6 years ago

Hi, I know that getting Spark to work correctly with dr-elephant is already discussed in #327. Following those details, I tried using the Spark REST client, through which I am able to extract Spark job data. But in the UI all Spark metrics are displayed as 0, as shown in the attached screenshot. Can anyone please post the steps for configuring Spark 2.x to work correctly with Dr. Elephant?

[Screenshot, 2018-05-24: Dr. Elephant UI showing all Spark metric values as 0]
simul-tion commented 6 years ago

Hi, I have the same question. Only the Spark Configuration metric is being flagged (amber, asking for further tuning on some of the jobs); the rest of the heuristics are mostly green with null/0 values in their metric stats. Is it a config issue or a Spark 2.x / Dr. Elephant compatibility issue?

Regards

Seandity commented 6 years ago

You should modify the config to set the Spark version to 2.x, and then update the Spark core API code for 2.x. There is currently no Spark 2.x API interface, so you will have to modify it yourself. Hope this helps.
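(For context, a minimal sketch of where the Spark version is set at build time, assuming a stock dr-elephant checkout where compile.sh reads versions from compile.conf; the keys and values below are illustrative, so check the compile.conf in your own copy.)

    # compile.conf at the root of the dr-elephant checkout (keys/values illustrative)
    hadoop_version=2.7.6
    spark_version=2.1.2

    # then build with
    ./compile.sh compile.conf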

simul-tion commented 6 years ago

@Seandity Thanks for your response. Could you be more specific? I'm not sure I understand the proposed approach. Could you perhaps share an example of what you are suggesting?

Regards

simul-tion commented 6 years ago

Hi,

@Parth59 Did you find a way through this?

@Seandity @shkhrgpt @akshayrai Referring to #327, can someone confirm whether these metrics depend on the open item SPARK-23206? If not, I would really appreciate it if you could share how to sort this out.

Regards

simul-tion commented 6 years ago

Hi,

I would appreciate it if someone could advise on this.

Regards

shkhrgpt commented 6 years ago

Sorry for the late response. This issue is happening because Dr. Elephant does not support Spark 2.x apps. What makes it confusing is that you can see Spark 2.x apps in the Dr. Elephant UI, but their data is incomplete. The data is incomplete because the fetcher only partially processes the event logs, and instead of failing it uses the partial data to produce a result. If you inspect the Dr. Elephant logs you will see the parsing exceptions. I hope this helps.
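(If it helps, a quick hedged way to look for those parsing exceptions, assuming the daemon writes to dr_elephant.log as mentioned later in this thread; adjust the path to your deployment.)

    # path is an assumption; point it at your Dr. Elephant log file
    grep -i -A 5 "Exception parsing Spark event log" dr_elephant.log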

prachi2396 commented 6 years ago

Hi,

@shkhrgpt Do we have a workaround for this issue? How do we enable parsing of Spark 2.x logs correctly?

simul-tion commented 6 years ago

@shkhrgpt thanks for the update.

Observation and questions

  1. I just verified by spawning a new instance of Dr. Elephant against Spark 1.6, and one of the Spark jobs did show suggestions from Dr. Elephant for the other metrics as well.

  2. If Dr. Elephant doesn't support 2.x, then what does #327 point to? Can you clarify?

  3. Do you have visibility into what effort is needed for Dr. Elephant to support 2.x, or is there something already in the pipeline, assuming it doesn't support 2.x as of now?

thanks in advance.

Regards

simul-tion commented 6 years ago

Hello @akshayrai @shkhrgpt

I would appreciate it if you could please clarify this.

Regards

shkhrgpt commented 6 years ago

I don't think there is an easy workaround to support Spark 2.x.

PR #327 recommends using a custom Spark History Server (SHS) that would provide stable REST APIs to support Spark 2.x in Dr. Elephant. However, as far as I know, not all of the changes required for this custom SHS have been checked into the open source Spark project. Maybe @akshayrai can provide more detail about this.

I think that to support Spark 2.x we need to extend the parser for the event logs. Most of the parsing logic is implemented in the SparkDataCollection class, which uses various Spark listeners to replay the event logs. The issue is that SparkDataCollection assumes Spark 1.6 when it uses listeners and other related Spark classes. To support Spark 2.x, we could either make SparkDataCollection compatible with Spark 2.x as well, or make it independent of the Spark version.
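(To illustrate the listener-based replay pattern that SparkDataCollection relies on, here is a minimal hedged Scala sketch: a custom SparkListener that counts completed stages while an event log is replayed. This is not Dr. Elephant's actual code, and ReplayListenerBus is Spark-internal with a replay signature that changed between 1.x and 2.x, which is exactly why the class is version-sensitive.)

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    // Tallies completed stages as events from the log are replayed through it.
    class StageCountListener extends SparkListener {
      var completedStages = 0
      override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
        completedStages += 1
      }
    }

    // Replaying the log goes through Spark's internal ReplayListenerBus
    // (private[spark], which is why such code ends up under the
    // org.apache.spark namespace). Roughly:
    //
    //   val bus = new ReplayListenerBus()
    //   bus.addListener(new StageCountListener)   // register the listener
    //   bus.replay(inputStream, sourceName)       // Spark 1.x-style call; 2.x adds
    //                                             // extra parameters, incl. an events filter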

songgane-zz commented 6 years ago

As mentioned by @shkhrgpt, this is caused by the inability to parse Spark 2.x SHS event logs. The SparkDataCollection class processes SHS event logs assuming Spark 1.4, so Spark 2.x logs cause errors due to the newly added SHS event types.

As a temporary measure, I changed the sbt Spark version to 2.2.1 or lower, and when calling ReplayListenerBus.replay I made use of the ReplayEventsFilter so that Spark 2.x logs are supported, with the newly added 2.x SHS event types being skipped.

If you do not mind, feel free to refer to my forked source (https://github.com/songgane/dr-elephant/tree/feature/support_spark_2.x).
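(To make the filtering idea concrete, a hedged Scala sketch of a ReplayEventsFilter: in Spark 2.x it is just a String => Boolean applied to the raw JSON line of each event, returning true to replay it. The event names listed are an assumption about which 2.x additions to skip, not necessarily the exact set in songgane's branch.)

    // Event types assumed to be new in Spark 2.x for this sketch; the real list
    // should come from whatever events actually fail to parse in your logs.
    val eventsAddedIn2x = Seq(
      "SparkListenerExecutorBlacklisted",
      "SparkListenerExecutorUnblacklisted",
      "SparkListenerNodeBlacklisted",
      "SparkListenerNodeUnblacklisted"
    )

    // ReplayEventsFilter in Spark 2.x: (String) => Boolean over the raw JSON line.
    val skipNewEvents: String => Boolean =
      line => !eventsAddedIn2x.exists(name => line.contains(name))

    // Passed as the extra argument to ReplayListenerBus.replay in Spark 2.x, e.g.
    //   bus.replay(in, sourceName, maybeTruncated = false, eventsFilter = skipNewEvents)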

shkhrgpt commented 6 years ago

Thanks, @songgane, for sharing your fix for Spark 2.x support. Would it be possible for you to submit a PR for this change?

simul-tion commented 6 years ago

@songgane I tried the shared source, but it fails to compile. Is there anything I should modify before compiling? (I tried with the default configuration: Spark 1.4 and Hadoop 2.3.)

ritika11 commented 6 years ago

@songgane @shkhrgpt @akshayrai I am running Dr. Elephant on my Spark cluster. However, I only see the following heuristics for my jobs: Spark Configuration, Spark Executor Metrics, Spark Job Metrics, Spark Stage Metrics, and Executor GC.

Do we have any additional metrics on CPU/memory utilization here?

ethanhunt07 commented 6 years ago

@ritika11 Are you able to view the metrics with Spark 2+ or Spark 1.6? I am still getting metrics with a value of 0 with Spark 2.

ethanhunt07 commented 6 years ago

@songgane I tried your code but it fails to compile. Do we need to compile it with Spark 2+ or 1.x?

ritika11 commented 6 years ago

@ethanhunt07 Yes, I am able to view the Spark heuristics with Spark 2+. However, I am looking for options for adding more metrics and heuristics in the code.

songgane-zz commented 6 years ago

@ethanhunt07 To get the Spark aggregated metric values, you need to set the spark.executor.instances and spark.executor.memory options. Did you set those options? If spark.executor.instances and spark.executor.memory are not present in the application's configuration, the resulting value is zero; the Spark default property values are not used.
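(For reference, a hedged example of setting these two properties explicitly; the values are placeholders, so tune them for your cluster. However you set them, they need to end up in the application's recorded configuration for the aggregation to see them.)

    # spark-defaults.conf (values illustrative)
    spark.executor.instances  4
    spark.executor.memory     4g

    # or per job on the command line
    spark-submit --conf spark.executor.instances=4 --conf spark.executor.memory=4g ...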

songgane-zz commented 6 years ago

@Pravdeep Did you compile the feature/support_spark_2.x branch? I used Spark 2.1.2 and Hadoop 2.3.0. Because my code uses Spark 2.x features, the Spark version needs to be set to 2.x+.

simul-tion commented 6 years ago

@songgane Compilation failed with Spark 1.4 and Hadoop 2.3, so I tried with Spark 2.1.2 and Hadoop 2.3.0; compilation still fails because it is unable to resolve dependencies. Is there anything else that needs to be set to compile your source code? (Attaching the logs: Log.txt)

songgane-zz commented 6 years ago

@Pravdeep Judging by your log message, it seems there is a problem with the certificate. You must add a valid certificate to your JVM's trust store. If you google the error message, you will find a lot of information.

[error] Server access Error: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target url=https://repo1.maven.org/maven2/org/apache/geronimo/specs/geronimo-jms_1.1_spec/1.1.1/geronimo-jms_1.1_spec-1.1.1.pom
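(A hedged example of one common fix for that PKIX error: import the certificate presented to your machine for repo1.maven.org, which may be a corporate proxy certificate, into the JVM truststore used by the build. Paths, alias, and password are assumptions; adapt them to your environment.)

    # Grab the certificate actually presented to this machine
    openssl s_client -connect repo1.maven.org:443 -showcerts </dev/null 2>/dev/null \
      | openssl x509 -outform PEM > repo1-maven.pem

    # Import it into the JVM truststore the build uses (default password "changeit")
    keytool -importcert -alias repo1-maven -file repo1-maven.pem \
      -keystore "$JAVA_HOME/jre/lib/security/cacerts" -storepass changeit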

simul-tion commented 6 years ago

@songgane Yep, I noticed, but do you know why your build requires or has this dependency? I didn't have to include any certs when I built the generally available dr-elephant branch (which I'm currently running). Do I need to take care of any conf changes while building your source, or would it be possible for you to share an already built version of your branch that supports Spark 2.x metrics?

songgane-zz commented 6 years ago

@Pravdeep The difference between my branch and the dr-elephant master branch is nothing more than the library versions. Certificate problems are usually caused by your build environment. Perhaps the libraries are managed privately through a repository manager such as Nexus, or the HTTPS service is blocked due to security policies.

Windyhe commented 6 years ago

I get almost the same result as Parth's. I compiled Dr. Elephant with Hadoop 2.7.6 and Spark 1.6.2, and run it against Hadoop 2.7.6 and Spark 2.3.0. It's OK with Hadoop/Java jobs but not with Spark jobs. I have checked dr_elephant.log as follows:

08-10-2018 14:46:41 INFO [dr-el-executor-thread-1] org.apache.spark.deploy.history.SparkFSFetcher$ : Replaying Spark logs for application: application_1533540053870_0023 withlogPath: webhdfs://algo:50070/tmp/spark/events/application_1533540053870_0023.lz4 with codec:Some(org.apache.spark.io.LZ4CompressionCodec@4f5f47dd)

It did not report any error while replaying the Spark event log, but the heuristics all seem to be 0. Here is part of the event log's content. Is there something wrong?

{"Event":"SparkListenerExecutorAdded","Timestamp":1533723506979,"Executor ID":"1","Executor Info":{"Host":"algo","Total Cores":1,"Log Urls":{"stdout":"http://algo:8042/node/containerlogs/container_1533540053870_0015_01_000002/algo/stdout?start=-4096","stderr":"http://algo:8042/node/containerlogs/container_1533540053870_0015_01_000002/algo/stderr?start=-4096"}}} {"Event":"SparkListenerTaskStart","Stage ID":0,"Stage Attempt ID":0,"Task Info":{"Task ID":0,"Index":0,"Attempt":0,"Launch Time":1533723506982,"Executor ID":"1","Host":"algo","Locality":"NODE_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Killed":false,"Accumulables":[]}} {"Event":"SparkListenerBlockManagerAdded","Block Manager ID":{"Executor ID":"1","Host":"algo","Port":34015},"Maximum Memory":4392694579,"Timestamp":1533723507032,"Maximum Onheap Memory":4392694579,"Maximum Offheap Memory":0} {"Event":"SparkListenerExecutorAdded","Timestamp":1533723508088,"Executor ID":"2","Executor Info":{"Host":"algo","Total Cores":1,"Log Urls":{"stdout":"http://algo:8042/node/containerlogs/container_1533540053870_0015_01_000003/algo/stdout?start=-4096","stderr":"http://algo:8042/node/containerlogs/container_1533540053870_0015_01_000003/algo/stderr?start=-4096"}}} {"Event":"SparkListenerTaskStart","Stage ID":0,"Stage Attempt ID":0,"Task Info":{"Task ID":1,"Index":1,"Attempt":0,"Launch Time":1533723508089,"Executor ID":"2","Host":"algo","Locality":"NODE_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Killed":false,"Accumulables":[]}} {"Event":"SparkListenerBlockManagerAdded","Block Manager ID":{"Executor ID":"2","Host":"algo","Port":41386},"Maximum Memory":4392694579,"Timestamp":1533723508146,"Maximum Onheap Memory":4392694579,"Maximum Offheap Memory":0}

songgane-zz commented 6 years ago

@Windyhe If the spark.executor.instances or spark.executor.memory values are not set, aggregation doesn't work.

YunKillerE commented 5 years ago

@songgane Hi, how do I set the spark.executor.instances and spark.executor.memory values? Thanks!

I added them to spark-defaults.conf, but in the UI all Spark metrics are still displayed as 0...

[Two screenshots attached]

ankurchourasiya commented 5 years ago

> @ethanhunt07 Yes, I am able to view the Spark heuristics with Spark 2+. However, I am looking for options for adding more metrics and heuristics in the code.

Can you please share your Fetcher.xml file content? I am also trying to analyze Spark 2.3 jobs, but I am facing this issue:

[error] o.a.s.s.ReplayListenerBus - Exception parsing Spark event log: application_1510469066221_0020 org.json4s.package$MappingException: Did not find value which can be converted into boolean

Help is much appreciated.
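(For anyone landing here: a hedged sketch of what a Spark entry in Dr. Elephant's fetcher configuration, app-conf/FetcherConf.xml in typical checkouts, can look like. The classname is the stock SparkFetcher shipped with the project; the param shown is an assumption and may not exist in your version, so compare against the file bundled with your build. Note that the MappingException above is the Spark 1.x-era parser choking on a 2.x event log, which configuration alone will not fix without a code change like the one in songgane's branch.)

    <fetcher>
      <applicationtype>spark</applicationtype>
      <classname>com.linkedin.drelephant.spark.fetchers.SparkFetcher</classname>
      <!-- param below is illustrative; verify against the FetcherConf.xml in your build -->
      <params>
        <event_log_location_uri>webhdfs://namenode:50070/tmp/spark/events</event_log_location_uri>
      </params>
    </fetcher>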