linkedin / dr-elephant

Dr. Elephant is a job and flow-level performance monitoring and tuning tool for Apache Hadoop and Apache Spark
Apache License 2.0

Dr-Elephant not fetching RUNNING spark application (only succeeded and failed applications are fetched) #696

Open nelhaj opened 4 years ago

nelhaj commented 4 years ago

Hi,

Dr-Elephant is only fetching completed applications (filtered by SUCCEEDED or FAILED status). Our Spark Streaming applications are always RUNNING non-stop (except for weekly restarts). We want to be able to analyze them and generate real-time heuristics.

Why does Dr. Elephant exclude running applications? Is there a way to include them when fetching the job list?

More details:

Running Spark applications are available in both the YARN HS and the Spark HS. I can retrieve the event logs by accessing http://{SHS_HOST}/api/v1/applications/application_xxxxxxxxxx_xxxxxx/1/logs
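For readers following along, here is a minimal Python sketch of the lookup described above. It relies on the Spark History Server REST API, which accepts `?status=running` on `/api/v1/applications` and serves event logs under `/applications/[app-id]/[attempt-id]/logs`. The helper names (`applications_url`, `event_log_urls`, `list_running`) and the host `http://shs:18080` are made up for illustration and are not part of Dr. Elephant.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def applications_url(shs_base, status="running"):
    """Build the Spark History Server URL that lists applications by status."""
    query = urlencode({"status": status})
    return f"{shs_base.rstrip('/')}/api/v1/applications?{query}"


def event_log_urls(shs_base, apps):
    """Return one event-log download URL per application attempt."""
    base = shs_base.rstrip("/")
    urls = []
    for app in apps:
        for attempt in app.get("attempts", []):
            parts = [base, "api/v1/applications", app["id"]]
            # attemptId is only present for YARN cluster-mode applications.
            if "attemptId" in attempt:
                parts.append(str(attempt["attemptId"]))
            parts.append("logs")
            urls.append("/".join(parts))
    return urls


def list_running(shs_base):
    """Fetch the running applications from a live history server (network call)."""
    with urlopen(applications_url(shs_base)) as resp:
        return json.load(resp)
```

Calling `event_log_urls("http://shs:18080", list_running("http://shs:18080"))` would then give the download URLs for every running attempt, assuming the history server is reachable at that hypothetical address.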

Thank you

nelhaj commented 4 years ago

Hi @ShubhamGupta29, could you help us with this, please? PS: We have made good progress in implementing this feature, and it seems to work fine: we can see Spark Streaming heuristics. We are using the Spark FsFetcher. We will keep you posted on our progress.

I would like to know why Dr. Elephant does not support fetching RUNNING applications natively. Is there a reason for this choice (performance, technical constraints, ...)?

Thx

ShubhamGupta29 commented 4 years ago

Initially, Dr. Elephant was designed to profile a Hadoop job after it finishes. This idea carried over to the Spark heuristics too. But with the increased demand for Spark Streaming, we do recognize the importance of a tool to track your jobs' performance.

The reason for not supporting Spark Streaming applications is the large logs. Currently, SHS doesn't provide any incremental parsing of logs, so if Dr. Elephant analyzes a RUNNING application at short intervals, it has to parse the whole log every time. With streaming jobs this issue becomes critical, as their log size keeps increasing. This would hog Dr. Elephant's resources and lead to delays in report generation, etc. With batch jobs, real-time profiling is missed less. So there are real challenges in supporting RUNNING apps in Dr. Elephant.

I would be glad to know how you are approaching these challenges, and I will try to provide any needed assistance from my end.
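To make the cost argument concrete, here is a rough Python sketch of what incremental parsing could look like: remember the byte offset reached on the previous pass and parse only the lines appended since then. This is purely illustrative and is not something SHS or Dr. Elephant provides. It assumes an uncompressed event log (Spark writes one JSON object per line, each with an `"Event"` field) that is readable as a local file. The function name `read_new_events` is hypothetical, and a compressed log or one stored on HDFS would need a different reader.

```python
import json


def read_new_events(path, offset=0):
    """Parse only the event-log lines appended since `offset` (a byte position).

    Returns (events, new_offset). A trailing line without a newline is treated
    as still being written and is left for the next call.
    """
    events = []
    with open(path, "rb") as f:
        f.seek(offset)
        for raw in f:
            if not raw.endswith(b"\n"):
                break  # partial line: the writer has not finished it yet
            offset += len(raw)
            line = raw.strip()
            if line:
                events.append(json.loads(line))
    return events, offset
```

Each analysis pass would store the returned offset and hand it back on the next call, so the work per pass grows with the number of new events rather than with the total log size. The heuristics would still need to keep aggregated state between passes, which this sketch does not cover.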

nelhaj commented 3 years ago

Hi @ShubhamGupta29, thank you for the clear explanation, and sorry for the late reply. In fact, we are also facing the same performance issues when analyzing Spark Streaming apps.

We try to deal with these problems in the following way:

Javid-Shaik commented 3 months ago

Hi @nelhaj, we also want to implement Spark Streaming job analysis, so could you please share how you achieved this?

Could you share how you modified Dr. Elephant to fetch and analyze running applications?

Do you have any additional tips or considerations for implementing this feature?

Your insights would be greatly appreciated.

Thank you