nelhaj opened this issue 4 years ago (status: Open)
Hi @ShubhamGupta29: could you help us with this subject, please? PS: We have made good progress implementing this feature, and it seems to work fine; we can see the Spark Streaming heuristics. We are using the Spark FsFetcher. We will keep you posted on our progress.
I would like to know why Dr. Elephant does not support fetching RUNNING applications natively. Is there a reason for this choice (performance, technical constraints, ...)?
Thx
Initially, Dr. Elephant was designed to profile a Hadoop job after it finishes, and this idea carried over to the Spark heuristics as well. But with the increased demand for Spark Streaming, we do recognize the importance of a tool for tracking your jobs' performance.
The reason for not supporting Spark Streaming applications is the size of their logs. Currently, SHS does not provide incremental parsing of logs, so if Dr. Elephant analyzes a RUNNING application at short intervals, it has to re-parse the entire log every time. With streaming jobs this issue becomes critical, because their log size keeps growing. This would hog Dr. Elephant's resources and delay report generation. With batch jobs the need for real-time profiling is not felt as strongly, so these challenges have kept RUNNING apps unsupported in Dr. Elephant.
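To make the missing capability concrete, here is a minimal sketch of what incremental log parsing could look like: remember a byte offset and consume only the data appended since the last poll, instead of re-reading the whole ever-growing event log. This is purely illustrative; neither Dr. Elephant nor SHS works this way today, and the class name is hypothetical.

```python
import json


class IncrementalEventLogReader:
    """Hypothetical sketch of incremental event-log parsing.

    Dr. Elephant / SHS do NOT do this today; it only illustrates the
    capability whose absence is described above. Spark event logs store
    one JSON event per line, so each poll can return just the events
    appended since the previous poll.
    """

    def __init__(self, path):
        self.path = path
        self.offset = 0  # bytes already consumed on previous polls

    def poll(self):
        """Return only the JSON events appended since the last poll."""
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            chunk = f.read()
            self.offset = f.tell()  # remember where to resume next time
        return [json.loads(line) for line in chunk.splitlines() if line.strip()]
```

With this approach, polling a streaming app every minute costs work proportional to the new events only, rather than to the full log size, which is exactly why the lack of it makes frequent re-analysis of RUNNING streaming jobs so expensive.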
I would be glad to know how you are approaching these challenges and will provide any assistance I can from my end.
Hi @ShubhamGupta29, thank you for the clarification, and sorry for the late reply. In fact, we are facing the same performance issues when analyzing Spark Streaming apps.
We try to deal with these problems in the following way:
Hi @nelhaj
We also want to implement Spark Streaming job analysis, so could you please share how you achieved this?
Could you share how you modified Dr. Elephant to fetch and analyze running applications?
Any additional tips or considerations for implementing this feature would also be welcome.
Your insights would be greatly appreciated.
Thank you
Hi,
Dr. Elephant only fetches completed applications (filtered by SUCCEEDED or FAILED status). Our Spark Streaming applications run non-stop (except for weekly restarts). We want to be able to analyze them and generate heuristics in near real time.
Why does Dr. Elephant exclude running applications? Is there a way to include them when fetching the job list?
More details:
We are using SparkFetcher.
Dr. Elephant gets the list of only SUCCEEDED and FAILED applications from the YARN History Server API:
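For context, the YARN REST API lists applications at `/ws/v1/cluster/apps` and accepts a `states` filter, so including `RUNNING` in that filter is what would surface streaming jobs. The sketch below builds such a query URL; the host/port are placeholders, and the exact parameters Dr. Elephant uses internally are not reproduced here.

```python
from urllib.parse import urlencode


def yarn_apps_url(host, states, begin_ms=None, end_ms=None):
    """Build a YARN REST query listing applications.

    `states` filters by application state (e.g. FINISHED); adding
    RUNNING is what would surface live streaming jobs. Host/port are
    placeholders, not Dr. Elephant's actual configuration.
    """
    params = {"states": ",".join(states)}
    if begin_ms is not None:
        params["finishedTimeBegin"] = begin_ms
    if end_ms is not None:
        params["finishedTimeEnd"] = end_ms
    return "http://%s/ws/v1/cluster/apps?%s" % (host, urlencode(params))


# Completed apps only (roughly today's behaviour) vs. including RUNNING apps:
completed_only = yarn_apps_url("rm-host:8088", ["FINISHED"])
with_running = yarn_apps_url("rm-host:8088", ["FINISHED", "RUNNING"])
```

The design point is simply that the exclusion of running apps is a query-time filter, not a limitation of the YARN API itself.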
Then the data is fetched from the Spark History Server API:
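The Spark History Server's REST API (`/api/v1/applications`, per the Spark monitoring documentation) accepts a `status` parameter of `completed` or `running`, so running apps can be listed the same way completed ones are. A small sketch of building that query, with a placeholder host/port:

```python
from urllib.parse import urlencode


def shs_applications_url(host, status=None, limit=None):
    """Build a Spark History Server query for the application list.

    status may be "completed" or "running" per the SHS REST API;
    host/port are placeholders for illustration only.
    """
    base = "http://%s/api/v1/applications" % host
    params = {}
    if status is not None:
        params["status"] = status
    if limit is not None:
        params["limit"] = limit
    return base + ("?" + urlencode(params) if params else "")


running_apps = shs_applications_url("shs-host:18080", status="running")
```

A fetcher wanting to include streaming jobs would presumably issue this `status=running` query alongside the existing completed-app queries.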
Running Spark applications are available in both the YARN HS and the Spark HS. I can retrieve event logs by accessing http://{SHS_HOST}/api/v1/applications/application_xxxxxxxxxx_xxxxxx/1/logs
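That `.../logs` endpoint returns a zip archive of the application's event-log files, each holding one JSON event per line. The sketch below shows how such a response body could be unpacked in memory; it builds a stand-in zip locally instead of performing a real HTTP GET, and the archive entry name is made up for illustration.

```python
import io
import json
import zipfile


def events_from_logs_zip(zip_bytes):
    """Unpack a zip like the one the SHS logs endpoint returns and
    collect the JSON events from every event-log file inside it."""
    events = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            for line in zf.read(name).splitlines():
                if line.strip():
                    events.append(json.loads(line))
    return events


# Stand-in for the HTTP response body (a real fetch would GET the URL above):
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("eventLogs-app/events",  # hypothetical entry name
                json.dumps({"Event": "SparkListenerApplicationStart"}) + "\n")
parsed_events = events_from_logs_zip(buf.getvalue())
```

Once unpacked, the events are in the same shape the heuristics already consume, which is why fetching logs for a RUNNING app through this endpoint looks feasible.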
Thank you