Dr-Elephant not fetching RUNNING spark application (only succeeded and failed applications are fetched)

nelhaj commented 4 years ago

Hi,

Dr-Elephant only fetches completed applications (filtered by SUCCEEDED or FAILED status). Our Spark Streaming applications are always RUNNING, non-stop (except for weekly restarts). We want to be able to analyze them and generate real-time heuristics.

Why does Dr-Elephant exclude running applications? Is there a way to include them when fetching the job list?

More details:

We are using SparkFetcher.

Dr. Elephant gets the list of succeeded and failed applications only, from the YARN ResourceManager REST API:

com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The succeeded apps URL is https://{YARN_RM_HOST}/ws/v1/cluster/apps?finalStatus=SUCCEEDED&finishedTimeBegin=1598632623112&finishedTimeEnd=1598632683376
com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : The failed apps URL is https://{YARN_RM_HOST}/ws/v1/cluster/apps?finalStatus=FAILED&state=FINISHED&finishedTimeBegin=1598632623112&finishedTimeEnd=1598632683376

The data is then fetched from the Spark History Server API:

com.linkedin.drelephant.spark.fetchers.SparkRestClient : calling REST API at http://{SHS_HOST}/api/v1/applications/application_xxxxxxxxxx_xxxxxx
08-28-2020 18:56:09 INFO  [ForkJoinPool-1-worker-3] com.linkedin.drelephant.spark.fetchers.SparkRestClient : creating SparkApplication by calling REST API at http://{SHS_HOST}/api/v1/applications/application_xxxxxxxxxx_xxxxxx/1/logs to get eventlogs

Running Spark applications are available from both the YARN ResourceManager API and the Spark History Server. I can retrieve the event logs by accessing http://{SHS_HOST}/api/v1/applications/application_xxxxxxxxxx_xxxxxx/1/logs
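To make the request concrete, here is a minimal sketch of the two URLs involved when looking at running applications. It is not how Dr. Elephant builds its requests: the function names and example hosts are hypothetical, and it assumes the ResourceManager accepts the `states` and `applicationTypes` query parameters on `/ws/v1/cluster/apps`.

```python
from urllib.parse import urlencode


def running_apps_url(rm_base: str, app_type: str = "SPARK") -> str:
    """Build a ResourceManager query for applications still in the RUNNING state.

    Hypothetical helper: Dr. Elephant itself only queries finalStatus=SUCCEEDED/FAILED.
    """
    query = urlencode({"states": "RUNNING", "applicationTypes": app_type})
    return f"{rm_base.rstrip('/')}/ws/v1/cluster/apps?{query}"


def event_log_url(shs_base: str, app_id: str, attempt_id: str = "1") -> str:
    """Build the Spark History Server URL that returns the event logs of one attempt."""
    return f"{shs_base.rstrip('/')}/api/v1/applications/{app_id}/{attempt_id}/logs"


if __name__ == "__main__":
    # Example hosts are placeholders, not real endpoints.
    print(running_apps_url("https://rm.example.com:8090"))
    print(event_log_url("http://shs.example.com:18080", "application_1_0001"))
```

The second URL has the same shape as the one in the SparkRestClient log lines above; only the first differs from what Dr. Elephant requests today.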

Thank you

nelhaj commented 4 years ago

Hi @ShubhamGupta29, could you help us with this, please? PS: We have made good progress in implementing this feature, and it seems to work fine. We can see Spark Streaming heuristics. We are using the Spark FSFetcher. We will keep you posted on our progress.

I would like to know why Dr. Elephant does not support fetching RUNNING applications natively. Is there a reason for this choice (performance, technical constraints, ...)?

Thx

ShubhamGupta29 commented 4 years ago

Initially, Dr. Elephant was designed to profile a Hadoop job after it finishes. This idea carried over to the Spark heuristics too. But with the increased demand for Spark Streaming, we do recognize the importance of a tool to track your jobs' performance while they run.

The reason for not supporting Spark Streaming applications is the large logs. Currently, SHS doesn't provide any incremental parsing of logs, so if Dr. Elephant analyzes a RUNNING application at some short interval, it has to parse the whole log every time. With streaming jobs this issue becomes critical, as their log size keeps increasing. This would hog Dr. Elephant's resources and lead to delays in report generation, etc. For batch jobs, the lack of real-time profiling is not much of a problem, so RUNNING apps were never a priority, and supporting them in Dr. Elephant comes with these challenges.
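A rough back-of-the-envelope sketch of why re-parsing hurts: if the event log grows by a fixed amount between analysis runs and each run parses the whole log, the total volume parsed grows quadratically with the number of runs, whereas an incremental parser would grow linearly. The function name and the constant growth rate are assumptions made purely for illustration.

```python
def total_mb_parsed(growth_mb_per_run: float, runs: int, incremental: bool = False) -> float:
    """Total MB parsed over `runs` analysis runs of one ever-growing event log.

    Assumes the log grows by `growth_mb_per_run` between consecutive runs.
    """
    if incremental:
        # Each run parses only the newly appended part of the log.
        return growth_mb_per_run * runs
    # Run k parses the whole log so far (k * growth), so the total is
    # growth * (1 + 2 + ... + runs).
    return growth_mb_per_run * runs * (runs + 1) / 2


if __name__ == "__main__":
    print(total_mb_parsed(100, 24))                    # full re-parse on every run
    print(total_mb_parsed(100, 24, incremental=True))  # hypothetical incremental parsing
```

With an assumed growth of 100 MB per run, 24 runs parse 30,000 MB in total when the whole log is re-read each time, versus 2,400 MB if only the new part were parsed.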

I would be glad to know how you are approaching these challenges and would try to provide any needed assistance from my end.

nelhaj commented 4 years ago

Hi @ShubhamGupta29, thank you for the clear explanation, and sorry for the late reply. We are indeed facing the same performance issues when analyzing Spark Streaming apps.

We are trying to address these problems in the following ways:

Increase the analysis fetch interval for streaming applications (for example, spark.streaming.analysis.fetch.interval = 10 * analysis.fetch.interval; requires custom development).
Use FSFetcher instead of SparkFetcher. FSFetcher is much more stable, and it avoids the timeouts and memory overhead on the SHS.
Limit the event log file size (using the event_log_size_limit_in_mb parameter). A representative dataset of a few hours or days should be sufficient to produce relevant heuristics.
Read and parse event log files on disk instead of in memory (using LevelDB, for example; requires custom development). This should reduce RAM usage but will increase the analysis time.
Depending on the complexity, use separate queues for batch and streaming applications so that the analysis of batch applications is not delayed (requires custom development).
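Two of the ideas above, the longer fetch interval for streaming applications and the separate queues, can be sketched as follows. This is only an illustration under stated assumptions, not Dr. Elephant code: the `App` type, the function names, and the multiplier constant are all hypothetical, and spark.streaming.analysis.fetch.interval is the setting proposed in this thread rather than an existing option.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: streaming apps are analyzed 10x less often than batch apps,
# mirroring "spark.streaming.analysis.fetch.interval = 10 * analysis.fetch.interval".
STREAMING_INTERVAL_MULTIPLIER = 10


@dataclass
class App:
    app_id: str
    is_streaming: bool
    last_analyzed_ms: int


def fetch_interval_ms(app: App, base_interval_ms: int) -> int:
    """Return the analysis interval that applies to this application."""
    multiplier = STREAMING_INTERVAL_MULTIPLIER if app.is_streaming else 1
    return base_interval_ms * multiplier


def is_due(app: App, now_ms: int, base_interval_ms: int) -> bool:
    """True when enough time has passed since the last analysis of this app."""
    return now_ms - app.last_analyzed_ms >= fetch_interval_ms(app, base_interval_ms)


def route(apps: List[App], now_ms: int, base_interval_ms: int) -> Tuple[List[str], List[str]]:
    """Split the apps that are due into (batch_queue, streaming_queue)."""
    batch_queue: List[str] = []
    streaming_queue: List[str] = []
    for app in apps:
        if not is_due(app, now_ms, base_interval_ms):
            continue
        target = streaming_queue if app.is_streaming else batch_queue
        target.append(app.app_id)
    return batch_queue, streaming_queue
```

Keeping the two queues apart means a slow streaming analysis only delays other streaming analyses, which is the point of the last item in the list.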

Javid-Shaik commented 4 months ago

Hi @nelhaj, we also want to implement Spark Streaming job analysis, so could you please share how you achieved this?

Could you share how you modified Dr. Elephant to fetch and analyze running applications?

Do you have any additional tips or considerations for implementing this feature?

Your insights would be greatly appreciated.

Thank you

linkedin / dr-elephant

Dr-Elephant not fetching RUNNING spark application (only succeeded and failed applications are fetched) #696