linkedin / dr-elephant

Dr. Elephant is a job and flow-level performance monitoring and tuning tool for Apache Hadoop and Apache Spark
Apache License 2.0
1.35k stars · 859 forks

Spark 3/ local / Uncompressed #716

Open AbdelrahmanMosly opened 1 year ago

AbdelrahmanMosly commented 1 year ago

PR #357: Uncompressed File Support for Dr Elephant

Using Local Event Logs

Initially, Dr. Elephant used the YARN Resource Manager to discover submitted jobs. We modified it to read from local Spark event logs instead.

If the environment variable USE_YARN is set to true, Dr. Elephant will still use the YARN Resource Manager; in that case it reads and checks the logs from the Hadoop history server (via the YARN Resource Manager).
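The toggle described above can be sketched as follows (a minimal illustration; the class and method names are hypothetical, only the USE_YARN variable comes from this PR):

```java
import java.util.Map;

// Hypothetical sketch of the USE_YARN toggle described above.
public class EventLogSource {
    // Returns true only when USE_YARN is explicitly set to "true";
    // otherwise Dr. Elephant falls back to reading local Spark event logs.
    static boolean useYarn(Map<String, String> env) {
        return Boolean.parseBoolean(env.getOrDefault("USE_YARN", "false"));
    }
}
```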

Using Uncompressed Files

Dr. Elephant originally processed compressed event-log files through a compression codec. We extended it to also read uncompressed files.
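The idea can be sketched roughly as follows (hypothetical names, not the PR's actual code): pick a decompressing stream when the file's extension indicates a codec, and fall through to the raw stream otherwise.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch: wrap the raw stream in a codec only when the
// file name indicates compression; otherwise read it uncompressed.
public class EventLogStreams {
    static InputStream open(String fileName, InputStream raw) throws IOException {
        if (fileName.endsWith(".gz")) {
            return new GZIPInputStream(raw); // compressed event log
        }
        return raw; // uncompressed event log: no codec needed
    }
}
```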

Spark and Hadoop Versions

Dr. Elephant is designed to run on Spark 1.4.0 and Hadoop 2.3.0. However, issues arose when reading event logs generated by Spark 3, which introduced a new listener event that the ReplayListenerBus of Spark 1.4.0 could not identify. To address this, we implemented a workaround that skips the SparkListenerResourceProfileAdded events.
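The workaround amounts to filtering those event-log lines out before they reach the old ReplayListenerBus; a hedged sketch (the string check is illustrative, not the PR's exact code):

```java
// Illustrative sketch of the workaround: drop event-log lines for the
// Spark 3-only SparkListenerResourceProfileAdded event before replay,
// since Spark 1.4's ReplayListenerBus cannot parse them.
public class ReplayLineFilter {
    static boolean shouldReplay(String jsonLine) {
        return !jsonLine.contains("\"Event\":\"SparkListenerResourceProfileAdded\"");
    }
}
```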

Fetchers Configuration

We identified the Spark event logs directory and disabled the Tez Fetcher in the FetcherConf.xml configuration.
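A FetcherConf.xml along those lines might look like this (a sketch following the stock Dr. Elephant config layout; the path is illustrative, and the Tez fetcher is simply commented out):

```xml
<fetchers>
  <!-- Spark fetcher reading event logs from a local directory (illustrative path) -->
  <fetcher>
    <applicationtype>spark</applicationtype>
    <classname>com.linkedin.drelephant.spark.fetchers.FSFetcher</classname>
    <params>
      <event_log_location_uri>file:///var/log/spark-events</event_log_location_uri>
    </params>
  </fetcher>
  <!-- Tez fetcher disabled, as described above
  <fetcher>
    <applicationtype>tez</applicationtype>
    <classname>com.linkedin.drelephant.tez.fetchers.TezFetcher</classname>
  </fetcher>
  -->
</fetchers>
```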

Javid-Shaik commented 5 months ago

Hi AbdelrahmanMosly, I am facing an issue while trying to fetch the spark-event logs from the local file system. It gives me the error below:

  [info] play - Listening for HTTP on /0:0:0:0:0:0:0:0:9001
  Event Log Location URI: /Users/shaikbasha/spark-events
  java.io.FileNotFoundException: File /Users/shaikbasha/spark-events does not exist.
      at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1144)
      at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1122)
      at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1067)
      at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1063)
      at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1063)
      at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2002)
      at org.apache.hadoop.fs.FileSystem$5.<init>(FileSystem.java:2129)
      at org.apache.hadoop.fs.FileSystem.listFiles(FileSystem.java:2127)
      at com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2.fetchAnalyticsJobsFromEventLogs(AnalyticJobGeneratorHadoop2.java:261)
      at com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2.fetchAnalyticJobs(AnalyticJobGeneratorHadoop2.java:291)
      at com.linkedin.drelephant.ElephantRunner$1.run(ElephantRunner.java:190)
      at com.linkedin.drelephant.ElephantRunner$1.run(ElephantRunner.java:153)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:360)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1918)
      at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:109)
      at com.linkedin.drelephant.ElephantRunner.run(ElephantRunner.java:153)
      at com.linkedin.drelephant.DrElephant.run(DrElephant.java:67)
      at java.lang.Thread.run(Thread.java:750)

When I run dr-elephant with HDFS it works fine and the event-log data shows up in the UI, but I want to fetch the spark-events from the local file system (or any directory, not necessarily HDFS; e.g. gs://). Is that possible? Can you please help me with this? It is a bit urgent, so please look into it as soon as possible.

AbdelrahmanMosly commented 5 months ago

@Javid-Shaik

  1. File Existence: Verify that the directory /Users/shaikbasha/spark-events exists and contains the Spark event logs.

  2. Fetcher Configuration: Update the fetcher configuration in Dr. Elephant to point to the local file system path. This involves editing the configuration files; for example, in app-conf/FetcherConf.xml, ensure event_log_location_uri is set to the correct local path.

  3. Permissions: Ensure that the user running Dr. Elephant has the necessary permissions to read the files in the specified directory.

  4. Configuration Files: Make sure all other necessary configurations are set up as per the Dr. Elephant setup instructions.
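For item 2 above, a FetcherConf.xml entry pointing the Spark fetcher at a local path might look like this (a sketch based on the stock config format; the path is the example from this thread):

```xml
<fetcher>
  <applicationtype>spark</applicationtype>
  <classname>com.linkedin.drelephant.spark.fetchers.FSFetcher</classname>
  <params>
    <!-- local file system path to the Spark event logs (example) -->
    <event_log_location_uri>file:///Users/shaikbasha/spark-events</event_log_location_uri>
  </params>
</fetcher>
```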

Javid-Shaik commented 5 months ago

@AbdelrahmanMosly

The directory exists and the spark-event logs are present in it:

  shaikbasha@C02G144RMD6M dr-elephant-2.1.7 % ls /Users/shaikbasha/spark-events | tail -n 10
  spark-034778d6e9844b97b5fc4217197e0d91
  spark-19ce088f2e7b4443a09b32ee1082e546
  spark-46a7d8db9504453a816a6d1a98884709
  spark-4a5a8432e5c7452e8638de54c8db1297
  spark-6befa09607c249e2aa0fc5d2e650f814
  spark-a823dfda6b6d4d7481a2f3065de0201e
  spark-dcf713ca380a41ffbfc578e379c50f59

FetcherConf.xml:

  <fetcher>
    <applicationtype>spark</applicationtype>
    <classname>com.linkedin.drelephant.spark.fetchers.FSFetcher</classname>
    <params>
      <event_log_location_uri>/Users/shaikbasha/spark-events</event_log_location_uri>
    </params>
  </fetcher>

Permissions are provided:

  shaikbasha@C02G144RMD6M dr-elephant-2.1.7 % ls -l /Users/shaikbasha/spark-events | tail -n 5
  -rw-r--r-- 1 shaikbasha staff 51674335 Jun 21 10:34 spark-46a7d8db9504453a816a6d1a98884709
  -rw-r--r-- 1 shaikbasha staff    87653 Jun 20 18:02 spark-4a5a8432e5c7452e8638de54c8db1297

And I have configured the Spark and Hadoop configuration files correctly.

Please help me with this. As I have already mentioned, dr-elephant works fine with HDFS; I want it to work with the local FS.

The error below is from the dr_elephant.log file:

  06-24-2024 20:28:10 INFO [Thread-8] com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Event log directory file:///Users/shaikbasha/spark-events
  06-24-2024 20:28:10 ERROR [Thread-8] com.linkedin.drelephant.ElephantRunner : Error fetching job list. Try again later...
  java.lang.IllegalArgumentException: Wrong FS: file:/Users/shaikbasha/spark-events, expected: hdfs://localhost:8020
      at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:779)

AbdelrahmanMosly commented 5 months ago

@Javid-Shaik

First, I recommend that you check this commit: https://github.com/linkedin/dr-elephant/pull/716/commits/eb6092bff701e4c4afea34eb8e22c23983934781

Based on the error message Wrong FS: file:/Users/shaikbasha/spark-events, expected: hdfs://localhost:8020, it seems that Dr. Elephant is configured to expect HDFS by default, but you're trying to fetch logs from the local file system. This discrepancy causes the error.

  1. Check core-site.xml Configuration:

    • Ensure that the Hadoop configuration (core-site.xml) is set up to handle local file system paths.
    • You might need to specify fs.defaultFS as file:/// for the local file system. Here’s an example configuration for core-site.xml:

      <configuration>
        <property>
          <name>fs.defaultFS</name>
          <value>file:///</value>
        </property>
      </configuration>
  2. Check for Hardcoded HDFS References:

    • Review the Dr. Elephant source code or configurations for any hardcoded references to HDFS. Ensure that these are flexible enough to support the local file system.
  3. Ensure Correct FileSystem Class:

    • Ensure the correct FileSystem implementation is being used. For the local file system, LocalFileSystem should be used instead of HDFS. You may need to set this explicitly in your configuration.
  4. Restart Dr. Elephant:

    • After making these changes, restart Dr. Elephant to apply the new configurations.
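For intuition on the "Wrong FS" error above: Hadoop's FileSystem.checkPath compares the scheme of each path against the file system resolving it, roughly like this simplified sketch (illustrative only, not Hadoop's actual code):

```java
// Simplified illustration of why "Wrong FS" is thrown: the path's scheme
// must match the scheme of the FileSystem instance resolving it.
public class SchemeCheck {
    static void checkPath(String pathUri, String fsUri) {
        String pathScheme = pathUri.substring(0, pathUri.indexOf(':'));
        String fsScheme = fsUri.substring(0, fsUri.indexOf(':'));
        if (!pathScheme.equals(fsScheme)) {
            throw new IllegalArgumentException(
                "Wrong FS: " + pathUri + ", expected: " + fsUri);
        }
    }
}
```

With fs.defaultFS left at hdfs://localhost:8020, a file:/ path fails this check; setting it to file:/// makes the schemes agree.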
Javid-Shaik commented 5 months ago

Thank you @AbdelrahmanMosly. After changing fs.defaultFS to file:/// in core-site.xml, I was able to get the data into the dr-elephant UI. I also observed that we only need to start the Spark history server; there is no need to start the mr-jobhistory-server.

But then I am getting this error in the dr.log:

  Event Log Location URI: /Users/shaikbasha/spark-events
  [error] o.a.s.s.ReplayListenerBus - Exception parsing Spark event log: file:/Users/shaikbasha/spark-events/spark-034778d6e9844b97b5fc4217197e0d91
  org.json4s.package$MappingException: Did not find value which can be converted into boolean
      at org.json4s.reflect.package$.fail(package.scala:96) ~[org.json4s.json4s-core_2.10-3.2.10.jar:3.2.10]
  [error] o.a.s.s.ReplayListenerBus - Malformed line #9: {"Event":"SparkListenerJobStart" ...... }

Can you please help in resolving this error?

AbdelrahmanMosly commented 5 months ago

@Javid-Shaik I believe the issue is related to differences in Spark versions. Spark 1.x, 2.x, and 3.x have variations in the event listeners they use. Dr. Elephant was originally designed for Spark 1.x and was later adapted for Spark 2.x in some pull requests.

If you check my PR, you'll see that to make Dr. Elephant compatible with Spark 3.x, I had to modify the listeners. Spark 3.x introduced new listeners and removed some of the existing ones, which required adjustments in the event log parsing logic.

Additionally, you can check these commits:

https://github.com/linkedin/dr-elephant/pull/716/commits/71e6f2c4da0b8ea7f29521af75eb886eab54f508

https://github.com/linkedin/dr-elephant/pull/716/commits/a1e6c67c72e7ace5ed9ddbcd33543ccedeb71250
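The listener adjustments in those commits come down to tolerating event types the old JsonProtocol does not know. A hypothetical sketch of such a skip list (the two event names are the ones reported in this thread; the surrounding code is illustrative, not the commits' actual implementation):

```java
import java.util.Set;

// Hypothetical sketch: skip event-log lines whose event type is not
// parseable by Spark 1.4's JsonProtocol, instead of failing the replay.
public class UnknownEventFilter {
    static final Set<String> UNPARSEABLE = Set.of(
        "SparkListenerResourceProfileAdded",
        "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart");

    static boolean shouldReplay(String jsonLine) {
        for (String event : UNPARSEABLE) {
            if (jsonLine.contains("\"Event\":\"" + event + "\"")) {
                return false;
            }
        }
        return true;
    }
}
```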

Javid-Shaik commented 5 months ago

@AbdelrahmanMosly I have done everything as you told me, but now I am getting this new error:

  [error] o.a.s.s.ReplayListenerBus - Exception parsing Spark event log: file:/Users/shaikbasha/spark-events/spark-46a7d8db9504453a816a6d1a98884709
  scala.MatchError: org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart (of class java.lang.String)
      at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:466) ~[org.apache.spark.spark-core_2.10-1.4.0.jar:1.4.0]
      at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) ~[org.apache.spark.spark-core_2.10-1.4.0.jar:1.4.0]
      at org.apache.spark.deploy.history.SparkDataCollection.load(SparkDataCollection.scala:310) [com.linkedin.drelephant.dr-elephant-2.1.7.jar:2.1.7]
      at org.apache.spark.deploy.history.SparkFSFetcher$$anonfun$doFetchData$1.apply(SparkFSFetcher.scala:105) [com.linkedin.drelephant.dr-elephant-2.1.7.jar:2.1.7]
      at org.apache.spark.deploy.history.SparkFSFetcher$$anonfun$doFetchData$1.apply(SparkFSFetcher.scala:104) [com.linkedin.drelephant.dr-elephant-2.1.7.jar:2.1.7]
      at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:55) [org.scala-lang.scala-library-2.10.4.jar:na]
  [error] o.a.s.s.ReplayListenerBus - Malformed line #5: {"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"

Hi AbdelrahmanMosly, I am facing an issue while trying to fetch the spark-event logs from the local file system. It gives me the error below:

[info] play - Listening for HTTP on /0:0:0:0:0:0:0:0:9001
Event Log Location URI: /Users/shaikbasha/spark-events
java.io.FileNotFoundException: File /Users/shaikbasha/spark-events does not exist. at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1144) at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1122) at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1067) at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1063) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1063) at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2002) at org.apache.hadoop.fs.FileSystem$5.(FileSystem.java:2129) at org.apache.hadoop.fs.FileSystem.listFiles(FileSystem.java:2127) at com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2.fetchAnalyticsJobsFromEventLogs(AnalyticJobGeneratorHadoop2.java:261) at com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2.fetchAnalyticJobs(AnalyticJobGeneratorHadoop2.java:291) at com.linkedin.drelephant.ElephantRunner$1.run(ElephantRunner.java:190) at com.linkedin.drelephant.ElephantRunner$1.run(ElephantRunner.java:153) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:360) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1918) at com.linkedin.drelephant.security.HadoopSecurity.doAs(HadoopSecurity.java:109) at com.linkedin.drelephant.ElephantRunner.run(ElephantRunner.java:153) at com.linkedin.drelephant.DrElephant.run(DrElephant.java:67) at java.lang.Thread.run(Thread.java:750)

When I run Dr. Elephant with HDFS it works fine and the event-log data appears in the UI, but I want to fetch the spark-events from the local file system (or any directory, not necessarily HDFS, e.g. gs://). Is that possible? Can you please help me with this? It is a bit urgent, so please look into it as soon as possible.

@Javid-Shaik

  1. File Existence: Verify that the directory /Users/shaikbasha/spark-events exists and contains the Spark event logs.
  2. Fetcher Configuration: Update the fetcher configuration in Dr. Elephant to correctly point to the local file system path. This involves editing the configuration files; for example, in app-conf/FetcherConf.xml, ensure the event_log_location_uri is set to the correct local path.
  3. Permissions: Ensure that the user running Dr. Elephant has the necessary permissions to read the files in the specified directory.
  4. Configuration Files: Make sure all other necessary configurations are correctly set up as per the Dr. Elephant setup instructions.
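To illustrate item 2, the fetcher entry in app-conf/FetcherConf.xml would look roughly like this. This is a sketch only: the exact parameter layout should be checked against the sample FetcherConf.xml shipped with your Dr. Elephant version, and the path is the one used in this thread.

```xml
<fetcher>
  <applicationtype>spark</applicationtype>
  <classname>com.linkedin.drelephant.spark.fetchers.FSFetcher</classname>
  <params>
    <event_log_location_uri>/Users/shaikbasha/spark-events</event_log_location_uri>
  </params>
</fetcher>
```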

@AbdelrahmanMosly The directory exists and the spark-event logs are present in it:

shaikbasha@C02G144RMD6M dr-elephant-2.1.7 % ls /Users/shaikbasha/spark-events | tail -n 10
spark-034778d6e9844b97b5fc4217197e0d91
spark-19ce088f2e7b4443a09b32ee1082e546
spark-46a7d8db9504453a816a6d1a98884709
spark-4a5a8432e5c7452e8638de54c8db1297
spark-6befa09607c249e2aa0fc5d2e650f814
spark-a823dfda6b6d4d7481a2f3065de0201e
spark-dcf713ca380a41ffbfc578e379c50f59

My FetcherConf.xml entry points the spark fetcher (com.linkedin.drelephant.spark.fetchers.FSFetcher) at /Users/shaikbasha/spark-events.

Permissions are provided:

shaikbasha@C02G144RMD6M dr-elephant-2.1.7 % ls -l /Users/shaikbasha/spark-events | tail -n 5
-rw-r--r-- 1 shaikbasha staff 51674335 Jun 21 10:34 spark-46a7d8db9504453a816a6d1a98884709
-rw-r--r-- 1 shaikbasha staff 87653 Jun 20 18:02 spark-4a5a8432e5c7452e8638de54c8db1297

And I have configured the Spark and Hadoop configuration files correctly. Please help me with this. As I have already mentioned, Dr. Elephant works fine with HDFS; I want it to work with the local FS. The error below is from the dr_elephant.log file:

06-24-2024 20:28:10 INFO [Thread-8] com.linkedin.drelephant.analysis.AnalyticJobGeneratorHadoop2 : Event log directory file:///Users/shaikbasha/spark-events
06-24-2024 20:28:10 ERROR [Thread-8] com.linkedin.drelephant.ElephantRunner : Error fetching job list. Try again later... java.lang.IllegalArgumentException: Wrong FS: file:/Users/shaikbasha/spark-events, expected: hdfs://localhost:8020 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:779)

@Javid-Shaik First, I recommend you check this commit: eb6092b. Based on the error message Wrong FS: file:/Users/shaikbasha/spark-events, expected: hdfs://localhost:8020, it seems that Dr. Elephant is configured to expect HDFS by default, but you're trying to fetch logs from the local file system. This discrepancy causes the error.

  1. Check core-site.xml Configuration:

    • Ensure that the Hadoop configuration (core-site.xml) is set up to handle local file system paths.
    • You might need to specify fs.defaultFS as file:/// for the local file system. Here’s an example configuration for core-site.xml:
      <configuration>
        <property>
          <name>fs.defaultFS</name>
          <value>file:///</value>
        </property>
      </configuration>
  2. Check for Hardcoded HDFS References:

    • Review the Dr. Elephant source code or configurations for any hardcoded references to HDFS. Ensure that these are flexible enough to support the local file system.
  3. Ensure Correct FileSystem Class:

    • Ensure the correct FileSystem implementation is being used. For local file system, LocalFileSystem should be used instead of HDFS. You may need to set this explicitly in your configuration.
  4. Restart Dr. Elephant:

    • After making these changes, restart Dr. Elephant to apply the new configurations.

Thank you @AbdelrahmanMosly. After changing fs.defaultFS to file:/// in core-site.xml I was able to get the data into the Dr. Elephant UI. I also observed that only the Spark history server needs to be started; there is no need to start the mr-jobhistory-server. But now I am getting this error in dr.log:

Event Log Location URI: /Users/shaikbasha/spark-events
[error] o.a.s.s.ReplayListenerBus - Exception parsing Spark event log: file:/Users/shaikbasha/spark-events/spark-034778d6e9844b97b5fc4217197e0d91 org.json4s.package$MappingException: Did not find value which can be converted into boolean at org.json4s.reflect.package$.fail(package.scala:96) ~[org.json4s.json4s-core_2.10-3.2.10.jar:3.2.10]
[error] o.a.s.s.ReplayListenerBus - Malformed line #9: {"Event":"SparkListenerJobStart"...... }

Can you please help me resolve this error?

@AbdelrahmanMosly I have done everything as you told me, but now I am getting this new error along with the previous one:

[error] o.a.s.s.ReplayListenerBus - Exception parsing Spark event log: file:/Users/shaikbasha/spark-events/spark-46a7d8db9504453a816a6d1a98884709 scala.MatchError: org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart (of class java.lang.String) at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:466) ~[org.apache.spark.spark-core_2.10-1.4.0.jar:1.4.0] at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) ~[org.apache.spark.spark-core_2.10-1.4.0.jar:1.4.0] at org.apache.spark.deploy.history.SparkDataCollection.load(SparkDataCollection.scala:310) [com.linkedin.drelephant.dr-elephant-2.1.7.jar:2.1.7] at org.apache.spark.deploy.history.SparkFSFetcher$$anonfun$doFetchData$1.apply(SparkFSFetcher.scala:105) [com.linkedin.drelephant.dr-elephant-2.1.7.jar:2.1.7] at org.apache.spark.deploy.history.SparkFSFetcher$$anonfun$doFetchData$1.apply(SparkFSFetcher.scala:104) [com.linkedin.drelephant.dr-elephant-2.1.7.jar:2.1.7] at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:55) [org.scala-lang.scala-library-2.10.4.jar:na]
[error] o.a.s.s.ReplayListenerBus - Malformed line #5: {"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart" ... }

(Two screenshots of the Dr. Elephant UI, taken 2024-06-25 at 11:26 AM.)

And the UI, as shown in the screenshots above, displays wrong data. Please help me with this.

AbdelrahmanMosly commented 5 months ago

@Javid-Shaik

Make sure the metrics you need are present in the Spark event log.

I am confused about which Spark version you are using: from the error ([org.apache.spark.spark-core_2.10-1.4.0.jar:1.4.0]) it looks like Spark 1.4, and that should work straightforwardly. As far as I remember, there is no need to touch the replay listeners in that case.

If your whole problem was just reading from the local file system, you only need to change configs; there is no need to change the code.

Javid-Shaik commented 5 months ago

@AbdelrahmanMosly Well, I am using Spark 3.5.1, but I compiled Dr. Elephant with the default Spark version, 1.4.0.

AbdelrahmanMosly commented 5 months ago

@Javid-Shaik

There are discrepancies with event logs due to differences in event types between Spark versions.

Next Steps:

  1. Identify Missing Events:

    • Review the event logs to identify which events are missing or have changed names between Spark 1.4.0 and Spark 3.5.1.
  2. Customize Event Parsing:

    • Update Dr. Elephant’s event parsing logic to handle the new or renamed events in Spark 3.5.1.

Congratulations on getting the basic UI and some events parsed! The next step involves customizing the event parsing to ensure all necessary data is captured from Spark 3.5.1 logs.
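A quick way to carry out the first step is to inventory the distinct "Event" values in a log from each Spark version and diff the two sets. A minimal sketch (standalone tooling, not Dr. Elephant code):

```python
import json

def event_types(lines):
    """Collect the distinct 'Event' values appearing in an event log."""
    types = set()
    for line in lines:
        line = line.strip()
        if line:
            try:
                types.add(json.loads(line).get("Event"))
            except ValueError:
                pass  # ignore malformed lines
    return types

old_log = ['{"Event":"SparkListenerJobStart"}', '{"Event":"SparkListenerStageSubmitted"}']
new_log = ['{"Event":"SparkListenerJobStart"}',
           '{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"}']

# Events present in the new log but unknown to the old one.
print(event_types(new_log) - event_types(old_log))
```

Running this over real logs (reading each file's lines instead of the inline samples) gives the exact list of events the parsing logic must either handle or skip.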

Javid-Shaik commented 5 months ago

@AbdelrahmanMosly First of all, thank you for your prompt assistance and for providing clear directions to solve the problems. I have identified some new events that are not present in Spark 1.4.0, i.e., newly added in Spark 3.5.1.

These are the newly added events in Spark 3.5.1:

1. org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart
2. org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionEnd
3. org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate
4. org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveSQLMetricUpdates
5. org.apache.spark.sql.execution.ui.SparkListenerDriverAccumUpdates
6. SparkListenerJobStart (the structure of this event has changed)

Can you please give me a head start on updating Dr. Elephant's event parsing logic to handle the new or renamed events in Spark 3.5.1? If needed, I will share the event structure.

AbdelrahmanMosly commented 5 months ago

@Javid-Shaik You can check the code present in SparkDataCollection.scala. Additionally, look at the documentation of Spark's replay listener to understand how to catch those listeners.

In the worst-case scenario, you can parse the JSON of the event logs directly.

Javid-Shaik commented 5 months ago

@AbdelrahmanMosly Hey, I found that the following events have changed:

  1. SparkListenerJobStart
  2. SparkListenerStageSubmitted
  3. Maybe some other events

I have observed that the events SparkListenerJobStart and SparkListenerStageSubmitted have undergone changes in their structure between Spark versions 1.4.0 and 3.5.1. For example, fields such as "DeterministicLevel":"DETERMINATE" are present in Spark 3.5.1 but not in Spark 1.4.0, along with several other modifications.

As you suggested, I have looked at the code in SparkDataCollection.scala, but I am not sure what to modify or where.

Could you please assist me in identifying the relevant sections of the code and provide recommendations on how to adjust the parsing logic to handle these discrepancies between the Spark versions?

For reference, compare the same SparkListenerStageSubmitted event from Spark 3.5.1 (first) and Spark 1.4.0 (second):

    {"Event":"SparkListenerStageSubmitted","Stage Info":{"Stage ID":0,"Stage Attempt ID":0,"Stage Name":"reduce at SparkPi.scala:38","Number of Tasks":2,"RDD Info":[{"RDD ID":1,"Name":"MapPartitionsRDD","Scope":"{\"id\":\"1\",\"name\":\"map\"}","Callsite":"map at SparkPi.scala:34","Parent IDs":[0],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use Off Heap":false,"Deserialized":false,"Replication":1},"Barrier":false,"DeterministicLevel":"DETERMINATE","Number of Partitions":2,"Number of Cached Partitions":0,"Memory Size":0,"Disk Size":0},{"RDD ID":0,"Name":"ParallelCollectionRDD","Scope":"{\"id\":\"0\",\"name\":\"parallelize\"}","Callsite":"parallelize at SparkPi.scala:34","Parent IDs":[],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use Off Heap":false,"Deserialized":false,"Replication":1},"Barrier":false,"DeterministicLevel":"DETERMINATE","Number of Partitions":2,"Number of Cached Partitions":0,"Memory Size":0,"Disk Size":0}],"Parent IDs":[],"Details":"some details","Submission Time":1715204029859,"Accumulables":[],"Resource Profile Id":0,"Shuffle Push Enabled":false,"Shuffle Push Mergers Count":0},"Properties":{"spark.rdd.scope":"{\"id\":\"2\",\"name\":\"reduce\"}","resource.executor.cores":"1","spark.rdd.scope.noOverride":"true"}}

    {"Event":"SparkListenerStageSubmitted","Stage Info":{"Stage ID":0,"Stage Attempt ID":0,"Stage Name":"reduce at pi.py:39","Number of Tasks":10,"RDD Info":[{"RDD ID":1,"Name":"PythonRDD","Parent IDs":[0],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":10,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0},{"RDD ID":0,"Name":"ParallelCollectionRDD","Scope":"{\"id\":\"0\",\"name\":\"parallelize\"}","Parent IDs":[],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":10,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0}],"Parent IDs":[],"Details":"","Submission Time":1458126390256,"Accumulables":[]},"Properties":{"spark.rdd.scope.noOverride":"true","spark.rdd.scope":"{\"id\":\"1\",\"name\":\"collect\"}","callSite.short":"reduce at pi.py:39"}}

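One low-risk way to handle the structural drift shown above is to normalize a Spark 3.5.1 event down to the shape the Spark 1.4.0 parser expects by dropping fields the old JsonProtocol never defined. A sketch of the idea (illustrative only; the field lists are assumptions drawn from the two sample events above, not an exhaustive mapping):

```python
# RDD Info fields in the 3.5.1 StageSubmitted event that 1.4.0's parser
# does not define (assumed from the sample events above; extend as needed).
RDD_FIELDS_ADDED_IN_3X = {"Barrier", "DeterministicLevel", "Callsite"}

def normalize_stage_submitted(event):
    """Strip 3.5.1-only fields so a 1.4.0-era parser can consume the event."""
    info = event.get("Stage Info", {})
    # Stage-level fields that did not exist in 1.4.0.
    info.pop("Resource Profile Id", None)
    info.pop("Shuffle Push Enabled", None)
    info.pop("Shuffle Push Mergers Count", None)
    for rdd in info.get("RDD Info", []):
        for field in RDD_FIELDS_ADDED_IN_3X:
            rdd.pop(field, None)
    return event

event = {
    "Event": "SparkListenerStageSubmitted",
    "Stage Info": {
        "Stage ID": 0,
        "Resource Profile Id": 0,
        "RDD Info": [{"RDD ID": 1, "Barrier": False, "DeterministicLevel": "DETERMINATE"}],
    },
}
print(normalize_stage_submitted(event))
```

The same pattern (pop unknown keys, supply defaults for removed ones such as "ExternalBlockStore Size") can be applied per event type before the JSON reaches the replay code.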
AbdelrahmanMosly commented 5 months ago

@Javid-Shaik

  1. Identify New Fields and Events:

    • Determine new fields and events added in Spark 3.5.1 compared to Spark 1.4.0.
  2. Locate Event Parsing Logic:

    • Focus on the load method in SparkDataCollection.scala where ReplayListenerBus is used.
  3. Modify Event Listeners:

    • Update existing listeners to handle new fields and events.
    • Ensure the listeners properly parse and store the new event data.
  4. Add Handlers for New Events:

    • Create or update methods to handle new events
  5. Integrate Changes:

    • Ensure the new parsing logic is integrated into the load method and other relevant parts of the code.
Javid-Shaik commented 4 months ago

Hi @AbdelrahmanMosly, I have identified the new events and fields that were added in Spark 3.5.1 and removed those events. Now the UI is showing the correct data, but I am getting this error:

07-10-2024 11:37:43 ERROR [dr-el-executor-thread-0] com.linkedin.drelephant.ElephantRunner : Failed to analyze SPARK spark-05981aeb46fb4816b20a62ae2fdf6041 javax.persistence.PersistenceException: ERROR executing DML bindLog[] error [Duplicate entry 'spark-05981aeb46fb4816b20a62ae2fdf6041' for key 'yarn_app_result.PRIMARY'] at com.avaje.ebeaninternal.server.persist.dml.DmlBeanPersister.execute(DmlBeanPersister.java:97) at com.avaje.ebeaninternal.server.persist.dml.DmlBeanPersister.insert(DmlBeanPersister.java:57) at com.avaje.ebeaninternal.server.persist.DefaultPersistExecute.executeInsertBean(DefaultPersistExecute.java:66) at com.avaje.ebeaninternal.server.core.PersistRequestBean.executeNow(PersistRequestBean.java:448) at com.avaje.ebeaninternal.server.core.PersistRequestBean.executeOrQueue(PersistRequestBean.java:478) at com.avaje.ebeaninternal.server.persist.DefaultPersister.insert(DefaultPersister.java:335) at com.avaje.ebeaninternal.server.persist.DefaultPersister.saveEnhanced(DefaultPersister.java:310) at com.avaje.ebeaninternal.server.persist.DefaultPersister.saveRecurse(DefaultPersister.java:280) at com.avaje.ebeaninternal.server.persist.DefaultPersister.save(DefaultPersister.java:248) at com.avaje.ebeaninternal.server.core.DefaultServer.save(DefaultServer.java:1568) at com.avaje.ebeaninternal.server.core.DefaultServer.save(DefaultServer.java:1558) at com.avaje.ebean.Ebean.save(Ebean.java:453) at play.db.ebean.Model.save(Model.java:91) at com.linkedin.drelephant.ElephantRunner$ExecutorJob$1.run(ElephantRunner.java:399) at com.avaje.ebeaninternal.server.core.DefaultServer.execute(DefaultServer.java:699) at com.avaje.ebeaninternal.server.core.DefaultServer.execute(DefaultServer.java:693) at com.avaje.ebean.Ebean.execute(Ebean.java:1207) at com.linkedin.drelephant.ElephantRunner$ExecutorJob.run(ElephantRunner.java:397) at com.linkedin.drelephant.priorityexecutor.RunnableWithPriority$1.run(RunnableWithPriority.java:36) at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.sql.SQLIntegrityConstraintViolationException: Duplicate entry 'spark-05981aeb46fb4816b20a62ae2fdf6041' for key 'yarn_app_result.PRIMARY' at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:118) at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122) at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:912) at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1054) at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1003) at com.mysql.cj.jdbc.ClientPreparedStatement.executeLargeUpdate(ClientPreparedStatement.java:1312) at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdate(ClientPreparedStatement.java:988) at com.jolbox.bonecp.PreparedStatementHandle.executeUpdate(PreparedStatementHandle.java:205) at com.avaje.ebeaninternal.server.type.DataBind.executeUpdate(DataBind.java:55) at com.avaje.ebeaninternal.server.persist.dml.InsertHandler.execute(InsertHandler.java:134) at com.avaje.ebeaninternal.server.persist.dml.DmlBeanPersister.execute(DmlBeanPersister.java:86) ... 23 more Please help me in resolving this error.

AbdelrahmanMosly commented 4 months ago

@Javid-Shaik I don't remember encountering this error, but as the message indicates, you can simply check for duplicate entries.

Javid-Shaik commented 4 months ago

@AbdelrahmanMosly It occurs when I restart the Dr. Elephant server. The first run shows no error; the error appears only after a restart.

And can you please also tell me how to get the Job Execution URL, Flow Execution URL, Job Definition URL, etc.? Currently I am getting only the Spark history server URL of the Spark job on the Dr. Elephant UI, but not the rest of the URLs.

Also, is it possible to analyze streaming jobs?

Currently, Dr. Elephant analyzes batch jobs, i.e., the event logs of already-completed applications. If streaming analysis is possible, please give me a lead on how to do it.

AbdelrahmanMosly commented 4 months ago

@Javid-Shaik For the duplicate entry error, it’s likely due to something in the Dr. Elephant database, such as a job being recorded twice. You might need to identify and delete the duplicate entries in the database to resolve this issue.
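The restart symptom suggests Dr. Elephant re-analyzes the same local event logs on startup and tries to re-insert rows whose primary key (the Spark app ID) already exists in yarn_app_result. Conceptually, the fix is to make the save idempotent: check for the key first, or upsert. A minimal sketch of the idea using SQLite in place of Dr. Elephant's MySQL/Ebean layer (illustrative only; the table and column names echo the error message and are otherwise assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "score" is a hypothetical column standing in for the real analysis result.
conn.execute("CREATE TABLE yarn_app_result (id TEXT PRIMARY KEY, score INTEGER)")

def save_result(conn, app_id, score):
    """Idempotent save: re-analyzing the same app replaces the row
    instead of raising a duplicate-primary-key error."""
    conn.execute(
        "INSERT OR REPLACE INTO yarn_app_result (id, score) VALUES (?, ?)",
        (app_id, score),
    )

save_result(conn, "spark-05981aeb46fb4816b20a62ae2fdf6041", 10)
save_result(conn, "spark-05981aeb46fb4816b20a62ae2fdf6041", 20)  # simulated restart
print(conn.execute("SELECT score FROM yarn_app_result").fetchall())
```

In MySQL the equivalent would be INSERT ... ON DUPLICATE KEY UPDATE, or, on the application side, checking whether the app ID was already analyzed before saving.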

Regarding the URLs, you need to ensure that your configuration includes the scheduler URLs to integrate properly. Here's an example of the configuration you should add:

azkaban.execution.url=<your-azkaban-execution-url>
oozie.base.url=<your-oozie-base-url>
airflow.base.url=<your-airflow-base-url>

These configurations are necessary because Spark event logs alone are not sufficient for this task.

Javid-Shaik commented 4 months ago

@AbdelrahmanMosly Does this mean that the Spark jobs need to be submitted via a scheduler?

Also, please tell me whether it is possible to analyze streaming jobs.

AbdelrahmanMosly commented 4 months ago

@Javid-Shaik

I haven't personally gone down this path as it wasn’t required in my case, so I don't have direct experience with these methods. However, these suggestions should help you get started.

Javid-Shaik commented 4 months ago

@AbdelrahmanMosly Thank you, AbdelRahman, for your invaluable guidance. I truly appreciate your help.

AbdelrahmanMosly commented 4 months ago

@Javid-Shaik You're welcome! Good luck with your work on Dr. Elephant.