linkedin / dr-elephant

Dr. Elephant is a job and flow-level performance monitoring and tuning tool for Apache Hadoop and Apache Spark

SparkFetcher should delay collecting uncompleted Spark apps #227

Open superbobry opened 7 years ago

superbobry commented 7 years ago

I've been debugging Dr. Elephant on Spark 2.1 and noticed that some of the apps are never aggregated. Roughly it goes as follows:

  1. AnalyticJobGeneratorHadoop2 queries the RM for a list of finished jobs/apps.
  2. SparkFetcher gets all the new Spark apps. Some of them are truly finished; others (for reasons I don't yet completely understand) have completed: false in their list of attempts (see the probe sketch right after this list).
  3. The fetcher tries to resolve them and gets a 500 Internal Server Error during getLogData (see #225). The app gets added to the retry queue and, if we're really unlucky, gets dropped after three attempts.
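
For reference, a quick way to look at that flag is to query the history server's monitoring REST endpoint directly; the host and port below are placeholders for our setup, and the app id is the one from the trace further down:

```scala
import scala.io.Source

// Quick probe: dump the application info (including each attempt's
// "completed" flag) for a single app. Host/port are placeholders.
object CompletedFlagProbe {
  def main(args: Array[String]): Unit = {
    val historyServer = "http://localhost:18080"
    val appId = "application_1490183156364_349617"
    val body = Source.fromURL(s"$historyServer/api/v1/applications/$appId").mkString
    println(body) // look for "completed" : false inside the attempts array
  }
}
```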

The trace on the History Server side is

17/03/23 22:45:31 INFO history.ApplicationCache: Failed to load application attempt application_1490183156364_349617/Some(1)
17/03/23 22:45:31 WARN servlet.ServletHandler:
javax.servlet.ServletException: org.glassfish.jersey.server.ContainerException: java.io.FileNotFoundException: File does not exist: /var/log/spark2/application_1490183156364_349617_1.lz4.inprogress
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
...

which suggests this could be the case for the WebHDFS-based SparkLogClient as well (although the HTTP code would probably be 404).

The problem, as far as I understand it, is that only the apps with completed: true have logs, and SparkFetcher ignores that fact. A proper fix would be to delay analysis of the app until it has completed, which could be done by not incrementing the counter in AnalyticJob.retry.
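
A minimal sketch of what I have in mind, with hypothetical stand-in types (this is not Dr. Elephant's actual REST client API; AppInfo/AppAttempt just mirror what the history server returns):

```scala
object CompletedGate {
  // Hypothetical stand-ins for the data the REST client returns.
  case class AppAttempt(attemptId: Option[String], completed: Boolean)
  case class AppInfo(id: String, attempts: Seq[AppAttempt])

  class UncompletedAppException(appId: String)
    extends Exception(s"$appId has no completed attempt yet")

  // Only apps with at least one completed attempt have a finished event log,
  // so fetching log data for the others can only fail on the history server side.
  def gate(info: AppInfo): AppInfo =
    if (info.attempts.exists(_.completed)) info
    else throw new UncompletedAppException(info.id) // caller re-queues without counting a retry
}
```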

I am curious whether anybody has had similar issues with Spark 2.1 or earlier versions. The completed: true contract is undocumented and could be plain wrong; do you think we should implement it in SparkFetcher anyway?

shkhrgpt commented 7 years ago

@superbobry I think it's an issue that should be fixed in SparkFetcher. On the other hand, I do have a question about the source of the problem. I suspect this problem is not triggered when getLogData is called; I think it happens when the fetcher tries to get other Spark data using REST calls, which comes before getLogData. The reason for my suspicion is the following message:

17/03/23 22:45:31 INFO history.ApplicationCache: Failed to load application attempt application_1490183156364_349617/Some(1)

This log message is printed by the loadApplicationEntry method in ApplicationCache. loadApplicationEntry is called when the history server cannot find a SparkUI instance for an application in its cache; it builds the SparkUI instance by reading the event log. I think the history server is not able to find the event log when it tries to read it.
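
A conceptual sketch of that cache-miss path, just to make the failure mode concrete (this is not the real ApplicationCache code):

```scala
object CacheMissSketch {
  type SparkUiStub = String

  // On a cache miss the history server rebuilds the SparkUI by replaying the
  // event log; if the log file cannot be read, the load fails and the error
  // surfaces to the REST client (the 500 above).
  def getOrLoad(cache: scala.collection.mutable.Map[String, SparkUiStub],
                appAndAttempt: String,
                replayEventLog: String => Option[SparkUiStub]): Either[String, SparkUiStub] =
    cache.get(appAndAttempt) match {
      case Some(ui) => Right(ui)
      case None =>
        replayEventLog(appAndAttempt) match {
          case Some(ui) => cache.update(appAndAttempt, ui); Right(ui)
          case None     => Left(s"Failed to load application attempt $appAndAttempt")
        }
    }
}
```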

superbobry commented 7 years ago

Thanks for the reference. I will do more investigation on this and update the issue accordingly.

All I know now is that not fetching apps with completed:false solves the 500 issue.

shkhrgpt commented 7 years ago

I also found the reason why certain applications are classified as incomplete. Check the following code in FsHistoryProvider:

/**
 * Return true when the application has completed.
 */
private def isApplicationCompleted(entry: FileStatus): Boolean = {
  !entry.getPath().getName().endsWith(EventLoggingListener.IN_PROGRESS)
}
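
For illustration, the same check applied to the two forms the event log path can take (the .inprogress name is the one from the trace above; the renamed form is what it should become once the application completes):

```scala
// Illustration only: EventLoggingListener.IN_PROGRESS is the ".inprogress" suffix.
object SuffixCheckDemo extends App {
  val paths = Seq(
    "/var/log/spark2/application_1490183156364_349617_1.lz4.inprogress", // still being written
    "/var/log/spark2/application_1490183156364_349617_1.lz4"             // renamed on completion
  )
  paths.foreach(p => println(s"$p -> completed = ${!p.endsWith(".inprogress")}"))
}
```
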
shkhrgpt commented 7 years ago

From the code I referenced in my earlier comment, it seems that an application is considered incomplete if the event log file name ends with .inprogress, which is also the case you reported. However, the history server could not find the event log file with the .inprogress suffix even though completed: false. That is strange. Is it possible that the application was still incomplete when the history server first fetched ApplicationInfo, but had completed by the time the history server tried to read the .inprogress event log file?

shkhrgpt commented 7 years ago

The history server polls the log directory to update the log path for each application, and the polling interval is 10 seconds. I think that during that 10-second window the application completed and the .inprogress suffix was removed from the log path, and that's why the file wasn't found.

I agree with you. The best solution for this issue is to first fetch only ApplicationInfo from the history server; if completed: true, then fetch the rest of the data, otherwise keep the app unanalyzed and analyze it later, roughly as sketched below.
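
The sketch, with made-up names (JobStub and FetchOutcome are not Dr. Elephant's real classes; the real AnalyticJob tracks its own retry count):

```scala
object DeferralSketch {
  case class JobStub(appId: String, retries: Int = 0)

  sealed trait FetchOutcome
  case object Fetched         extends FetchOutcome
  case object NotYetCompleted extends FetchOutcome // completed: false on every attempt
  case object FetchFailed     extends FetchOutcome // a genuine error while fetching data

  // Deferring an uncompleted app does not consume a retry; only genuine
  // failures count toward maxRetries.
  def reschedule(job: JobStub, outcome: FetchOutcome, maxRetries: Int = 3): Option[JobStub] =
    outcome match {
      case Fetched                                 => None
      case NotYetCompleted                         => Some(job) // analyze once completed
      case FetchFailed if job.retries < maxRetries => Some(job.copy(retries = job.retries + 1))
      case FetchFailed                             => None // dropped after maxRetries
    }
}
```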

@superbobry Are you already working on the fix? Otherwise I can work on it.

superbobry commented 7 years ago

Thank you, this all sounds reasonable. Do you think we should forward this to the Spark devs? The 500 clearly looks like a bug to me.

I have the fix for SparkFetcher ready, but I think it'd be better to submit it after #225 is merged, because the issue only affects the REST-based log fetcher.

shkhrgpt commented 7 years ago

@superbobry How often did it happen that after three retries you still saw the same error? I am asking because it's a little dangerous to always put an application back in the retry list based on completed: false. It may happen that, for some reason (a bug in the history server), an application stays incomplete forever. In that case, the fetcher would keep that application in the retry list forever.

Moreover, if our theory is right and this issue is caused by a race condition where the fetcher thinks an application is incomplete but it has actually completed, then that race condition should be resolved on the next retry attempt. What do you think?

I am also curious: what is the value of spark.history.fs.update.interval for your Spark history server? This config property is the polling interval of the Spark history server, which is 10 seconds by default.

superbobry commented 7 years ago

I've tried setting maxRetries to a bigger value (6, 9) and I think the fetcher still never managed to fetch the application. I will do more experiments to validate that, because if it's true, then the cause is not the race condition.

> I am also curious what's the value of spark.history.fs.update.interval for your spark history server?

We are using the default value, so it must be 10 secs.

shkhrgpt commented 7 years ago

That's true. If you still see the error after maxRetries, then it doesn't seem to be the race condition I was talking about.