superbobry opened this issue 7 years ago
@superbobry I think it's an issue that should be fixed in SparkFetcher. On the other hand, I do have a question about the source of the problem. I suspect that this problem is not triggered when getLogData is called; I think it happens when the fetcher tries to get other Spark data via REST calls, which happens before getLogData. The reason for my suspicion is the following message:
17/03/23 22:45:31 INFO history.ApplicationCache: Failed to load application attempt application_1490183156364_349617/Some(1)
This log message is printed by the loadApplicationEntry method in ApplicationCache. That method is called when the history server cannot find a SparkUI instance for an application in its cache; it then creates the SparkUI instance by reading the event log. I think the history server is not able to find the event log when it tries to read it.
Thanks for the reference. I will investigate this further and update the issue accordingly.
All I know for now is that not fetching apps with completed: false avoids the 500 error.
Also found the reason why certain applications are classified as incomplete. Check the following code in FsHistoryProvider:
/**
* Return true when the application has completed.
*/
private def isApplicationCompleted(entry: FileStatus): Boolean = {
!entry.getPath().getName().endsWith(EventLoggingListener.IN_PROGRESS)
}
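Just to illustrate the check above (in a Scala REPL, with made-up event-log paths; EventLoggingListener.IN_PROGRESS is the ".inprogress" suffix):

import org.apache.hadoop.fs.Path

// Hypothetical event-log paths, named after the application from the log message above.
val inProgress = new Path("/spark-history/application_1490183156364_349617_1.inprogress")
val finished   = new Path("/spark-history/application_1490183156364_349617_1")

inProgress.getName.endsWith(".inprogress")  // true  -> classified as incomplete (completed: false)
finished.getName.endsWith(".inprogress")    // false -> classified as completed (completed: true)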
From the code I referred to in my earlier comment, it seems that an application is classified as incomplete if its event log file name ends with the .inprogress suffix, which is also the case you reported. However, the history server could not find the event log file with the .inprogress suffix even though the application was reported with completed: false. That is strange.
Is it possible that the application was still incomplete when the history server first tried to get its ApplicationInfo, and by the time the history server tried to read the .inprogress event log file, the application had completed?
The history server polls the log directory to update the log path for each application, and the polling interval is 10 seconds. It might have happened that during that 10-second interval the application completed and the .inprogress suffix was removed from the log path, which is why the file wasn't found.
I agree with you. The best solution for this issue is to first get only the ApplicationInfo from the history server; if it reports completed: true, then fetch the rest of the data, otherwise leave the application unanalyzed and analyze it later.
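Something like the following minimal sketch (types and method names are made up here; the real SparkFetcher/SparkRestClient interfaces in Dr. Elephant differ):

case class ApplicationAttemptInfo(attemptId: Option[String], completed: Boolean)
case class ApplicationInfo(id: String, attempts: Seq[ApplicationAttemptInfo])

trait SparkApplicationData  // placeholder for whatever the fetcher actually returns

trait HistoryServerClient {
  def fetchApplicationInfo(appId: String): ApplicationInfo
  def fetchRestDataAndLogs(appId: String): SparkApplicationData
}

object CompletedFirstFetch {
  def fetchData(client: HistoryServerClient, appId: String): Option[SparkApplicationData] = {
    val info = client.fetchApplicationInfo(appId)
    // Pull the heavy REST data and the event log only once every attempt is
    // marked completed; otherwise return None so the job can be analyzed later.
    if (info.attempts.nonEmpty && info.attempts.forall(_.completed))
      Some(client.fetchRestDataAndLogs(appId))
    else
      None
  }
}

Returning None here would map to keeping the job unanalyzed and picking it up again on a later polling round.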
@superbobry Are you already working on the fix? Otherwise I can work on it.
Thank you, all of this sounds reasonable. Do you think we should forward this to the Spark devs? The 500 clearly looks like a bug to me.
I have the fix for SparkFetcher ready, but I think it'd be better to submit it after #225 is merged, because the issue only affects the REST-based log fetcher.
@superbobry How often did it happen that after three retries you still saw the same error? I am asking because it is a little dangerous to always put an application in the retry list based on completed: false. It may happen that, for some reason (a bug in the history server), an application is always reported as incomplete; in that case, the fetcher would keep that application in the retry list forever.
Moreover, if our theory is right that this issue is caused by a race condition where the fetcher thinks an application is incomplete while it has actually completed, then that race condition should resolve itself on the next retry attempt. What do you think?
I am also curious: what is the value of spark.history.fs.update.interval for your Spark history server? This config property is the polling interval of the Spark history server, which is 10 seconds by default.
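For reference, this is where I would expect the setting to live (a spark-defaults.conf snippet on the history server host; 10s is the documented default):

# conf/spark-defaults.conf on the history server host
spark.history.fs.update.interval  10s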
I've tried setting maxRetries to a bigger value (6, 9) and I think the fetcher still never managed to fetch the application. I will do more experiments to validate that, because if it's true, then the cause is not the race condition.
> I am also curious: what is the value of spark.history.fs.update.interval for your Spark history server?
We are using the default value, so it must be 10 secs.
That's true. If you still see the error after maxRetries, then it doesn't seem to be the race condition I was talking about.
I've been debugging Dr. Elephant on Spark 2.1 and noticed that some of the apps are never aggregated. Roughly it goes as follows:

1. AnalyticJobGeneratorHadoop2 queries the RM for a list of finished jobs/apps.
2. SparkFetcher gets all the new Spark apps. Some of them are truly finished, others (for reasons I don't yet completely understand) have completed: false in their list of attempts.
3. For the latter, getLogData fails with an HTTP 500 (see #225). The app gets added to the retry queue and, if we're really unlucky, gets dropped after three attempts.

The trace on the History Server side suggests this could be the case for the WebHDFS-based SparkLogClient as well (although the HTTP code would probably be 404).

The problem, as far as I understand it, is that only the apps with completed: true have logs, and SparkFetcher ignores that fact. A proper fix for this would be to delay the analysis of the app until it has completed, which could be done by not incrementing the counter in AnalyticJob.retry.

I am curious whether somebody has had similar issues with Spark 2.1 or earlier versions? The completed: true contract is undocumented and could be plain wrong; do you think we should implement it in SparkFetcher anyway?
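To make that last point concrete, here is a rough sketch with invented names (Dr. Elephant's actual AnalyticJob API is different); the idea is only that "not completed yet" should re-queue the job without consuming a retry:

object RetryPolicySketch {
  // Invented class for illustration; the real AnalyticJob has a different shape.
  class AnalyticJob(val appId: String, maxRetries: Int = 3) {
    private var failures = 0

    def retryable: Boolean = failures < maxRetries

    // Genuine fetch failure (network error, parse error, ...): count it.
    def markFailed(): Unit = failures += 1

    // Application simply not completed yet: re-queue without touching the
    // failure counter, so the job is never dropped just for finishing slowly.
    def markNotYetCompleted(): Unit = ()
  }
}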