After some investigation, maybe setExecutorCount
in resource-managers\nomad\src\main\scala\org\apache\spark\scheduler\cluster\nomad\SparkNomadJobController.scala
should be rewritten like this:
def setExecutorCount(count: Int): Unit = {
  jobManipulator.updateJob(startIfNotYetRunning = count > 0) { job =>
    // Only ever grow the executor task group's count; never shrink it below what is already set.
    val executorGroup = SparkNomadJob.find(job, ExecutorTaskGroup).get
    if (executorGroup.getCount < count) {
      executorGroup.setCount(count)
    }
  }
}
Thanks for looking into this. This is fixed in the *-20180722 releases.
After some further investigation, we found that setExecutorCount
still has an error, because there is no way to set the group count back to 0. So the real patch would be as follows:
def setExecutorCount(count: Int): Unit = {
  jobManipulator.updateJob(startIfNotYetRunning = count > 0) { job =>
    val executorGroup = SparkNomadJob.find(job, ExecutorTaskGroup).get
    if (count > 0 && executorGroup.getCount < count) {
      // Grow the executor task group when more executors are requested than it currently has.
      executorGroup.setCount(count)
    } else if (count == 0) {
      // Allow shrinking, but only all the way down to zero.
      executorGroup.setCount(0)
    }
  }
}
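
For illustration, here is a minimal, runnable sketch of that grow-or-drop-to-zero behaviour. The TaskGroupStub type below is a hypothetical stand-in for the Nomad task group API class, just so the logic can be exercised on its own:

// Hypothetical stand-in for the Nomad task-group type; only getCount/setCount matter here.
class TaskGroupStub(private var count: Int) {
  def getCount: Int = count
  def setCount(c: Int): Unit = { count = c }
}

object SetExecutorCountDemo extends App {
  val executorGroup = new TaskGroupStub(3)

  // Mirrors the patched logic: grow when more executors are requested than we have,
  // and allow shrinking only all the way down to zero.
  def setExecutorCount(count: Int): Unit = {
    if (count > 0 && executorGroup.getCount < count) {
      executorGroup.setCount(count)
    } else if (count == 0) {
      executorGroup.setCount(0)
    }
  }

  setExecutorCount(5)
  println(executorGroup.getCount) // 5: grown
  setExecutorCount(2)
  println(executorGroup.getCount) // still 5: an intermediate shrink request is ignored
  setExecutorCount(0)
  println(executorGroup.getCount) // 0: a full shutdown is allowed
}

Only growing the count or dropping it straight to zero matches the limitation noted in the original report below: Nomad can't shrink a job by simply removing specific allocations by id.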
I see you raised #17 for that. Thanks again. Fix coming.
When we use Spark conf options in
spark-submit
that enable dynamic allocation, then near the end of a job we see many
ERROR TaskSchedulerImpl: Lost executor
errors. As a result Spark begins losing tasks:
WARN TaskSetManager: Lost task 371.3 in stage
and sometimes the Spark job fails because a task is lost after 4 attempts (the Spark default). As I understand it, this happens because near the end of the job Nomad (through Spark) starts shrinking the pool of Spark executors, while Spark at that moment is still placing tasks on those executors; as a result, a task can sometimes exhaust all of its attempts because Spark keeps trying to place it on executors that have already been killed.
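
The exact conf options aren't shown above; for context, here is a minimal sketch of the standard Spark dynamic-allocation properties involved (the values are assumptions, not necessarily what we used):

import org.apache.spark.SparkConf

object DynamicAllocationConfSketch extends App {
  // Illustrative only: standard Spark dynamic-allocation settings; the values are assumed.
  val conf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true") // an external shuffle service is typically required for dynamic allocation
    .set("spark.dynamicAllocation.minExecutors", "1")
    .set("spark.dynamicAllocation.maxExecutors", "20")
}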
For now, our workaround is to increase
spark.task.maxFailures
to 20, or to not use dynamic resource allocation at all, since Nomad can't shrink a job by simply removing allocations by id.
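
In code form the workaround is just the standard spark.task.maxFailures setting (a sketch; equivalent to passing it as a --conf on spark-submit):

import org.apache.spark.SparkConf

object MaxFailuresWorkaround extends App {
  // Raise the task-failure tolerance from Spark's default of 4, so that tasks
  // retried on already-killed executors do not exhaust all of their attempts.
  val conf = new SparkConf()
    .set("spark.task.maxFailures", "20")
}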