guangie88 opened this issue 5 years ago
Thanks for the details, @guangie88, I will look into this.
@guangie88 When we use dynamic allocation we never see the log messages that you posted. Spark on Nomad only increases or decreases the executor count in the method def setExecutorCount(count: Int): Unit. Can you clarify how you got those messages? The only reason that comes to mind is that you manually killed an executor in the Spark web UI.
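For context, here is a rough sketch of the kind of count-only hook I mean (illustrative only, not the actual nomad-spark source; the Nomad-side helper is hypothetical):

```scala
class CountOnlySchedulerBackend {
  // Desired number of executors, as last requested by dynamic allocation.
  @volatile private var desiredExecutorCount: Int = 0

  // Dynamic allocation only moves this number up or down; no individual
  // executor is ever killed from here.
  def setExecutorCount(count: Int): Unit = {
    desiredExecutorCount = count
    updateNomadTaskGroupCount(desiredExecutorCount)
  }

  // Hypothetical stand-in for a call to the Nomad API that changes the
  // executor task group's count.
  private def updateNomadTaskGroupCount(count: Int): Unit = {
    println(s"would set Nomad task-group count to $count")
  }
}
```

That is why, on our side, scaling down only changes that number and we do not see the messages you are describing.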
Hi there, I have been trying out nomad-spark (both the v2.3.2 and v2.4.0 versions) with Nomad 0.8.6 and dynamic allocation, but I kept encountering issues when Nomad tries to auto-downscale after the Spark executors have been idle for some time.

Basically, Nomad was able to auto-upscale fine when I ran spark-shell to read some large parquet files, e.g. 1 executor -> 5 executors pretty quickly, and the Spark job completed normally. Other jobs can also be completed if I purposely keep the executors busy. However, once I leave it idle for a while (maybe 60s, based on my executorIdleTimeout and cachedExecutorIdleTimeout settings?), I believe Nomad tries to auto-downscale the number of executors, and I get the following warning messages within spark-shell:
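(Side note on the two timeouts above: as far as I know these correspond to the standard Spark dynamic-allocation properties, e.g. in spark-defaults.conf; the values here are only an illustration, not my actual config, which is further below:)

```
spark.dynamicAllocation.enabled                      true
spark.shuffle.service.enabled                        true
spark.dynamicAllocation.executorIdleTimeout          60s
spark.dynamicAllocation.cachedExecutorIdleTimeout    120s
```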
I understand from the source code that NomadClusterSchedulerBackend.scala doesn't actually implement the down-scaling part in the doKillExecutors method (although it really would have been preferable to truly downscale the executors too). Strangely, after that, new Spark jobs can still be submitted to the Spark driver and observed via the Spark Web UI, but the supposedly killed (yet not actually killed) executors never pick up the new jobs, and Nomad doesn't spawn new executors to run them either, so all new jobs just pend forever. Has anyone encountered this issue and figured out what the cause could be? Thank you.
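To make the symptom concrete, here is roughly the kind of no-op kill hook I mean (an illustrative sketch, not the actual NomadClusterSchedulerBackend code):

```scala
import scala.concurrent.Future

// Illustrative sketch, not the actual nomad-spark implementation.
class NoOpDownscaleBackend {
  // If a backend acknowledges the kill request without actually stopping the
  // executors (or shrinking the Nomad task group), the driver can end up
  // treating them as pending removal and stop scheduling on them, while the
  // executor processes keep running idle.
  def doKillExecutors(executorIds: Seq[String]): Future[Boolean] = {
    // Nothing is sent to Nomad here: the task-group count stays the same and
    // the executor processes are left running.
    Future.successful(true)
  }
}
```

If that is what happens, it would match what I am seeing: the driver thinks the idle executors are going away, but they never do, and no replacements are ever requested.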
The following are some of my configs (some are just consul-template values):
spark-defaults.conf
job_template.json