hashicorp / nomad-spark

DEPRECATED: Apache Spark with native support for Nomad as a scheduler

Using dynamic allocation hangs Spark job after autoscaling down Spark workers #20

Open guangie88 opened 5 years ago

guangie88 commented 5 years ago

Hi there, I have been trying out nomad-spark (both v2.3.2 and v2.4.0, on Nomad 0.8.6) with dynamic allocation, but I keep encountering issues when Nomad tries to auto-downscale after the Spark executors have been idle for some time.

Auto-upscaling works fine: when I run spark-shell to read some large parquet files, Nomad scales up quickly (e.g. 1 executor -> 5 executors) and the Spark job completes normally. Other jobs also complete if I deliberately keep the executors busy.

However, once I leave it idle for some time (probably 60s, going by my executorIdleTimeout and cachedExecutorIdleTimeout settings), I believe Nomad tries to auto-downscale the number of executors, and I get the following warning messages in spark-shell:

19/01/26 16:34:32 WARN NomadClusterSchedulerBackend: Ignoring request to kill 1 executor(s): 3c508b0a-c46d-f8a4-6f52-99d28a4d250a-1548520286383 (not yet implemented for Nomad)
19/01/26 16:34:32 WARN ExecutorAllocationManager: Unable to reach the cluster manager to kill executor/s 3c508b0a-c46d-f8a4-6f52-99d28a4d250a-1548520286383 or no executor eligible to kill!
19/01/26 16:34:34 WARN NomadClusterSchedulerBackend: Ignoring request to kill 1 executor(s): 92755867-efb3-3abe-429a-1015c697a624-1548520286556 (not yet implemented for Nomad)
19/01/26 16:34:34 WARN ExecutorAllocationManager: Unable to reach the cluster manager to kill executor/s 92755867-efb3-3abe-429a-1015c697a624-1548520286556 or no executor eligible to kill!

From the source code I understand that NomadClusterSchedulerBackend.scala does not actually implement the down-scaling part in its doKillExecutors method (although truly downscaling the executors would have been preferable). The strange part is what happens afterwards: new Spark jobs can still be submitted to the driver and observed in the Spark Web UI, but the supposedly killed (yet not actually killed) executors never pick up the new jobs, and Nomad does not spawn new executors to run them either, so all new jobs just pend forever.
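
For reference, the warnings suggest the kill request is simply logged and dropped. A minimal sketch of what the backend seems to do (paraphrased, not the actual source; println stands in for Spark's logging):

import scala.concurrent.Future

// Paraphrased sketch: the real method overrides the doKillExecutors hook of
// Spark's CoarseGrainedSchedulerBackend. The kill request is acknowledged
// but never forwarded to Nomad, matching the warning in the log above.
def doKillExecutors(executorIds: Seq[String]): Future[Boolean] = {
  println(s"WARN Ignoring request to kill ${executorIds.size} executor(s): " +
    s"${executorIds.mkString(", ")} (not yet implemented for Nomad)")
  // An unsuccessful result is what makes ExecutorAllocationManager log the
  // "Unable to reach the cluster manager" warning that follows.
  Future.successful(false)
}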

Has anyone else encountered this issue and figured out what the cause could be? Thank you.

The following are some of my configs (some are just consul-template values):

spark-defaults.conf

spark.master                                       nomad
spark.driver.bindAddress                           0.0.0.0
spark.driver.host                                  {{ with node }}{{ .Node.Address }}{{ end }}

spark.sql.catalogImplementation                    hive
spark.shuffle.service.enabled                      true
spark.dynamicAllocation.enabled                    true
spark.dynamicAllocation.initialExecutors           1
spark.dynamicAllocation.minExecutors               1
spark.dynamicAllocation.maxExecutors               50

spark.dynamicAllocation.executorIdleTimeout        60s
spark.dynamicAllocation.cachedExecutorIdleTimeout  60s

spark.executor.instances                           1
spark.executor.cores                               2
spark.executor.memory                              3g

spark.nomad.dockerImage                            {{ key "${spark_nomad_docker_image_key}" }}
spark.nomad.sparkDistribution                      {{ key "${spark_nomad_spark_distribution_key}" }}
spark.nomad.datacenters                            {{ key "${spark_nomad_datacenters_key}" }}
spark.nomad.job.template                           {{ key "${spark_nomad_job_template_key}" }}

job_template.json

{
    "Job": {
        "Constraints": [
            {
                "LTarget": "$${node.class}",
                "RTarget": "${node_class}",
                "Operand": "="
            }
        ],

        "Datacenters": ${az},
        "Region": "${region}",
        "Type": "service"
    }
}
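
For completeness, the spark-shell sequence that triggers this is roughly the following (the parquet path is just a placeholder):

// Runs in spark-shell, where `spark` is the provided SparkSession.
// 1. Read something large so dynamic allocation scales up past minExecutors.
val df = spark.read.parquet("hdfs:///path/to/large.parquet")
df.count()  // executors busy: Nomad upscales, e.g. 1 -> 5 executors

// 2. Stay idle past executorIdleTimeout (60s here), so the allocation
//    manager asks the backend to kill the idle executors (warnings above).
Thread.sleep(90 * 1000)

// 3. Submit another job: it pends forever, because the "killed" executors
//    never take tasks and no new executors are spawned.
df.count()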
cgbaker commented 5 years ago

thanks for the details, @guangie88, I will look into this.

tantra35 commented 5 years ago

@guangie88 When we use dynamic allocation we never see the log messages you posted. Spark on Nomad only increases/decreases the executor count via the method def setExecutorCount(count: Int): Unit. Can you clarify how you got those messages? The only reason that comes to mind is that you manually killed an executor in the Spark web UI.
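
For context, my understanding is that this method only sets the desired count on the executor task group of the Nomad job, roughly like the sketch below (hypothetical, simplified types standing in for the Nomad SDK models; the "executor" group name and the register call are assumptions):

// Hypothetical, simplified stand-ins for the Nomad job model.
case class TaskGroup(name: String, var count: Int)
case class Job(id: String, taskGroups: Seq[TaskGroup])

class NomadJobManipulator(var job: Job) {
  // Dynamic allocation maps to changing Count on the executor task group
  // and re-registering the job; Nomad then starts or stops allocations.
  def setExecutorCount(count: Int): Unit = {
    job.taskGroups
      .find(_.name == "executor") // assumed group name
      .foreach(_.count = count)
    register(job)
  }

  // Placeholder for the real Nomad API call (job register endpoint).
  private def register(job: Job): Unit =
    println(s"re-registering job ${job.id} with ${job.taskGroups}")
}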