We are setting the wrong Spark property to increase Dataproc Batches parallelism.

https://github.com/gluent/goe/blob/main/src/goe/offload/spark/dataproc_offload_transport.py#L194

```python
if (
    "spark.cores.min" not in self._spark_config_properties
    and self._offload_transport_parallelism
):
    # If the user has not configured spark.cores.min then default to offload_transport_parallelism + 1
    # This only applies to Dataproc Serverless and therefore is not injected
    # into self._spark_config_properties.
    spark_config_props.extend(
        [
            "spark.cores.min={}".format(
                str(self._offload_transport_parallelism + 1)
            )
        ]
    )
```
`spark.cores.min` does not influence the default settings. We need to increase `spark.executor.cores` instead, if it is not already set by the user.

Note that the default value for `spark.executor.cores` is 4, and the only valid values are 4, 8 and 16. Also note that this property specifies cores per executor instance; the number of instances, `spark.executor.instances`, defaults to 2.

The best logic may be: if neither `spark.executor.cores` nor `spark.executor.instances` is set by the user, then increase `spark.executor.cores` to satisfy parallelism up to 32, and from there on increase `spark.executor.instances`. If either is set by the user then we should defer resource control to the user and log a message. A sketch of this defaulting logic follows below.
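A minimal sketch of that proposal, assuming the valid core counts and instance default described above. The helper name, its signature and the `user_properties` parameter are hypothetical, not code from the repository; only the property names and limits come from this issue:

```python
import math

# Dataproc Serverless constraints as described above (assumed here):
# spark.executor.cores accepts 4, 8 or 16; spark.executor.instances defaults to 2.
VALID_EXECUTOR_CORES = (4, 8, 16)
DEFAULT_EXECUTOR_INSTANCES = 2


def executor_properties_for_parallelism(parallelism: int, user_properties: dict) -> list:
    """Hypothetical helper returning spark.executor.* settings for a parallelism target."""
    if (
        "spark.executor.cores" in user_properties
        or "spark.executor.instances" in user_properties
    ):
        # The user has taken resource control; defer to them (the caller should log a message).
        return []
    # First grow cores per executor: with the default 2 instances this covers
    # parallelism up to 16 * 2 = 32.
    for cores in VALID_EXECUTOR_CORES:
        if cores * DEFAULT_EXECUTOR_INSTANCES >= parallelism:
            return ["spark.executor.cores={}".format(cores)]
    # Beyond 32, pin cores at the maximum and scale out instances instead.
    max_cores = VALID_EXECUTOR_CORES[-1]
    instances = math.ceil(parallelism / max_cores)
    return [
        "spark.executor.cores={}".format(max_cores),
        "spark.executor.instances={}".format(instances),
    ]
```

For example, a parallelism of 10 would yield `spark.executor.cores=8` (capacity 8 × 2 = 16), while 40 would yield `spark.executor.cores=16` plus `spark.executor.instances=3` (capacity 48); any user-supplied setting short-circuits the defaults.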