gluent / goe

GOE: a simple and flexible way to copy data from an Oracle Database to Google BigQuery.

Automate Dataproc Batches scaling in line with offload parallelism #165

Closed nj1973 closed 2 months ago

nj1973 commented 2 months ago

We are setting the wrong Spark property to increase Dataproc Batches parallelism.

https://github.com/gluent/goe/blob/main/src/goe/offload/spark/dataproc_offload_transport.py#L194

        if (
            "spark.cores.min" not in self._spark_config_properties
            and self._offload_transport_parallelism
        ):
            # If the user has not configured spark.cores.min then default to offload_transport_parallelism + 1
            # This only applies to Dataproc Serverless and therefore is not injected
            # into self._spark_config_properties.
            spark_config_props.extend(
                [
                    "spark.cores.min={}".format(
                        str(self._offload_transport_parallelism + 1)
                    )
                ]
            )

spark.cores.min has no effect on the Dataproc Serverless resource defaults. We need to increase spark.executor.cores instead, if it is not already set by the user.

Note that the default value for spark.executor.cores is 4 and valid values are 4, 8 and 16 only. Also note that this specifies cores per executor instance; the number of instances is controlled by spark.executor.instances, which defaults to 2.
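
To make the arithmetic concrete: with those defaults a batch runs at most 4 × 2 = 8 concurrent tasks, so offload parallelism above 8 buys nothing until one of the two properties is raised. A quick sanity check (values per the Dataproc Serverless defaults quoted above):

    # Default Dataproc Serverless executor sizing:
    default_cores = 4      # spark.executor.cores (valid values: 4, 8, 16)
    default_instances = 2  # spark.executor.instances
    print(default_cores * default_instances)  # 8 concurrent tasks out of the box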

The best logic may be: if neither spark.executor.cores nor spark.executor.instances is set by the user, increase spark.executor.cores to satisfy parallelism up to 32 (the most the default 2 instances can provide at 16 cores each), and from there on increase spark.executor.instances. If either property is set by the user then we should defer resource control to the user and log a message; a sketch of this follows below.
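
A minimal sketch of that logic as a standalone function. The name proposed_executor_properties, its signature, and the log callback are illustrative, not the GOE implementation; the spark_config_properties and parallelism arguments stand in for self._spark_config_properties and self._offload_transport_parallelism from the snippet above:

    def proposed_executor_properties(spark_config_properties, parallelism, log=print):
        """Sketch: derive Dataproc Serverless executor sizing from offload parallelism.

        Returns a list of extra spark.* properties, or an empty list if the
        defaults suffice or the user has taken control of executor sizing.
        """
        if (
            "spark.executor.cores" in spark_config_properties
            or "spark.executor.instances" in spark_config_properties
        ):
            # Either property set by the user: defer resource control and log.
            log("Executor properties set by user, deferring resource control")
            return []
        if not parallelism or parallelism <= 8:
            # The defaults (4 cores x 2 instances) already cover this parallelism.
            return []
        if parallelism <= 32:
            # Up to 32 we can scale cores alone; valid values are 4, 8 and 16.
            cores = next(c for c in (4, 8, 16) if c * 2 >= parallelism)
            instances = 2
        else:
            # Beyond 32, pin cores at the maximum and scale instances instead.
            cores = 16
            instances = -(-parallelism // 16)  # ceiling division
        return [
            "spark.executor.cores={}".format(cores),
            "spark.executor.instances={}".format(instances),
        ]

For example, parallelism 12 would yield ["spark.executor.cores=8", "spark.executor.instances=2"], and parallelism 40 would yield ["spark.executor.cores=16", "spark.executor.instances=3"].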