Hey @Cian911, I'm not able to reproduce this locally so far with those values. From experience, you get this error if coreRequest or coreLimit don't conform to the Kubernetes resource quantity syntax. Do you have any mutating webhooks on the cluster that might mutate the request or limit fields on pod creation?
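For anyone hitting the same error, a minimal sketch of what conforming values look like in the driver/executor spec (the values below are purely illustrative, not taken from this issue):

driver:
  cores: 1
  coreRequest: "500m"   # must parse as a Kubernetes resource quantity
  coreLimit: "1"
  memory: "4g"
executor:
  instances: 2
  cores: 4
  coreRequest: "3500m"
  coreLimit: "4"
  memory: "8g"

Anything in coreRequest/coreLimit that does not parse as a Kubernetes quantity will trigger this kind of error.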
@Cian911 I can see that when the emptyDir sizeLimit is nil, the arg --conf spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-nvme.options.sizeLimit=<nil> will be added to spark-submit, which may fail the submission. Would you like to retry after giving the sizeLimit a specific value?
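For example, a minimal sketch with an explicit sizeLimit (the 100Gi value here is just an illustration; pick whatever fits the disk):

volumes:
  - name: spark-local-dir-nvme
    emptyDir:
      sizeLimit: 100Gi   # any non-nil quantity avoids the <nil> conf value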
@ChenYi015 Bingo - I think this is it.
The team and I actually managed to fix the issue here, just by changing the name of the emptyDir volume:

- emptyDir: {}
  name: spark-local-dir-nvme

to:

- emptyDir: {}
  name: local-nvme
I thought it was a problem caused by the SparkLocalDirPrefix value.
This looks like a better solution. Much appreciated for following up @ChenYi015 !
That's good. For volumes that are not prefixed with spark-local-dir-, the volumeMounts will be patched by the webhook server. Volumes that are prefixed with spark-local-dir- are mounted by Spark during spark-submit, and a redundant sizeLimit conf with a <nil> value is added if sizeLimit is not specified. I have raised a PR to fix this so users can still use volume names with the local dir prefix.
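Roughly, that means a volume without the prefix needs an explicit volumeMount in the driver/executor spec for the webhook to patch onto the pods, along these lines (the mountPath is only an example):

volumes:
  - name: local-nvme
    emptyDir: {}
executor:
  volumeMounts:
    - name: local-nvme
      mountPath: /tmp/nvme   # patched onto the executor pods by the webhook server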
Nice catch @ChenYi015. Apologies @Cian911, I only tested with the request values and not the full application spec including volumes.
No worries @jacobsalway, it was a tricky one to find nonetheless. The corresponding error was not helpful and led me down the entirely wrong path for quite a while!
Description
I've been scratching my head on this one for the past few days without any resolution.
I am in the process of testing a migration of the spark operator from spark-operator-chart-1.4.6 to v2.0.1 and have come across the following issue. It seems that submission fails at the point it tries to create a driver pod, with the following error around resource quantities:

Below is the full error log.
First thing to note on this log line:

ERROR Client: Please check "kubectl auth can-i create pod" first. It should be yes.

The CR is using a serviceAccount that does have the appropriate permissions to perform full CRUD operations on the pods resource, just to rule that out before anyone asks.

I made no changes to the resource values between spark-operator-chart-1.4.6 and v2.0.1. My driver & executor resource asks essentially look like this:

After enabling debug logs on the operator-controller, I can see that these values are correctly passed in and submitted as --conf arguments, but it fails directly after that. This smells to me like an issue with spark:3.5.1, but I am not entirely sure. I will post the full SparkApplication below for reference.

Reproduction Code [Required]
Expected behavior
Driver & Executor pods should spin up and job should start.
Actual behavior
Job submission fails.
Terminal Output Screenshot(s)
Environment & Versions
Spark Operator App version: v2.0.1
Helm Chart Version: v2.0.1
Kubernetes Version: v1.29.3
Apache Spark version: 3.4.1
Additional context
cc: @ChenYi015 @jacobsalway