radanalyticsio / spark-operator

Operator for managing Spark clusters on Kubernetes and OpenShift.
Apache License 2.0

Add new request and limit values to SparkCluster CR for cpu and memory #277

Closed tmckayus closed 4 years ago

tmckayus commented 4 years ago

Description

Previously, a SparkCluster allowed an optional value for cpu and memory that was used to set both the limit and the request for each. This change adds new optional fields to set the request and limit values individually.
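A hypothetical CR snippet illustrating the fields under discussion (the field names `cpu`, `memory`, `cpuRequest`, `cpuLimit`, `memoryRequest`, and `memoryLimit` come from this thread; the exact CR layout, apiVersion, and values shown here are assumptions for illustration):

```yaml
# Hypothetical SparkCluster sketch -- exact schema may differ.
apiVersion: radanalytics.io/v1
kind: SparkCluster
metadata:
  name: my-spark-cluster
spec:
  master:
    instances: "1"
    cpu: "1"          # legacy field: sets both request and limit
    memory: "1Gi"     # legacy field: sets both request and limit
  worker:
    instances: "2"
    cpuRequest: "500m"     # new field: sets only the request
    cpuLimit: "1"          # new field: sets only the limit
    memoryRequest: "512Mi"
    memoryLimit: "2Gi"
```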

To preserve backward compatibility:

Related Issue

Types of changes

:sparkles: New feature (non-breaking change which adds functionality)

tmckayus commented 4 years ago

@elmiko @jkremser ptal

erikerlandson commented 4 years ago

LGTM - are those accessor methods like w.getMemoryRequest() automatically generated?

tmckayus commented 4 years ago

> LGTM - are those accessor methods like w.getMemoryRequest() automatically generated?

yes, the CRs are specified as JSON and then the build produces those internal classes

LaVLaS commented 4 years ago

Is it possible to have the situation where the user only sets the limit or request so that the OpenShift defaults can be automatically applied to the missing values?

tmckayus commented 4 years ago

@LaVLaS I was thinking about that -- this preserves the current behavior for "cpu" and "memory".

I suppose it could be modified: "cpu" and "memory" set both values to preserve the current API, we add cpuLimit and memoryLimit too, and the new fields each set only the relevant value. I'll take a look; this might mean an additional change in the code that generates the pod spec (as opposed to the changes here, which just modify the map arguments).

tmckayus commented 4 years ago

Okay, this works. Modifying the tests.

- "cpu" and "memory" will work as before and set initial values if they are present
- "cpuRequest" and "memoryRequest" will take precedence over values from cpu/memory
- "cpuLimit" and "memoryLimit" will take precedence over values from cpu/memory
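The precedence rules above can be sketched as follows (a hypothetical helper for illustration, not the operator's actual code):

```java
// Sketch of the precedence described above: "cpu"/"memory" seed both
// values, and the specific request/limit fields override their side.
public class EffectiveResources {

    /**
     * Returns {request, limit} given the legacy combined value and the
     * optional specific overrides (null means the field was not set).
     */
    public static String[] resolve(String combined, String request, String limit) {
        String req = (request != null) ? request : combined;
        String lim = (limit != null) ? limit : combined;
        return new String[] { req, lim };
    }

    public static void main(String[] args) {
        // memory: "1Gi" plus memoryLimit: "2Gi" -> request 1Gi, limit 2Gi
        String[] mem = resolve("1Gi", null, "2Gi");
        System.out.println(mem[0] + " " + mem[1]); // prints "1Gi 2Gi"
    }
}
```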

elmiko commented 4 years ago

that sounds good to me Trev, my only suggestion would be to document it somewhere. the readme doesn't quite have a good api section yet, but maybe adding something to the examples would work?

tmckayus commented 4 years ago

@elmiko all of the test cases are in the example, but a word in the README couldn't hurt

tmckayus commented 4 years ago

So for some reason in the test rig, sometimes the status updates fail. Not sure why this is. To keep this moving, I added a retry with a half second delay around the status update and suppressed any failures.
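The retry-and-suppress approach described above might look roughly like this (an illustrative sketch, not the actual AbstractOperator code; names and parameters are assumptions):

```java
import java.util.function.Supplier;

// Sketch of a retry wrapper around a flaky status update: try a few
// times with a fixed delay and suppress failures so the operator
// doesn't fail on a transient status-update error.
public class StatusRetry {

    public static boolean updateWithRetry(Supplier<Boolean> update,
                                          int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (update.get()) {
                    return true; // update succeeded
                }
            } catch (Exception ignored) {
                // suppressed: status updates are best-effort here
            }
            try {
                Thread.sleep(delayMillis); // e.g. 500 ms between attempts
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // all attempts failed
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Fails once, then succeeds on the second attempt.
        boolean ok = updateWithRetry(() -> ++calls[0] >= 2, 3, 10);
        System.out.println(ok + " after " + calls[0] + " attempts");
    }
}
```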

We might need a second pass in AbstractOperator to smooth this out

It could just be that the current travis setup here is using ocp 3.9. Probably shouldn't worry too much unless it's seen on 3.11 or 4.x (and we should try to update travis as well)

erikerlandson commented 4 years ago

looks like one of the tests just timed out

tmckayus commented 4 years ago

trying to speed up the additional tests ...

tmckayus commented 4 years ago

okay tests pass. @erikerlandson @LaVLaS are you good to merge?

LaVLaS commented 4 years ago

LGTM

erikerlandson commented 4 years ago

LGTM, please merge asap. Thanks @tmckayus !

tmckayus commented 4 years ago

@elmiko or @jkremser please merge :)

elmiko commented 4 years ago

lgtm, thanks everyone!

jkremser commented 4 years ago

Cool, it's good to see things moving 💯

One note though: although these limits are correctly passed to the container's k8s spec, I am afraid (y)our current implementation of rad.io's spark images doesn't actually honor them. The Java process needs to be started with JVM options that size the JVM according to the limits enforced by cgroups on the container.
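For reference, on HotSpot JVMs of that era the cgroup awareness typically came from flags like the following (which flags apply depends on the JDK version in the image; the `JAVA_OPTS` variable name is just an assumption about how the image passes options):

```shell
# Java 8u131+: experimental cgroup memory-limit awareness
JAVA_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap"

# Java 10+ (backported to 8u191+): container support plus heap sized
# as a percentage of the container memory limit
JAVA_OPTS="-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0"
```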

tmckayus commented 4 years ago

@jkremser thanks! taking a look now

tmckayus commented 4 years ago

fyi, working on getting https://github.com/radanalyticsio/openshift-spark/pull/71 merged