Configurable multiple node pools [EPIC]

pbochynski commented 5 months ago

Description

Kyma clusters should support multiple machine types simultaneously, for example GPU and ARM nodes, or network-, memory-, and CPU-optimized nodes.

Acceptance criteria:

[ ] KIM supports multiple node pools. Each pool has a machine type, autoscaler limits (min/max), and HA-related settings (zones or HA on/off)
[ ] KEB supports multiple node pools with input parameters that match the KIM settings.
[ ] new settings (input parameters for KEB) are backward compatible
[ ] metering is adjusted to different node types with a multiplying factor that reflects price differences (e.g. GPU has factor 2.0)
[ ] image builder can produce multi architecture images (amd64 and arm64) (#oci-image-builder/45 + #backlog/5589)
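
The metering criterion above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual metering implementation: the `PRICE_FACTORS` table, the category names, the `PoolUsage` type, and the function `weighted_node_hours` are hypothetical. Only the GPU factor of 2.0 is taken from the acceptance criteria.

```python
from dataclasses import dataclass

# Hypothetical multiplying factors per node category. Only the GPU factor
# of 2.0 comes from the acceptance criteria; "standard" is a placeholder.
PRICE_FACTORS = {"standard": 1.0, "gpu": 2.0}


@dataclass
class PoolUsage:
    category: str    # key into PRICE_FACTORS (assumed naming)
    node_count: int  # number of nodes that were running
    hours: float     # how long they were running


def weighted_node_hours(pools):
    """Sum node hours over all pools, scaled by each pool's price factor."""
    total = 0.0
    for pool in pools:
        factor = PRICE_FACTORS[pool.category]  # raises KeyError if unknown
        total += pool.node_count * pool.hours * factor
    return total


usage = [PoolUsage("standard", 3, 24.0), PoolUsage("gpu", 1, 10.0)]
print(weighted_node_hours(usage))  # 3*24*1.0 + 1*10*2.0 = 92.0
```

Keeping the factors in a single lookup table would make it easy to adjust them when hyperscaler prices change, which the discussion below expects to happen frequently.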

Reasons

Our customers demand ARM and GPU nodes in Kyma clusters so they can run their workloads on the architecture that supports their use cases. Examples:

running applications that require GPUs, such as NVIDIA Omniverse

Related issues

https://github.com/kyma-project/infrastructure-manager/issues/46
18195
https://github.com/kyma-project/infrastructure-manager/issues/364 - TBD: issue for each Kyma module to be compatible with the newly supported worker-pool permutations

tobiscr commented 4 months ago

Open questions

What are the customer needs for a second worker pool (cost saving?)?
Are we allowing 0 worker pools?
What machine sizes are we going to support - do we limit the supported types for additional worker pools?
Do we always have a mandatory system worker pool that is used for Kyma workloads, to ensure at least one compatible worker pool is configured?
What config parameters are we exposing (arch, min-max pool size, container runtime etc.)?
How do we handle a customer selecting ARM for a worker pool - we have to ensure our workloads won't be installed on this architecture (affinity for a pool required)?
Does Gardener have limitations for multiple worker pools (e.g. they run their own workloads within K8s, such as cert-manager)?

Impacts

We have to ensure our modules are compatible with all supported worker pools (e.g. Istio could be mandatory for particular workloads even on a second worker pool)
Exposing more configurable parameters increases testing efforts on our side!
Pod affinity is required to ensure Kyma workloads are scheduled on the "system worker pool" by default
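
One way to pin workloads to the system worker pool is a Kubernetes node affinity rule, sketched below as a small helper that builds the `affinity` section of a pod spec. This is an illustration, not how Kyma modules are configured: the label key `worker.gardener.cloud/pool` is assumed to be the node label carrying the pool name, the pool name `system` is a placeholder, and both helper functions are hypothetical.

```python
import json

# Assumption: nodes carry this label with the worker pool name as its value.
POOL_LABEL = "worker.gardener.cloud/pool"


def system_pool_affinity(pool_name):
    """Build a node affinity that only allows nodes of the given pool."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": POOL_LABEL,
                                "operator": "In",
                                "values": [pool_name],
                            }
                        ]
                    }
                ]
            }
        }
    }


def with_affinity(pod_spec, pool_name):
    """Return a copy of pod_spec with the pool affinity set."""
    patched = dict(pod_spec)  # shallow copy; the input stays untouched
    patched["affinity"] = system_pool_affinity(pool_name)
    return patched


# "system" is a placeholder; the thread has not yet decided on the pool name.
spec = with_affinity({"containers": [{"name": "app", "image": "example"}]}, "system")
print(json.dumps(spec["affinity"], indent=2))
```

A required rule keeps Kyma workloads off an ARM or GPU pool entirely. A preferred rule (`preferredDuringSchedulingIgnoredDuringExecution`) would only express "by default" and still allow scheduling elsewhere.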

tobiscr commented 4 months ago

Feedback from stakeholders:

@ebensom :

From an operational perspective, we should not expose all worker-pool configuration options of the Shoot spec (e.g. each worker pool can have a different Linux image version, which can have security implications).
Patching worker pools requires enhancing the current logic to support multiple worker pools. The upgrade has to happen pool-by-pool.
In the midterm, we should also consider supporting gVisor for our default worker pool (used by Kyma workload).

@varbanv :

Customers want to be able to configure different worker pools with different configurations (e.g. arch, sizes, GPU etc.). Reasons include handling temporary load peaks and saving costs. Conclusion:
1. expose everything(?) in regards to machine types (not needed from day 1 - we can add machine types when customers request them)
2. deal properly with billing (costs differ considerably between machine types and change frequently)
Workloads have to run on a particular worker pool (e.g. for cost-saving purposes). Conclusion:
1. worker pool affinity required
2. Kyma always runs in its own worker pool (which customers can configure only to a limited extent) to separate Kyma workloads from customer workloads. But it's not a dedicated worker pool for Kyma - customers are still allowed to schedule workloads in this worker pool.
3. customers can add additional worker pools which can be (fully) configured by them
4. we accept the risk that a machine type is not guaranteed to be available in every region (depends on the hyperscaler)
5. it's acceptable to make the worker-pool configurable outside of BTP cockpit (e.g. via kubectl calls - technical feasibility has to be clarified)
6. we have to deal with issues reported by Gardener properly and expect failure cases (e.g. machine type not supported in particular region) which have to be reported to customers
7. We start with a predefined list of machine types and we extend it when a need becomes visible
8. We have to make sure Kyma properly supports the offered worker pool configurations, such as ARM architecture (e.g. daemonsets that are installed on all worker pools, such as Istio)
  - Gardener workloads also have to support these worker pool configurations
  - Assumption: Gardener supports everything they offer in the Shoot spec.
9. Special configuration options for worker pools are:
  - The region is the same for all zones (cluster workers are not allowed to run in different regions); the CIDR configuration also has to reside in the same network
  - It has to be possible to configure the AZs for additional worker pools (e.g. having just 1 AZ for a worker pool)
  - Configuration has to be adjustable, e.g. node amount can be set to 0.
  - It has to be clarified how to add support for NVIDIA GPUs (drivers are missing by default)

Currently supported worker parameters in the Runtime CR: https://github.com/kyma-project/infrastructure-manager/blob/main/config/samples/infrastructuremanager_v1_runtime.yaml#L56

tobiscr commented 4 months ago

Next steps / Action items:

@PK85 + @ebensom : decide on the configuration options we are exposing for customers and track it in this issue
@PK85 : Inform @a-thaler about the results
@zhoujing2022 has to be informed about adjusting the testing strategy to cover the new architectures (at least required for modules which require a daemonset) - TBC if we run them only on the Kyma dedicated worker pool (via affinity) or the daemonset has to be compatible with the new architectures

marco-porru commented 4 months ago

In general, I see a bigger demand for GPUs explicitly requested by different teams, some about AI, others for ML algorithms. The scope, in any case, is always to have dedicated nodes to run specific tasks.

It is also reasonable to include m6g and m6in (or the currently available generation) for SAP for Me. One note on g5 and r7i: these are required for SAP Intelligent Product Recommendation.

PK85 commented 4 months ago

@tobiscr 1) @PK85 + @ebensom : decide on the configuration options we are exposing for customers and track it in this issue

We will keep it simple on the KEB side. We will keep these (mandatory) parameters at the root level for the system node pool (we will adjust the descriptions):

"autoScalerMax": ...
"autoScalerMin": ..
"machineType": ..

NOTE: this pool is always HA with a minimum of 3 nodes. We need to decide how to name that worker node pool; we probably already use some name right now.

and a new (optional) array of worker node pools for customer usage:

additionalWorkerNodePools [
"autoScalerMax": ...
"autoScalerMin": ..
"machineType": ..
]

NOTE: for now the same validation applies as for the system pool, which means HA is mandatory.
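
A minimal sketch of how this validation could look, assuming each entry of `additionalWorkerNodePools` is an object with the same three keys as the root-level system pool. This is not KEB code: the function names, the error messages, and the machine type names `type-small` and `type-large` are hypothetical. Only the parameter names and the HA minimum of 3 nodes come from this thread.

```python
HA_MIN_NODES = 3  # from the thread: HA pools need at least 3 nodes
REQUIRED_KEYS = ("machineType", "autoScalerMin", "autoScalerMax")


def validate_pool(pool, allowed_machine_types):
    """Return a list of error strings for a single pool definition."""
    missing = [key for key in REQUIRED_KEYS if key not in pool]
    if missing:
        return ["missing parameter: " + key for key in missing]
    errors = []
    if pool["machineType"] not in allowed_machine_types:
        errors.append("unsupported machineType: " + pool["machineType"])
    if pool["autoScalerMin"] < HA_MIN_NODES:
        errors.append("autoScalerMin must be at least %d" % HA_MIN_NODES)
    if pool["autoScalerMax"] < pool["autoScalerMin"]:
        errors.append("autoScalerMax must not be below autoScalerMin")
    return errors


def validate_parameters(params, allowed_machine_types):
    """Validate the root-level system pool and all additional pools."""
    errors = ["system pool: " + e
              for e in validate_pool(params, allowed_machine_types)]
    # The array is optional, so existing requests without it stay valid.
    for i, pool in enumerate(params.get("additionalWorkerNodePools", [])):
        errors += ["additionalWorkerNodePools[%d]: %s" % (i, e)
                   for e in validate_pool(pool, allowed_machine_types)]
    return errors


ALLOWED = {"type-small", "type-large"}  # placeholder machine type names
params = {
    "machineType": "type-small", "autoScalerMin": 3, "autoScalerMax": 10,
    "additionalWorkerNodePools": [
        {"machineType": "type-large", "autoScalerMin": 0, "autoScalerMax": 2},
    ],
}
for problem in validate_parameters(params, ALLOWED):
    print(problem)
```

With HA mandatory, the additional pool in this example is rejected because its `autoScalerMin` is 0. Allowing non-HA pools later would only require relaxing the `HA_MIN_NODES` check for those pools.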

Regarding machineTypes, we keep what we have for now and do not extend the list. The first reason is that we need to focus on running Kyma modules only in the system worker node pool. The second reason is that the existing KMC will work without any changes.

Later, once this is released and we see that everything works, we can add new machine types, including GPU etc. This requires adjusting billing etc.

Cheers, PK

ChristophRothmeier commented 2 months ago

Hello, my name is Christoph, I am a project manager for Ingentis and we are using Kyma running on SAP BTP (currently running 10 clusters in 4 different landscapes). We are also looking forward to having different node pools in Kyma, with the following use case:

We have some workloads that require a very high amount of memory in a single operation. The requirements can go up to 128 GB of RAM. Of course we do not want to run all nodes of our cluster with 128 GB machines, because this would be very expensive. The operations themselves cannot be optimized with low effort (we are generating large export files for PowerPoint and PDF, and the third-party libraries we are using for this do not support streamed or chunked exports; they require holding everything in memory).

So for us it would be important to have a system node pool with small machines (like 16 GB or 32 GB) and then an additional node pool for the heavy workloads (like 128 GB machines). It would be important for us to be able to scale down the additional node pool to zero, because we only need the expensive machines when there are heavy workloads. So the moment a user queues a heavy workload, we would spawn a pod on the additional node pool, the node pool should scale up, execute the workload (which typically needs some hours) and then scale down to zero after the workloads are done.

We do not require to have new machine types, like GPU or ARM machines.

I hope this is a state we can reach at some point. As I understand it, the current plan is to release additional node pools with HA, so they have to have at least 3 nodes permanently, without the option to scale to zero?

Kind regards, Christoph

tobiscr commented 2 months ago

Hi @ChristophRothmeier , thanks for your request.

The multiple worker pool feature is currently being implemented and will be rolled out by the end of this year. Initially, the list of supported machine types is not extended and includes the same machine types we offer when creating a new Kyma runtime via BTP cockpit. Support for additional machine types is already agreed, however, and will be added soon after the worker pool feature is productive.

For go-live, we will also offer only worker pools with HA support (meaning 3 nodes are the minimum). Scaling to 0 nodes is not possible with an HA-supporting worker pool, but the same effect can be achieved by dropping the worker pool and re-creating it afterwards.

We are already in discussions to allow non-HA worker pools with < 3 nodes. Such pools would also allow scaling to 0 nodes.

ChristophRothmeier commented 2 months ago

Hi Tobias,

thanks for the response. For us it would be huge to have the ability to scale down additional worker pools to zero with non-HA support. Could you send an update in this issue as soon as your discussions about this topic have progressed and it is clear if and when it will be implemented?

Thanks Christoph

github-actions[bot] commented 2 days ago

This issue has been automatically marked as stale due to the lack of recent activity. It will soon be closed if no further activity occurs. Thank you for your contributions.

kyma-project / kyma

Configurable multiple node pools [EPIC] #18709
