Open fhennig opened 2 years ago
Since Stefan mentions contracts, this seems to have at least some legal aspects to it. Does that document need to be versioned? If it's referenced from a contract, it should probably be somewhat "static".
Content-wise, I'll brainstorm a few things:
What do similar companies do?
Maybe we should have sizing suggestions for a k8s cluster? Minimum 3 nodes, settings for k8s autoscaling because of resilience / failover, ... sizing of nodes etc. How many nodes minimum depending on number of data products / components?
Good points. Added under Cluster
.
I did some research using the provided Cloudera and OpenShift links and checked the CPU and memory consumption of some of the operators / products. I ran the [waterlevel](https://docs.stackable.tech/stackablectl/stable/demos/nifi-kafka-druid-water-level-data.html) demo.
I used kind in order to be able to run it on a laptop locally and installed a metrics server for some insights:

```shell
docker exec -it kind-worker2 bash
```
This was after the cluster stabilized. CPU is in millicores, meaning a value of 1000 equals one core. Memory is specified in MB. The operators (even when reconciling) require almost no CPU/memory. This can of course change if we have e.g. 1,000,000 NiFi clusters to reconcile, so this should be specified. In general I would recommend about 1/5th or even 1/10th of a core and 50 to 100 MB of memory for each operator (this should be tested more reliably). The biggest product parts were NiFi with about 3.3 GB of memory and the two Druid MiddleManagers with about 2 GB each.
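Expressed as Kubernetes resource requests/limits, the per-operator recommendation above could look roughly like this (the concrete values are assumptions to be refined by testing, not verified operator defaults):

```yaml
# Sketch only: rough per-operator values derived from the observations above,
# not verified defaults for any Stackable operator.
resources:
  requests:
    cpu: 100m      # ~1/10 core
    memory: 64Mi   # within the 50-100 MB estimate
  limits:
    cpu: 200m      # ~1/5 core
    memory: 128Mi
```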
docker stats:
This shows 31.25 GB for each node (which is the overall memory of my computer), so the memory percentages should be multiplied by 4, assuming each node would get a quarter (minus OS etc.).
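As a quick sanity check of that scaling (using the numbers from this test setup only):

```shell
# docker stats reports MEM% against the full 31.25 GB host; with 4 equal
# nodes each one effectively gets a quarter, so scale percentages by 4.
awk 'BEGIN {
  host_gb = 31.25; nodes = 4
  printf "%.2f GB effective per node\n", host_gb / nodes
  printf "a reported 2%% is effectively %.0f%%\n", 2 * nodes
}'
```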
I did some testing applying CRs / reconciling (`docker exec -it kind-worker2 bash` and running `top`): applying a new cluster every 0.5 seconds, the CPU did not exceed 2 percent. Without the 0.5 s sleep it did not exceed 7 percent.
Memory stayed unchanged.
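The loop could be sketched like this (a hypothetical reconstruction, not the actual script used; the manifest rendered here is a placeholder, not a real CR):

```shell
#!/usr/bin/env sh
# Hypothetical reconstruction of the apply loop, not the actual test script.
# render prints a minimal placeholder manifest; in practice any cluster CR
# with a unique name would be substituted here.
render() {
  printf 'metadata:\n  name: test-cluster-%s\n' "$1"
}

i=0
while [ "$i" -lt 100 ]; do
  i=$((i + 1))
  render "$i"
  # render "$i" | kubectl apply -f -   # the actual apply step on a live cluster
  # sleep 0.5                          # omit for the "without the sleep" variant
done
```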
What do others do?
They basically list each role of each product with recommended system requirements (CPU, memory, disk etc.). Additionally, they recommend e.g. increasing heap/memory depending on incoming connections and sometimes provide hints on where to increase it. For disks they recommend standard disks vs. SSDs etc. depending on the product.
What do we need to specify?
Stackable Data Platform
I think by running some tests with multiple clusters, we can fix some values for CPU and memory for each operator and/or the whole SDP.
Products
I think we cannot specify anything reliable for products. We can specify e.g. the minimum requirements to run our demos, or a minimal example. But we do not know the customers' data, queries etc. in order to give a proper estimation. I would refer here to the products and their requirements/scaling.
Cloudera e.g. lists hardware requirements (heap/memory, CPU, disk) for products. I assume they come from experience/testing; I could not verify that the values are taken from any product website (e.g. HBase).
Upper limits for resources / clusters
I tested with 1,000 custom resources (ZooKeeper, but with replicas set to 0) applied one after another in a script and there were no issues (see the usage figures above). The API server is doing more work than the operator... This should still be tested for every operator.
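Such a script might look like the following sketch; the `ZookeeperCluster` spec fields are approximated from memory of the CRD and may not match the current schema exactly:

```shell
#!/usr/bin/env sh
# Sketch of the 1000-CR load test. The spec fields below are an approximation
# of the Stackable ZookeeperCluster CRD, not a verified schema.
n=1000
for i in $(seq 1 "$n"); do
  cat <<EOF
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: zk-load-test-$i
spec:
  servers:
    roleGroups:
      default:
        replicas: 0   # objects for the operator to reconcile, but no pods
EOF
done > zk-load-test.yaml
# kubectl apply -f zk-load-test.yaml   # then watch operator / API server usage
```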
Minimum requirements for SDP and demos
I think, at least for kind and what we are all testing every day, a laptop with (16? to) 32 GB of memory and 4 to 8 cores can easily run any demo we have produced so far.
Upper limits - How many resources can a single operator instance support?
Seeing the CPU and memory usage of the ZooKeeper operator, I think it can handle more resources than will ever be present in any cluster. But it probably makes sense to set an upper limit per cluster just to be on the safe side.
Recommended Documentation Layout
The operator and product requirements should be captured in a table per operator/product with the following columns:
What is left TODO / Acceptance (must be further refined and split up into tickets):
Operators
Products
Cluster
Misc
Hopefully merged properly with https://github.com/stackabletech/documentation/issues/258