stackabletech / documentation

Stackable's central documentation repository built on Antora
https://docs.stackable.tech
Apache License 2.0

Document system requirements #247

Open fhennig opened 2 years ago

fhennig commented 2 years ago

I did some research with the links provided by Cloudera and OpenShift and checked the CPU and memory consumption of the operators / products. I ran the [waterlevel](https://docs.stackable.tech/stackablectl/stable/demos/nifi-kafka-druid-water-level-data.html) demo.

I used kind so that I could run it locally on a laptop, and installed a metrics server for some insights:

 NAMESPACE               NAME                                                PF      READY           RESTARTS STATUS               CPU       MEM      %CPU/R      %CPU/L      %MEM/R      %MEM/L IP                NODE                     AGE
 default                 airflow-operator-deployment-55c59c64c9-9lr8f        ●       1/1                    0 Running                1         5         n/a         n/a         n/a         n/a 10.244.2.2        kind-worker2             93m
 default                 commons-operator-deployment-67bb7859bf-gwwrh        ●       1/1                    0 Running                0         8         n/a         n/a         n/a         n/a 10.244.3.2        kind-worker              92m
 default                 create-druid-ingestion-job-xm6wh                    ●       0/1                    7 Completed              0         0         n/a         n/a         n/a         n/a 10.244.3.9        kind-worker              86m
 default                 create-nifi-ingestion-job-4x5xz                     ●       0/1                    7 Completed              0         0         n/a         n/a         n/a         n/a 10.244.1.10       kind-worker3             86m
 default                 druid-broker-default-0                              ●       1/1                    0 Running               12       444         n/a         n/a         n/a         n/a 10.244.3.13       kind-worker              81m
 default                 druid-coordinator-default-0                         ●       1/1                    0 Running               49       484         n/a         n/a         n/a         n/a 10.244.3.14       kind-worker              81m
 default                 druid-historical-default-0                          ●       1/1                    0 Running                5       610         n/a         n/a         n/a         n/a 10.244.3.12       kind-worker              81m
 default                 druid-middlemanager-default-0                       ●       1/1                    0 Running               23      2088         n/a         n/a         n/a         n/a 10.244.3.11       kind-worker              81m
 default                 druid-middlemanager-default-1                       ●       1/1                    0 Running               15      2046         n/a         n/a         n/a         n/a 10.244.1.17       kind-worker3             81m
 default                 druid-operator-deployment-55dfc87cb5-58szg          ●       1/1                    0 Running                1        10         n/a         n/a         n/a         n/a 10.244.1.2        kind-worker3             92m
 default                 druid-router-default-0                              ●       1/1                    0 Running                5       249         n/a         n/a         n/a         n/a 10.244.3.15       kind-worker              81m
 default                 hbase-operator-deployment-7599b6cdd6-2zkqx          ●       1/1                    0 Running                0         4         n/a         n/a         n/a         n/a 10.244.1.3        kind-worker3             92m
 default                 hdfs-operator-deployment-7456467d65-w2gcs           ●       1/1                    0 Running                0         6         n/a         n/a         n/a         n/a 10.244.2.3        kind-worker2             91m
 default                 hive-operator-deployment-6d9d69b69c-prgp7           ●       1/1                    0 Running                1         5         n/a         n/a         n/a         n/a 10.244.3.3        kind-worker              91m
 default                 kafka-broker-default-0                              ●       2/2                    0 Running               21       773           8         n/a          37         n/a 10.244.1.13       kind-worker3             86m
 default                 kafka-operator-deployment-56fb5fdf9c-6jn4h          ●       1/1                    0 Running                0         7         n/a         n/a         n/a         n/a 10.244.2.4        kind-worker2             91m
 default                 minio-druid-7496648fdf-hb6d4                        ●       1/1                    0 Running                1        85         n/a         n/a           4         n/a 10.244.3.8        kind-worker              87m
 default                 nifi-node-default-0                                 ●       1/1                    0 Running              180      3353          36           4          81          81 10.244.2.17       kind-worker2             86m
 default                 nifi-operator-deployment-cff9497b5-kfxv8            ●       1/1                    0 Running                0         6         n/a         n/a         n/a         n/a 10.244.3.4        kind-worker              90m
 default                 opa-operator-deployment-57bfbbc89c-v46jv            ●       1/1                    0 Running                0         4         n/a         n/a         n/a         n/a 10.244.1.4        kind-worker3             90m
 default                 postgresql-superset-0                               ●       1/1                    0 Running                5        35           2         n/a          13         n/a 10.244.2.10       kind-worker2             87m
 default                 secret-operator-daemonset-7pdgb                     ●       3/3                    0 Running                1        21         n/a         n/a         n/a         n/a 10.244.2.5        kind-worker2             90m
 default                 secret-operator-daemonset-868hw                     ●       3/3                    0 Running                1        21         n/a         n/a         n/a         n/a 10.244.1.5        kind-worker3             90m
 default                 secret-operator-daemonset-mn2lz                     ●       3/3                    0 Running                1        21         n/a         n/a         n/a         n/a 10.244.3.5        kind-worker              90m
 default                 setup-superset-hshsh                                ●       0/1                    5 Completed              0         0         n/a         n/a         n/a         n/a 10.244.3.10       kind-worker              86m
 default                 spark-k8s-operator-deployment-887f994ff-t5hf6       ●       1/1                    0 Running                0         5         n/a         n/a         n/a         n/a 10.244.3.6        kind-worker              90m
 default                 superset-6dx72                                      ●       0/1                    0 Completed              0         0         n/a         n/a         n/a         n/a 10.244.1.8        kind-worker3             86m
 default                 superset-druid-connection-import-cgmdc              ●       0/1                    0 Completed              0         0         n/a         n/a         n/a         n/a 10.244.1.16       kind-worker3             81m
 default                 superset-node-default-0                             ●       2/2                    0 Running                4       174         n/a         n/a         n/a         n/a 10.244.1.15       kind-worker3             82m
 default                 superset-operator-deployment-7c4c46c5d-rgbsq        ●       1/1                    0 Running                0         7         n/a         n/a         n/a         n/a 10.244.2.6        kind-worker2             89m
 default                 trino-operator-deployment-cf4748586-d796c           ●       1/1                    0 Running                1         5         n/a         n/a         n/a         n/a 10.244.1.6        kind-worker3             88m
 default                 zookeeper-operator-deployment-76956ccffb-rrbr6      ●       1/1                    0 Running                0         9         n/a         n/a         n/a         n/a 10.244.1.7        kind-worker2             88m
 default                 zookeeper-server-default-0                          ●       1/1                    0 Running               13       269         n/a         n/a         n/a         n/a 10.244.1.11       kind-worker3             86m
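For reference, a kind setup like the one above (one control plane plus the three workers `kind-worker`, `kind-worker2`, `kind-worker3` seen in the table) can be described with a cluster config file; this is a hypothetical sketch, not the exact config used here. The metrics server then gets installed on top (on kind it typically needs the `--kubelet-insecure-tls` flag since the kubelets use self-signed certificates).

```yaml
# kind-config.yaml -- illustrative sketch of a 3-worker kind cluster,
# created with: kind create cluster --config kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
  - role: worker
```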

This was after the cluster stabilized. CPU is in millicores, meaning a value of 1000 would be one core. Memory is specified in MB. The operators require almost no CPU or memory, even while reconciling. This can of course change if we have a huge number of clusters to reconcile (e.g. 1,000,000 NiFi clusters), so this should be specified. In general I would recommend about 1/5th or even 1/10th of a core and 50 to 100 MB of memory for each operator (this should be tested more rigorously). The biggest product parts were NiFi with about 3.3 GB of memory and the two Druid MiddleManagers with about 2 GB of memory each.
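Expressed as a Kubernetes resource section for an operator Deployment, the rough figures above would look something like this. The numbers are illustrative placeholders derived from the measurements, not tested recommendations:

```yaml
# Hypothetical per-operator resources, based on the rough figures above
resources:
  requests:
    cpu: 100m        # ~1/10 of a core
    memory: 64Mi
  limits:
    cpu: 200m        # ~1/5 of a core
    memory: 128Mi
```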

docker stats:

CONTAINER ID   NAME                 CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
5b13a847151c   kind-control-plane   12.80%    1.012GiB / 31.25GiB   3.24%     20.2MB / 130MB    118MB / 12.2MB    299
7f32cf575b9b   kind-worker3         8.13%     4.13GiB / 31.25GiB    13.21%    3.37GB / 1.39GB   6.75MB / 98.3kB   754
362844dd5007   kind-worker2         15.81%    4.882GiB / 31.25GiB   15.62%    2.56GB / 1.39GB   6.44MB / 98.3kB   464
63e79a8f56d5   kind-worker          5.71%     4.493GiB / 31.25GiB   14.38%    2.56GB / 46.6MB   3.4MB / 98.3kB    1257

Docker reports 31.25 GiB as the limit for each node (which is the overall memory of my machine), so the memory percentages should be multiplied by 4, assuming each node would get a quarter of the memory (minus OS overhead etc.).
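The "multiply by 4" estimate can be checked with a quick calculation, using `kind-worker3` from the `docker stats` output as the example:

```python
# Recompute the docker stats figures for kind-worker3 under the assumption
# that the host's 31.25 GiB were split evenly across the 4 kind nodes.
host_gib = 31.25
per_node_gib = host_gib / 4          # ~7.8 GiB per node

reported_pct = 13.21                 # MEM % reported by docker stats
used_gib = host_gib * reported_pct / 100   # ~4.13 GiB, matches MEM USAGE

# Against a quarter-sized node, the same usage is roughly 4x the percentage:
scaled_pct = used_gib / per_node_gib * 100
print(round(used_gib, 2), round(scaled_pct, 1))
```

So a node showing ~13% against the whole host would actually sit at roughly 53% of a quarter-sized node, before even subtracting OS overhead.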

I did some testing of applying CRs / reconciling (via docker exec -it kind-worker2 bash and running top). Applying a new cluster every 0.5 seconds, the operator's CPU did not exceed 2 percent:

PID   USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
22932 1000      20   0 1130012  22788  14256 S   2.0   0.1   0:00.37 stackable-zooke

Without the 0.5 s sleep it did not exceed 7 percent:

PID   USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
24402 1000      20   0 1131640  24244  14048 S   7.3   0.1   0:00.51 stackable-zooke

Memory stayed unchanged.

What do others do?

They basically list recommended system requirements (CPU, memory, disk etc.) for each role of each product. Additionally, they recommend e.g. increasing heap/memory depending on incoming connections and sometimes provide hints on where to configure this. For storage, they recommend standard disks vs. SSDs etc., depending on the product.

What do we need to specify

Stackable Data Platform

I think by running some tests with multiple clusters, we can fix some values for CPU and memory for each operator and/or the whole SDP.

Products

I think we cannot specify anything reliable for products. We can specify e.g. the minimum requirements to run our demos, or a minimal example. But we do not know the customer's data, queries etc., so we cannot give a proper estimate. I would refer here to the products and their own requirements/scaling documentation.

Cloudera e.g. lists hardware requirements (heap/memory, CPU, disk) for products. I assume these come from experience / testing; I could not verify that the values are taken from any product website (e.g. HBase).

Upper limits for resources / clusters

I tested with 1,000 custom resources (ZooKeeper, but with replicas set to 0) applied one after another in a script, and there were no issues (see the usage numbers above). The API server is doing more work than the operator... This should still be tested for every operator.
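A load-test script along these lines could generate the manifests and pipe them to `kubectl apply -f -`. This is a hypothetical sketch: the apiVersion and kind match the Stackable ZooKeeper operator CRD, but the exact spec fields (image version, role group layout) are illustrative and may differ from the operator version in use.

```python
# Sketch: generate N ZookeeperCluster manifests with replicas set to 0,
# as a multi-document YAML stream for `kubectl apply -f -`.
TEMPLATE = """\
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: zk-test-{i}
spec:
  servers:
    roleGroups:
      default:
        replicas: 0
"""


def render_manifests(n: int) -> str:
    """Concatenate n manifests into one YAML stream."""
    return "".join(TEMPLATE.format(i=i) for i in range(n))


if __name__ == "__main__":
    # e.g. python gen_zk_crs.py | kubectl apply -f -
    print(render_manifests(1000), end="")
```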

Minimum requirements for SDP and demos

I think at least for kind, and for what we are all testing every day, a laptop with 16 (?) to 32 GB of memory and 4 to 8 cores can easily run any demo we have produced so far.

Upper limits - How many resources can a single operator instance support?

Seeing the CPU and memory usage of the ZooKeeper operator, I think it can handle more resources than will ever be present in any cluster. But it probably makes sense to set an upper limit per cluster, just to be on the safe side.

Recommended Documentation Layout

The operator and product requirements should be captured in a table per operator/product with the following columns:

What is left TODO / Acceptance (must be further refined and split up into tickets):

Operators

Products

Cluster

Misc

Hopefully merged properly with https://github.com/stackabletech/documentation/issues/258

fhennig commented 2 years ago

Since Stefan mentions contracts, this seems to have at least some legal aspects to it. Does that document need to be versioned? If it's referenced from a contract, it should probably be somewhat "static".

Content wise, I'll brainstorm a few things:

What do similar companies do?

stefanigel commented 2 years ago

Maybe we should have sizing suggestions for a k8s cluster? Minimum 3 nodes, settings for k8s autoscaling because of resilience / failover, ... sizing of nodes etc. How many nodes minimum depending on number of data products / components?

maltesander commented 2 years ago

> Maybe we should have sizing suggestions for a k8s cluster? Minimum 3 nodes, settings for k8s autoscaling because of resilience / failover, ... sizing of nodes etc. How many nodes minimum depending on number of data products / components?

Good points. Added under Cluster.