Open fhennig opened 2 years ago
Since Stefan mentions contracts, this seems to have at least some legal aspects to it. Does that document need to be versioned? If it's referenced from a contract, it should probably be somewhat "static".
Content-wise, I'll brainstorm a few things:
What do similar companies do?
Maybe we should have sizing suggestions for a k8s cluster? Minimum 3 nodes, settings for k8s autoscaling because of resilience / failover, ... sizing of nodes etc. How many nodes minimum depending on number of data products / components?
Good points. Added under Cluster
.
I did some research using the provided Cloudera and OpenShift links and checked the CPU and memory consumption of some of the operators / products. I ran the [waterlevel](https://docs.stackable.tech/stackablectl/stable/demos/nifi-kafka-druid-water-level-data.html) demo.
I used kind in order to be able to run it on a laptop locally and installed a metrics server for some insights:

```shell
docker exec -it kind-worker2 bash
```
This was after the cluster stabilized. CPU is in millicores, meaning a value of 1000 equals one core. Memory is specified in MB. The operators (even when reconciling) require almost no CPU/memory. This can of course change if we have e.g. 1,000,000 NiFi clusters to reconcile, so this should be specified. In general I would recommend about 1/5th or even 1/10th of a core and 50 to 100 MB of memory for each operator (this should be tested more reliably). The biggest product parts were NiFi with about 3.3 GB of memory and the two Druid MiddleManagers with about 2 GB each.
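Expressed as Kubernetes resource requests/limits, the per-operator recommendation above could look roughly like this (the concrete values are assumptions to be refined by testing, not verified operator defaults):

```yaml
# Sketch only: rough per-operator values derived from the observations above,
# not verified defaults for any Stackable operator.
resources:
  requests:
    cpu: 100m      # ~1/10 core
    memory: 64Mi   # within the 50-100 MB estimate
  limits:
    cpu: 200m      # ~1/5 core
    memory: 128Mi
```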
docker stats:
This shows 31.25 GB for each node (which is the overall memory of my computer), so the memory percentages should be multiplied by 4, assuming each node would get a quarter (minus OS etc.).
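As a quick sanity check of that scaling (using the numbers from this test setup only):

```shell
# docker stats reports MEM% against the full 31.25 GB host; with 4 equal
# nodes each one effectively gets a quarter, so scale percentages by 4.
awk 'BEGIN {
  host_gb = 31.25; nodes = 4
  printf "%.2f GB effective per node\n", host_gb / nodes
  printf "a reported 2%% is effectively %.0f%%\n", 2 * nodes
}'
```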
I did some testing applying CRs / reconciling (`docker exec -it kind-worker2 bash` and running `top`): applying a new cluster every 0.5 seconds, the CPU did not exceed 2 percent. Without the 0.5 s sleep it did not exceed 7 percent.
Memory stayed unchanged.
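The loop could be sketched like this (a hypothetical reconstruction, not the actual script used; the manifest rendered here is a placeholder, not a real CR):

```shell
#!/usr/bin/env sh
# Hypothetical reconstruction of the apply loop, not the actual test script.
# render prints a minimal placeholder manifest; in practice any cluster CR
# with a unique name would be substituted here.
render() {
  printf 'metadata:\n  name: test-cluster-%s\n' "$1"
}

i=0
while [ "$i" -lt 100 ]; do
  i=$((i + 1))
  render "$i"
  # render "$i" | kubectl apply -f -   # the actual apply step on a live cluster
  # sleep 0.5                          # omit for the "without the sleep" variant
done
```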
What do others do?
They basically list each role of each product with recommended system requirements (CPU, memory, disk etc.). Additionally, they recommend e.g. increasing heap/memory depending on incoming connections and sometimes provide hints on where to increase it. For disks they recommend standard disks vs. SSDs etc. depending on the product.
What do we need to specify?
Stackable Data Platform
I think by running some tests with multiple clusters, we can fix some values for CPU and memory for each operator and/or the whole SDP.
Products
I think we cannot specify anything reliable for products. We can specify e.g. the minimum requirements to run our demos, or a minimal example. But we do not know the customers' data, queries etc. in order to give a proper estimation. I would refer here to the products and their requirements/scaling.
Cloudera e.g. lists hardware requirements (heap/memory, CPU, disk) for products. I assume they come from experience/testing; I could not verify that the values are taken from any product website (e.g. HBase).
Upper limits for resources / clusters
I tested with 1,000 custom resources (ZooKeeper, but with replicas set to 0) applied one after another in a script and there were no issues (see the usage figures above). The API server is doing more work than the operator... This should still be tested for every operator.
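Such a script might look like the following sketch; the `ZookeeperCluster` spec fields are approximated from memory of the CRD and may not match the current schema exactly:

```shell
#!/usr/bin/env sh
# Sketch of the 1000-CR load test. The spec fields below are an approximation
# of the Stackable ZookeeperCluster CRD, not a verified schema.
n=1000
for i in $(seq 1 "$n"); do
  cat <<EOF
---
apiVersion: zookeeper.stackable.tech/v1alpha1
kind: ZookeeperCluster
metadata:
  name: zk-load-test-$i
spec:
  servers:
    roleGroups:
      default:
        replicas: 0   # objects for the operator to reconcile, but no pods
EOF
done > zk-load-test.yaml
# kubectl apply -f zk-load-test.yaml   # then watch operator / API server usage
```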
Minimum requirements for SDP and demos
I think, at least for kind and what we are all testing every day, a laptop with (16? to) 32 GB of memory and 4 to 8 cores can easily run any demo we have produced so far.
Upper limits - How many resources can a single operator instance support?
Seeing the CPU and memory usage of the ZooKeeper operator, I think it can handle more resources than will ever be present in any cluster. But it probably makes sense to set an upper limit per cluster just to be on the safe side.
Recommended Documentation Layout
The operator and product requirements should be captured in a table per operator/product with the following columns:
What is left TODO / Acceptance (must be further refined and split up into tickets):
Operators
Products
Cluster
Misc
Hopefully merged properly with https://github.com/stackabletech/documentation/issues/258