The content below was migrated from the LD4P Kafka Spark Conversion Pipeline Work Cycle 1 (work originally done by Darren Weber). It should be made more generic to address over-arching DataOps and DevOps questions within DLSS, whether the hosting is cloud-based or not. Probably to be clarified in work with @eefahy.
Heuristic to determine the provisioning of a Kafka cluster
Step 1: Decide your processing time requirement
How many records do you need to process, and within what time constraint?
Step 2: Calculate the parameters of your processing
Apply the following formula to find the number of parallel processes (Npr) you will need, where Ptr is the processing time per record, Nr is the total number of records, and time is the allowed processing time:
Npr = (Ptr * Nr) / time
With the number of parallel processes, determine the number of records per process:
Records per parallel process = Nr / Npr
Calculate the maximum size of each partition:
Max partition size = record size * records per parallel process
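The Step 1-2 arithmetic can be sketched as below; every input value (record count, time budget, per-record time, record size) is an illustrative assumption, not a measurement:

```python
import math

# Step 1: the processing-time requirement (assumed numbers)
total_records = 10_000_000        # Nr: records to process
time_budget_s = 3600.0            # allowed wall-clock time (1 hour)
time_per_record_s = 0.002         # Ptr: estimated seconds per record
record_size_bytes = 4096          # assumed average serialized record size

# Step 2: parallelism needed to meet the time budget
# Npr = (Ptr * Nr) / time
parallel_processes = math.ceil(time_per_record_s * total_records / time_budget_s)

# Records handled by each parallel process (and hence per partition)
records_per_process = math.ceil(total_records / parallel_processes)

# Maximum partition size in bytes
max_partition_bytes = record_size_bytes * records_per_process

print(parallel_processes, records_per_process, max_partition_bytes)
```

With these assumed numbers the sequential work is 20,000 seconds, so 6 parallel processes fit the one-hour budget.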
Step 3: Set the virtual Spark cluster parameters
Decide the number of executors and the number of cores per executor such that they comply with the following constraint:
num executors * cores per executor <= number of parallel tasks (i.e. Npr)
Calculate the memory of each executor:
Executor memory >= memory per task * number of tasks per executor + memory for internals + overhead
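The Step 3 constraint and memory formula can be checked with a small sketch like the one below; the cluster shape and all memory figures are assumptions for illustration, not recommendations:

```python
target_parallel_tasks = 6         # Npr from Step 2 (assumed)
num_executors = 3                 # chosen cluster shape (assumed)
cores_per_executor = 2            # one task runs per core at a time

# Constraint: total cores must not exceed the number of parallel tasks
assert num_executors * cores_per_executor <= target_parallel_tasks

# Executor memory >= memory per task * tasks per executor + internals + overhead
memory_per_task_gb = 2.0          # assumed working set per task
tasks_per_executor = cores_per_executor
internal_memory_gb = 1.0          # Spark internal structures (assumed)
overhead_gb = 0.5                 # JVM / off-heap overhead (assumed)

executor_memory_gb = (memory_per_task_gb * tasks_per_executor
                      + internal_memory_gb + overhead_gb)
print(executor_memory_gb)  # minimum memory to request per executor
```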
Step 4: Size the parameters of the physical cluster
Supporting material to finalize the fine grained sizing: