network-analytics / mdt-dialout-collector

Model-Driven Telemetry - Collecting <multi-vendor> metrics via gRPC dialout
MIT License

install.sh vcpu count has issues in Kubernetes/OpenShift #25

Open sgaragan opened 4 months ago

sgaragan commented 4 months ago

When running the install.sh script, it determines the number of vCPUs by the following line:

readonly available_vcpu=$(egrep 'processor' /proc/cpuinfo | wc -l)

Later in the script, the value is used to determine the number of jobs to run for the 'make' command:

make -j`echo $((${available_vcpu} - 1))`

When running in Kubernetes/OpenShift, "egrep 'processor' /proc/cpuinfo | wc -l" returns the total number of processors on the node running the build, because /proc/cpuinfo is not scoped by cgroup. As a result, make runs with roughly as many jobs as the node has cores (in our case, 72), which sends RAM usage through the roof.
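For illustration, a minimal sketch of how a script might read the container's CPU quota from the cgroup filesystem instead of counting entries in /proc/cpuinfo. The helper name cgroup_cpu_limit is hypothetical and not part of install.sh. The paths are the ones commonly seen for cgroup v2 (cpu.max) and cgroup v1 (cpu/cpu.cfs_quota_us, cpu/cpu.cfs_period_us) and may differ depending on the container runtime. A quota is normally only present when the pod has a CPU limit set.

```shell
#!/bin/sh
# Hypothetical helper (not part of install.sh): print the CPU limit implied
# by the cgroup CPU quota, rounded up to a whole CPU. Prints nothing when
# no quota is configured.
# $1 = cgroup mount root, defaults to /sys/fs/cgroup (assumed location).
cgroup_cpu_limit() {
  root="${1:-/sys/fs/cgroup}"
  quota=""
  period=""
  if [ -r "$root/cpu.max" ]; then
    # cgroup v2: one line, "<quota> <period>" or "max <period>"
    read -r quota period < "$root/cpu.max"
  elif [ -r "$root/cpu/cpu.cfs_quota_us" ] && [ -r "$root/cpu/cpu.cfs_period_us" ]; then
    # cgroup v1: separate files; quota is -1 when unlimited
    read -r quota < "$root/cpu/cpu.cfs_quota_us"
    read -r period < "$root/cpu/cpu.cfs_period_us"
  fi
  case "$quota" in
    ''|max|-1) return 0 ;;  # no quota set: nothing to report
  esac
  case "$period" in
    ''|0) return 0 ;;       # malformed or missing period
  esac
  # Integer ceiling of quota / period
  echo $(( ($quota + $period - 1) / $period ))
}
```

A caller could fall back to the /proc/cpuinfo count whenever the function prints nothing.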

I was able to fix this by manually changing the value to 2 in the Dockerfile, but if this could be added as an argument to install.sh (or made overridable via an environment variable), the build would work more effectively in a k8s-based environment.
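A rough sketch of what an environment-variable override might look like. The variable name MDT_AVAILABLE_VCPU and the helpers detect_vcpu and make_jobs are hypothetical; install.sh does not provide them today. The fallback keeps the existing /proc/cpuinfo count, and the job count is floored at 1 so that a value of 1 does not result in "make -j0".

```shell
#!/bin/sh
# Hypothetical sketch, not the current install.sh behaviour.

# Use MDT_AVAILABLE_VCPU (assumed name) when set, otherwise count the
# processor entries in /proc/cpuinfo as the script does now.
detect_vcpu() {
  if [ -n "${MDT_AVAILABLE_VCPU:-}" ]; then
    echo "$MDT_AVAILABLE_VCPU"
  else
    grep -c 'processor' /proc/cpuinfo
  fi
}

# Number of make jobs: one fewer than the vCPU count, but never below 1.
make_jobs() {
  n=$(( $1 - 1 ))
  if [ "$n" -lt 1 ]; then
    n=1
  fi
  echo "$n"
}

# Intended use inside the script (left commented out in this sketch):
#   available_vcpu=$(detect_vcpu)
#   make -j"$(make_jobs "$available_vcpu")"
```

With something like this in place, a Dockerfile could set MDT_AVAILABLE_VCPU before calling install.sh rather than patching the script with sed.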

Thanks, Sean

scuzzilla commented 4 months ago

@sgaragan many thanks for your request. As a potential quick & easy solution to your problem, you might consider setting the global variable 'available_vcpu' to a value >= 3. For reference, see: https://github.com/network-analytics/mdt-dialout-collector/blob/main/install.sh#L44.

ustorbeck commented 4 months ago

You could also try the nproc command, which is smarter about reporting the number of cores assigned to a process. But I don't know whether it's available on all platforms or how it behaves on Kubernetes.

sgaragan commented 4 months ago

Thanks for the responses. We tried nproc, but sadly it is not namespace-aware, so it too reports the total number of cores available to the node. We added the following to the Dockerfile to hardcode the value:

RUN cd /opt/mdt-dialout-collector && sed -i 's/available_vcpu=.*/available_vcpu=2/g' ./install.sh

This does the trick, but it took a while to figure out why the builds were blowing up memory usage.