perfpod / memory-collector

A Kubernetes-native collector for monitoring memory subsystem interference between pods
Apache License 2.0

What packaging should we use for telegraf? #7

Closed yonch closed 10 hours ago

yonch commented 2 weeks ago

The initial plan is to run telegraf on a test cluster and use the Intel RDT plugin to collect measurements. This will help us understand what's missing (if anything) for the memory-collector.

How would we get to a telegraf deployment that supports RDT? It seems that it would need the pqos utility, and the README contains instructions.

So is there:

A solved issue will answer the above.

omokpo commented 1 week ago

@yonch

yonch commented 1 week ago

Ah great find!

Are the gitee links mirrors of this GitHub repo? https://github.com/intel/observability-telegraf

If so, the About section mentions official images on https://hub.docker.com/r/intel/observability-telegraf

omokpo commented 1 week ago

Yes, the gitee links are mirrors of that GitHub repo. Awesome! We already have official images on Docker Hub.

I will pull the image and run it with the input plugin enabled on my Linux machine with file output.

omokpo commented 1 week ago

Running the image requires a supported kernel version and a supported CPU.

I have a supported kernel, which I verified with the command below. See https://github.com/nyrahul/linux-kernel-configs?tab=readme-ov-file#resctrl-resource-control-support

root@xxxxx-xx:~# cat /boot/config-$(uname -r) | grep CONFIG_X86_CPU_RESCTRL
CONFIG_X86_CPU_RESCTRL=y
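The same check can be scripted, for example to run it across several nodes. This is a minimal sketch, assuming the kernel config is available as plain text at /boot/config-<release> (some distributions do not install it there). The function name `resctrl_enabled` is made up for illustration.

```python
import platform
from pathlib import Path


def resctrl_enabled(config_text: str) -> bool:
    """Return True if the kernel config text sets CONFIG_X86_CPU_RESCTRL=y."""
    for line in config_text.splitlines():
        if line.strip() == "CONFIG_X86_CPU_RESCTRL=y":
            return True
    return False


if __name__ == "__main__":
    # Assumption: the distro installs the config next to the kernel image.
    config_path = Path(f"/boot/config-{platform.release()}")
    if config_path.exists():
        print(resctrl_enabled(config_path.read_text()))
    else:
        print(f"{config_path} not found; cannot check")
```

A commented-out entry such as `# CONFIG_X86_CPU_RESCTRL is not set` does not match, so it is reported as disabled.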

I don’t have a machine whose CPU has the Intel RDT support needed to run the image. Intel lists the recommended CPUs in Appendix A of this document. See https://cdrdv2-public.intel.com/789566/356688-intel-rdt-arch-spec.pdf#page76

I used the command below to verify I don’t have RDT support in my CPU.

root@xxxx-xx:~# cpuid --one-cpu --leaf=0x7 --subleaf=0x0 | grep RDT
RDT-CMT/PQoS cache monitoring = false
RDT-CAT/PQE cache allocation = false
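On Linux, the kernel also reports the RDT features it detected as flags in /proc/cpuinfo, which gives a second check that does not need the cpuid tool. This is a rough sketch, assuming the flag names listed in the kernel's resctrl documentation (`rdt_a`, `cat_l3`, `mba`, `cqm`, and so on). `rdt_flags` is a hypothetical helper, not an existing API.

```python
from pathlib import Path

# Flag names as listed in the Linux kernel resctrl documentation.
RDT_FLAGS = {
    "rdt_a", "cat_l3", "cat_l2", "cdp_l3", "cdp_l2", "mba",
    "cqm", "cqm_llc", "cqm_occup_llc", "cqm_mbm_total", "cqm_mbm_local",
}


def rdt_flags(cpuinfo_text: str) -> set[str]:
    """Return the RDT-related flags found on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return RDT_FLAGS & set(value.split())
    return set()


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        found = rdt_flags(cpuinfo.read_text())
        print(sorted(found) if found else "no RDT flags reported")
    else:
        print("/proc/cpuinfo not found; not a Linux host?")
```

An empty result here would be consistent with the `false` values in the cpuid output above.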

I was also able to verify I have a supported kernel.

I will get a machine that has the necessary supported CPU to test over the weekend.

yonch commented 1 week ago

Two Intel engineers at KubeCon said that pqos should be deprecated. They say it was written before resctrl and manipulates CPU registers directly. In a system with resctrl, the two would race and interfere.
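For context on the resctrl side: once resctrl is mounted (conventionally at /sys/fs/resctrl), monitoring counters are exposed as plain files, so a collector can read them without touching CPU registers itself. This is a minimal sketch, assuming the `mon_data/mon_L3_*` layout and the `llc_occupancy`, `mbm_total_bytes`, and `mbm_local_bytes` file names from the kernel's resctrl documentation. `read_mon_data` is an illustrative name, not part of any library.

```python
from pathlib import Path

COUNTERS = ("llc_occupancy", "mbm_total_bytes", "mbm_local_bytes")


def read_mon_data(group_dir: Path) -> dict[str, dict[str, int]]:
    """Read per-L3-domain counters for one resctrl group directory.

    Returns e.g. {"mon_L3_00": {"llc_occupancy": 4096, ...}}. Counters that
    are missing or not numeric (e.g. "Unavailable") are skipped.
    """
    result: dict[str, dict[str, int]] = {}
    for domain in sorted((group_dir / "mon_data").glob("mon_L3_*")):
        values: dict[str, int] = {}
        for name in COUNTERS:
            try:
                values[name] = int((domain / name).read_text().strip())
            except (OSError, ValueError):
                continue
        result[domain.name] = values
    return result


if __name__ == "__main__":
    # Assumption: resctrl is mounted at the conventional location.
    print(read_mon_data(Path("/sys/fs/resctrl")))
```

On a machine without resctrl mounted, this simply prints an empty dict.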

So the Telegraf plugin might not be our best bet. They said they have a more modern implementation for Kubernetes. I think this might be it: https://github.com/intel/cri-resource-manager

omokpo commented 1 week ago

For Intel, we still need an Intel CPU with RDT support, and resctrl depends on having a supported kernel version. The RDT documentation says to manipulate the registers (see 19.18.3), as I did with the cpuid tool I found on Ubuntu to check for support. If we stand up our bare-metal Kubernetes nodes for containers, we should be okay with running our VM hypervisors based on QEMU in a future kernel version. Support for QEMU is being added: https://patchew.org/QEMU/20240905112237.3586972-1-whendrik@google.com/ It won't hurt to try on cloud VMs; I'm just not optimistic it will work.

yonch commented 10 hours ago

Given the deprecation of pqos, I'm closing this issue. I plan to open a goresctrl issue instead.