Closed yonch closed 10 hours ago
@yonch
Ah great find!
Are the gitee links mirrors of this GitHub repo? https://github.com/intel/observability-telegraf
If so, the about mentions official images on https://hub.docker.com/r/intel/observability-telegraf
Yes, the gitee links are mirrors of the GitHub repo. Awesome! We already have official images on Docker Hub.
I will pull the image and run it with the input plugin enabled on my Linux machine with file output.
Running the image requires both a supported kernel version and a supported CPU.
I have a supported kernel, which I verified with the command below. See https://github.com/nyrahul/linux-kernel-configs?tab=readme-ov-file#resctrl-resource-control-support
root@xxxxx-xx:~# cat /boot/config-$(uname -r) | grep CONFIG_X86_CPU_RESCTRL
CONFIG_X86_CPU_RESCTRL=y
I don’t have a CPU with Intel RDT support, which is required to run the image. Intel lists the supported processors in Appendix A of this document. See https://cdrdv2-public.intel.com/789566/356688-intel-rdt-arch-spec.pdf#page76
I used the command below to verify I don’t have RDT support in my CPU.
root@xxxx-xx:~# cpuid --one-cpu --leaf=0x7 --subleaf=0x0 | grep RDT
RDT-CMT/PQoS cache monitoring = false
RDT-CAT/PQE cache allocation = false
I was also able to verify I have a supported kernel.
I will get a machine that has the necessary supported CPU to test over the weekend.
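As a rough cross-check of the cpuid result above, the kernel's own view of the CPU can be inspected too. This is only a sketch: `rdt_flags` is a hypothetical helper name, and it assumes the kernel reports RDT features as flags such as `cqm`, `cat_l3`, `rdt_a` and `mba` in /proc/cpuinfo.

```shell
# Hypothetical helper: print the RDT-related CPU flags found in a
# cpuinfo-style file (defaults to /proc/cpuinfo), space separated.
# Assumption: RDT features show up as the flag names listed below.
rdt_flags() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Take the first "flags" line, put one flag per line, keep exact matches only.
    grep -m1 '^flags' "$cpuinfo" 2>/dev/null | tr ' \t' '\n\n' |
        grep -E -x 'rdt_a|cqm|cqm_llc|cqm_occup_llc|cqm_mbm_total|cqm_mbm_local|cat_l3|cat_l2|cdp_l3|cdp_l2|mba' |
        tr '\n' ' ' | sed 's/ $//'
}
```

Calling `rdt_flags` with no argument reads /proc/cpuinfo. Empty output would suggest the kernel sees no RDT features on that machine, which should agree with the `false` values cpuid printed.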
Two Intel engineers at KubeCon said that pqos should be deprecated. They say it was written before resctrl and manipulates CPU registers directly. On a system with resctrl, the two would race and interfere.
So the Telegraf plugin might not be our best bet. They said they have a more modern implementation for Kubernetes. I think this might be it: https://github.com/intel/cri-resource-manager
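For comparison, reading cache occupancy through resctrl instead of pqos could look roughly like the sketch below. It assumes resctrl is mounted (commonly with `mount -t resctrl resctrl /sys/fs/resctrl`) and that the CPU supports occupancy monitoring, so the files `mon_data/mon_L3_*/llc_occupancy` exist. `llc_occupancy` is a hypothetical helper name, not part of any Intel tool.

```shell
# Hypothetical helper: print "<domain> <bytes>" for each L3 monitoring
# domain of the default resctrl group.
# Assumption: resctrl is mounted at the given root (default /sys/fs/resctrl).
llc_occupancy() {
    root="${1:-/sys/fs/resctrl}"
    for f in "$root"/mon_data/mon_L3_*/llc_occupancy; do
        [ -r "$f" ] || continue   # glob did not match, or file is unreadable
        domain="$(basename "$(dirname "$f")")"
        printf '%s %s\n' "$domain" "$(cat "$f")"
    done
}
```

Because this only reads files the kernel exposes, it does not touch the CPU registers directly, which is the concern raised about pqos.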
For Intel, we still need an Intel CPU with RDT support, and resctrl depends on having a supported kernel version. The RDT documentation says to manipulate the registers (see 19.18.3), which is what I did with the cpuid tool I found on Ubuntu to check for support. If we stand up our bare-metal Kubernetes nodes for containers, we should be okay, and in a future kernel version we should also be okay running on our QEMU-based VM hypervisors. Support for QEMU is being added: https://patchew.org/QEMU/20240905112237.3586972-1-whendrik@google.com/ It won't hurt to try on cloud VMs, I'm just not optimistic it will work.
Given the deprecation of pqos, I'm closing this issue. Planning to open a goresctrl issue instead.
The initial plan is to run telegraf on a test cluster and use the Intel RDT plugin to collect measurements. This will help us understand what's missing (if anything) for the memory-collector.
How would we get to a telegraf deployment that supports RDT? It seems that it would need the pqos utility, and the README contains instructions. So is there:
A solved issue will answer the above.