blt opened 2 years ago
I started taking a look at this today. I think I have a path forward for conditionally enabling profiling (so that it can be a separate CI step given it impacts reported performance).
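Concretely, the gate I have in mind is something like the following in the soak runner; `SOAK_PROFILE` and `start_perf_capture` are placeholder names, nothing by those names exists in `soak/soak.sh` today:

```sh
# Hypothetical opt-in gate for profiling in the soak runner. The variable
# and helper names are illustrative only; the point is that the default CI
# path never pays the profiling cost.
if [ "${SOAK_PROFILE:-false}" = "true" ]; then
  # placeholder for whatever capture mechanism we settle on (perf, bcc, ...)
  start_perf_capture
fi
```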
The first bump I ran into was trying to get `perf` working in the container due to the distro package not matching the kernel version. I'm investigating ways around this (maybe by using `perf` from the host or mounting `perf` into the container).
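For the record, the mount-from-host variant I'm experimenting with looks roughly like this. The image name and paths are placeholders, it assumes the host's `perf` binary runs against the container's libc, and you may also need to relax the container's seccomp profile and/or `kernel.perf_event_paranoid` on the host:

```sh
# Untested sketch: borrow the host's perf binary so the tool matches the
# running kernel, then profile vector inside the soak container. HOST_PERF
# should point at the real versioned binary (on Ubuntu hosts it typically
# lives under /usr/lib/linux-tools-$(uname -r)/perf, not the /usr/bin wrapper).
HOST_PERF=${HOST_PERF:-/usr/lib/linux-tools-$(uname -r)/perf}
docker run --rm \
  --cap-add SYS_ADMIN \
  -v "${HOST_PERF}":/usr/local/bin/perf:ro \
  soak-image:latest \
  /usr/local/bin/perf record -F 99 -g --output /tmp/perf.data -- \
    /usr/bin/vector --config /etc/vector/vector.toml
```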
Some references I found useful:
- https://medium.com/@geekidea_81313/running-perf-in-docker-kubernetes-7eb878afcd42
- https://gendignoux.com/blog/2019/11/09/profiling-rust-docker-perf.html
I'll continue poking at this when I get back as I have time, but if anyone has a strong motivation and free time to pick it up from me, just let me know!
I've been thinking about this too. I think it's worth taking a step back to restate the goals of our soak test infra. We want:
When we say 'local' we really mean on Linux and on OS X. Windows is a no-go without a Windows developer contributing changes, or evidence that Windows is a tier-1 concern for production. To be fair, OS X is not either, but we do have a number of people developing Vector on OS X. We don't support the BSDs in the soaks outright; if they happen to work, that's a happy accident.

Now, to get consistent behavior between OS X and Linux we've used minikube as the base environment for the soaks. The use of minikube has been a mixed bag. For most of our Linux devs it works straight out of the box, and it does give us a mostly-repeatable, clean environment to run vector and the soak support tools (lading, prometheus). On OS X it has been less reliable, and not every Linux dev has managed to get minikube running either. There's also, I'm sure, some throughput loss from running in a container, but since that loss is equivalent for baseline and comparison I'm not super bothered by it.
Anyway, all that's to say that running in minikube hampers our ability to satisfy this ticket. There is a section in the minikube project docs about running bcc in cluster here, but it hasn't been updated in a year as of this writing and it strikes me as, uh, very cursory. I'm not even sure what the `minikube-performance/minikube.iso` is, and I can't find where they build it, so how is it different from the baseline ISO? Admittedly I didn't look terribly hard but, eh. Not a lot of confidence that we wouldn't be trailblazing here.
All this is to say, I'm not married to the idea that we have to use containers in the soak infra: it's an implementation detail and nothing more. I have wondered from time to time what things would look like if we ran the soaks in a VM on non-Linux hosts and used systemd to orchestrate directly on Linux.
I've been working on #10730, and while hiking over the weekend it occurred to me that we could maybe solve two issues at once. Today we soak inside minikube, as I discussed in my last comment, not as an end-goal but on the assumption that it would make reproducibility easier and give us the ability to spin up the bits and pieces needed to support a soak. In practice we have found that we prefer, to the point of exclusion, to use lading for the sources and sinks external to vector; it gives us better control than, say, running a real elasticsearch in a minikube. What if we go one step further and assert that soaks are only done in terms of a rig we fully control? That is, extend lading with a hyperfine-like mode where we run vector in a rig, run load through it, and measure the throughput, all within this new lading mode, with vector as a fully managed sub-process. We drop minikube entirely: run in a container or something if you go through `soak/soak.sh`, and on bare metal in CI to avoid that overhead. If the new lading mode detects that it's running on Linux it can segment the sub-process onto isolated CPUs, start perf hooks, etc.
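To make the Linux path concrete, here is roughly what that mode would do under the hood, expressed as shell rather than the Rust lading would actually use; CPU numbers, timings, and paths are all illustrative:

```sh
# Illustrative only: pin vector onto CPUs reserved at boot (e.g. via
# isolcpus=2,3 on the kernel command line), attach perf for the length of
# the soak, and leave lading free to drive load from the other CPUs.
taskset --cpu-list 2,3 /usr/bin/vector --config soak.toml &
vector_pid=$!
perf record -F 99 -g -p "${vector_pid}" --output perf.data &
perf_pid=$!

# ... lading generates load and samples throughput here ...
sleep 120

kill "${vector_pid}"
wait "${perf_pid}"   # perf flushes perf.data once the target exits
```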
I want to build a prototype, but the idea seemed good mid-hike and seems like it would address the noise issues, the soak-running problems some users report, and this ticket. It does make hard-to-fake environments, like #9751, hard to support, so we'd probably need to keep the kube approach as a fallback.
FWIW I believe the general approach we take in #10800 can be reused for including links to flamegraph artifacts.
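For context, the flamegraph material itself would come out of the usual perf-to-flamegraph pipeline, something along these lines. It assumes a `perf.data` like the one captured in the sketch above and Brendan Gregg's FlameGraph scripts checked out locally; none of this is wired into the soak analysis yet:

```sh
# Standard collapse-and-render steps; the resulting SVG is what we'd upload
# as a CI artifact and link from the soak report. GITHUB_SHA ties the graph
# to the commit under test when run in Actions.
perf script --input perf.data > out.stacks
./FlameGraph/stackcollapse-perf.pl out.stacks > out.folded
./FlameGraph/flamegraph.pl out.folded > "vector-soak-${GITHUB_SHA:-local}.svg"
```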
At present we have soak tests that run and give detailed numbers about throughput, but no interior feedback about the behavior of the running vector. This means that users generate their own flamegraphs in non-repeatable environments, without necessarily capturing the SHA being experimented on. We can capture flamegraph material from the running soaks and present it in the final analysis. Constraints:
Considerations: