[Open] eiffel-fl opened this issue 2 years ago
Francis, you assigned this bug to yourself. Do you have a specific test in mind? What are you planning to do about this issue?
> Do you have a specific test in mind? What are you planning to do about this issue?
Yes, the idea is to run this on AKS with a high number of nodes (let's call it N). On each of the N nodes, we run a pod which does something like this:
```sh
S="${S:-60}" # Test duration in seconds (60 is a placeholder value).
forks=0
launched=$(date +%s)
while [[ $(date +%s) -lt $((launched + S)) ]]; do
  ls > /dev/null # Any command which is not a builtin and as fast as possible.
  forks=$((forks + 2)) # The date in the loop condition counts as a fork too.
done
echo "Total number of forks: ${forks}"
```
So, for each node, we would get the number of forks it did in S seconds.
During this time, we would monitor all these pods with `kubectl gadget trace exec`. Then, at the end of the S seconds, we should compare the number of events printed by inspektor-gadget.
If this number equals the sum of the fork counts reported by all N pods, then we can conclude inspektor-gadget scales up to N nodes.
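
As a minimal sketch of that comparison step, assuming `kubectl gadget trace exec` supports `-o json` with one event per line (the log file name and the `EXPECTED` and `N` variables are placeholders for values collected by the test harness):

```sh
# Capture exec events for the whole test window, then count them.
timeout "${S}" kubectl gadget trace exec -o json > trace-exec.log
observed=$(wc -l < trace-exec.log)
# EXPECTED: sum of the per-pod fork counts scraped from the pods' logs.
if [[ "${observed}" -eq "${EXPECTED}" ]]; then
  echo "No events lost: traced all ${EXPECTED} forks across ${N} nodes."
else
  echo "Lost $((EXPECTED - observed)) events."
fi
```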
We should test those IG components:

- test `ig trace exec` in addition to `kubectl-gadget trace exec` (this could be done with a `kubectl debug node` command)

It would be interesting to report the cpu usage (in percent) during the tests:

- cpu usage in userspace by the `ig` or `gadgettracermanager` process.
- cpu profile (`sudo /usr/share/bcc/tools/profile -p $PID_OF_IG`) and report the most common stacks (see the sketch after this list).
- cpu usage by IG's eBPF programs (using `ig top ebpf` to measure).
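
A minimal sketch of the first two measurements, assuming a single `ig` process per node, `pidstat` from the sysstat package, and the standard BCC `profile` options (`-f` for folded output, duration as the positional argument):

```sh
# Find the ig process (assumes exactly one ig process runs on the node).
PID_OF_IG=$(pgrep -xo ig)
# Per-second userspace CPU of that process for S seconds.
pidstat -u -p "${PID_OF_IG}" 1 "${S}" &
# Folded stack samples over the same window.
sudo /usr/share/bcc/tools/profile -f -p "${PID_OF_IG}" "${S}" > stacks.folded
# Folded lines end with a sample count, so sorting on it surfaces the hottest stacks.
sort -k2 -rn stacks.folded | head
```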
> test `ig trace exec` in addition to `kubectl-gadget trace exec` (this could be done with a `kubectl debug node` command)
I do not really understand.
The whole goal of testing `kubectl gadget trace exec` was to test the scaling of Inspektor Gadget across several nodes. If you want to test `ig`, you do not need to test it on several nodes, just on a big one (e.g. 96 cores or even more). Otherwise, you would just test the scaling of `kubectl debug node`.
Nonetheless, testing both vertical scaling (the size of a node) and horizontal scaling (the number of nodes) is interesting. So far, I think we only tested horizontal scaling in #803.
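
For reference, a hedged sketch of running `ig` on one specific node through an ephemeral debug container (the image name and the `--host` flag follow the upstream ig documentation as I understand it, and `<node-name>` is a placeholder; verify both against your IG version):

```sh
# Run ig directly on a chosen worker node via kubectl debug.
kubectl debug node/<node-name> -it \
  --image=ghcr.io/inspektor-gadget/ig -- ig trace exec --host
```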
> It would be interesting to report the cpu usage (in percent) during the tests:
>
> - cpu usage in userspace by the `ig` or `gadgettracermanager` process.
> - cpu profile (`sudo /usr/share/bcc/tools/profile -p $PID_OF_IG`) and report the most common stacks.
> - cpu usage by IG's eBPF programs (using `ig top ebpf` to measure).
With the second, you only get the CPU usage of the eBPF program, right? It would be interesting to get both the eBPF program and the golang code, particularly to understand at what cost we scale.
> The whole goal of testing `kubectl gadget trace exec` was to test the scaling of Inspektor Gadget across several nodes. If you want to test `ig`, you do not need to test it on several nodes, just on a big one (e.g. 96 cores or even more).
I was thinking that when running `ig` on a single worker node, it might still have scaling issues if there are a lot of nodes and a lot of service endpoints, because the operators kubeipresolver and kubenameresolver still need to get the list of endpoints, e.g. for the `trace tcp` gadget. But now I see `ig` does not yet use those operators. cc @burak-ok
> With the second, you only get the CPU usage of the eBPF program, right? It would be interesting to get both the eBPF program and the golang code, particularly to understand at what cost we scale.
Yes, exactly. Some users just want to know a global IG cpu usage percentage that includes both the eBPF programs and userspace for specific scenarios.
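
A rough sketch of how such a global figure could be approximated; summing the two readings is an assumption of mine, not an official IG metric:

```sh
# Userspace side: average %CPU of the ig process (pidstat prints an Average line last).
pidstat -u -p "$(pgrep -xo ig)" 1 "${S}" | tail -n 1
# eBPF side: per-program CPU while the scenario runs (interactive view).
sudo ig top ebpf
# Approximate global cost = userspace %CPU + sum of the eBPF programs' %CPU.
```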
Current situation

We do not know if `inspektor-gadget` scales well. Normally it should, as a gadget pod is deployed on each node, but there may be a bottleneck in gathering the data from each node.

Impact

Maybe `inspektor-gadget` does not scale over X nodes, who knows?

Ideal future situation

We should be able to know the number of nodes `inspektor-gadget` can handle. If some minor modifications can be added to increase its scaling, they should be added.