I managed to find why %softirq is too high.
I ran an experiment where I repeated the iperf test 4 times:
We can see that %irq is significantly higher when we're running kprobe and there are multiple src IPs.
This is because a single connection's traffic is handled by a single ksoftirqd. The moment you have multiple busy connections, you've got multiple busy ksoftirqds, and all of them want to write to the same kprobe socket and have to share a lock.
We can shard kprobe work across multiple kprobe instances. How the packets get distributed across kprobe instances probably depends on AF_PACKET fanout primitives - see Alistair's great idea here: https://kentik.slack.com/archives/C02K282D85N/p1710950807748369?thread_ts=1710949860.784639&cid=C02K282D85N. We predict that PACKET_FANOUT_CPU will yield the best results, since there will be a predictable (NUM_CORES / NUM_KPROBE_INSTANCES) ksoftirqds fighting over any single kprobe socket.
If that works, we should maybe consider scaling this automatically. Going full-golang (i.e. having as many sockets as CPUs) may be a good idea.
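For reference, here is a minimal C sketch of the fanout mechanism (this is not kprobe's actual code; the group id, socket count, and fanout mode are illustrative). Every AF_PACKET socket that joins the same fanout group id gets a share of the interface's traffic, and the mode decides how packets are split across group members.

```c
/* Sketch: N AF_PACKET sockets joined into one fanout group.
 * Requires CAP_NET_RAW (run as root). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>

static int open_fanout_socket(int group_id, int mode)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) {
        perror("socket");
        return -1;
    }

    /* fanout_arg packs the group id in the low 16 bits and the
     * load-balancing mode in the high 16 bits. */
    int fanout_arg = group_id | (mode << 16);
    if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                   &fanout_arg, sizeof(fanout_arg)) < 0) {
        perror("setsockopt(PACKET_FANOUT)");
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    enum { NUM_SOCKETS = 5 };   /* e.g. one socket per kprobe instance */
    int fds[NUM_SOCKETS];

    for (int i = 0; i < NUM_SOCKETS; i++) {
        /* 42 is an arbitrary group id; all members must use the same one. */
        fds[i] = open_fanout_socket(42, PACKET_FANOUT_CPU);
        if (fds[i] < 0)
            exit(1);
    }
    printf("opened %d sockets in one PACKET_FANOUT_CPU group\n", NUM_SOCKETS);
    return 0;
}
```

PACKET_FANOUT_HASH keeps a given flow pinned to one socket (useful when per-flow state matters), while PACKET_FANOUT_CPU sends everything handled by a given CPU - and hence a given ksoftirqd - to the same socket, which is why it should minimise cross-CPU contention on any one socket.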
I tested kprobe's impact on rx network performance in our1 a bit more.
The more connections (flows) hit the server, the more severe kprobe's impact on rx performance becomes. With 10 flows the data was not interesting, so I cranked it up a notch to 100 connections. This was all UDP traffic generated with iperf.
Before I could get anything reasonable done I had to fix https://github.com/kentik/operations/issues/9878 so that we could see more than 1.2 Gbps of rx traffic at all.
Let's get to the tests:
The data presented on the charts is pretty low-resolution, so some effects appear to lag behind each other.
Pretty much OK, because there are only 10 cores writing to a single socket. ~50 kpps of packet loss was still visible, but we got to the 10 Gbps we paid for.
Here things get much nastier.
Very cool seeing this detailed testing! Are the FANOUT=X settings X instances of kprobe?
Yes, that's the number of kprobe instances on the receiving network interface. I used the HASH load-balancing strategy.
For now I'm testing 5 kprobes per interface https://github.com/kentik/operations/pull/9886/
Scalability was greatly improved by https://github.com/kentik/kprobe/pull/35 - we can close this now.
There seems to be lock contention in the tpacket_rcv kernel function. We should investigate why that happens; maybe the socket has too little memory (unlikely?). I'd suggest seeing whether we can get more blocks (see the document below) to check if that gets rid of the lock contention.
https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt
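For context, a minimal C sketch of the "more blocks" knob that packet_mmap.txt describes, i.e. asking the kernel for a larger TPACKET_V3 RX ring (this is not what kprobe currently does; the block and frame sizes are illustrative, not tuned values):

```c
/* Sketch: request a TPACKET_V3 RX ring with more blocks on an AF_PACKET
 * socket and mmap it. Sizes below are illustrative only. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>

static int setup_rx_ring(int fd)
{
    /* The block-based ring only exists in TPACKET_V3. */
    int version = TPACKET_V3;
    if (setsockopt(fd, SOL_PACKET, PACKET_VERSION,
                   &version, sizeof(version)) < 0) {
        perror("setsockopt(PACKET_VERSION)");
        return -1;
    }

    /* 64 blocks of 1 MiB: more blocks give the kernel more room to keep
     * filling while userspace drains the ring. */
    struct tpacket_req3 req;
    memset(&req, 0, sizeof(req));
    req.tp_block_size = 1 << 20;   /* 1 MiB per block, multiple of page size */
    req.tp_block_nr   = 64;
    req.tp_frame_size = 1 << 11;   /* 2 KiB frames */
    req.tp_frame_nr   = (req.tp_block_size / req.tp_frame_size) * req.tp_block_nr;
    req.tp_retire_blk_tov = 60;    /* retire a block after 60 ms even if not full */

    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0) {
        perror("setsockopt(PACKET_RX_RING)");
        return -1;
    }

    /* Map the whole ring so blocks can be read without copying. */
    size_t len = (size_t)req.tp_block_size * req.tp_block_nr;
    if (mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0) == MAP_FAILED) {
        perror("mmap");
        return -1;
    }
    return 0;
}

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); /* needs CAP_NET_RAW */
    if (fd < 0) {
        perror("socket");
        return 1;
    }
    if (setup_rx_ring(fd) == 0)
        puts("RX ring configured");
    close(fd);
    return 0;
}
```

If the contention in tpacket_rcv comes from the ring filling up, a larger ring (more and/or bigger blocks) gives the kernel more headroom while userspace catches up, at the cost of memory; that's the hypothesis worth testing here.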