kentik / kprobe

GNU General Public License v2.0

Kprobe does not scale very well with high (>5Gbps) rx traffic #33

Closed: pikrzysztof closed this issue 7 months ago

pikrzysztof commented 8 months ago

There seems to be lock contention in the tpacket_rcv kernel function. We should investigate why that happens; maybe the socket has too little memory (unlikely?). I'd suggest checking whether allocating more blocks (see the packet_mmap document linked below) gets rid of the lock contention.

https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt
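For reference, here is a minimal Go sketch (using golang.org/x/sys/unix, not kprobe's own code) of the knob packet_mmap.txt describes: sizing a TPACKET_V3 rx ring with more blocks. The block/frame sizes and the 256-block count are illustrative assumptions, not values taken from kprobe.

```go
// Sketch only: open an AF_PACKET socket and size a TPACKET_V3 rx ring.
// Binding to the capture interface and mmap'ing the ring are omitted.
package main

import (
	"golang.org/x/sys/unix"
)

func openRxRing(blocks int) (int, error) {
	// ETH_P_ALL must be passed in network byte order (htons) for AF_PACKET.
	proto := int(unix.ETH_P_ALL<<8&0xff00 | unix.ETH_P_ALL>>8)
	fd, err := unix.Socket(unix.AF_PACKET, unix.SOCK_RAW, proto)
	if err != nil {
		return -1, err
	}

	// Use TPACKET_V3 so the kernel hands packets over in blocks.
	if err := unix.SetsockoptInt(fd, unix.SOL_PACKET, unix.PACKET_VERSION, unix.TPACKET_V3); err != nil {
		unix.Close(fd)
		return -1, err
	}

	// More (and/or larger) blocks give tpacket_rcv more room to write into,
	// which is the tuning knob packet_mmap.txt describes.
	req := unix.TpacketReq3{
		Block_size:     1 << 22,        // 4 MiB per block (assumed)
		Block_nr:       uint32(blocks), // the value under investigation
		Frame_size:     1 << 11,        // 2 KiB frames (assumed)
		Retire_blk_tov: 60,             // ms before a partially filled block is retired
	}
	req.Frame_nr = (req.Block_size / req.Frame_size) * req.Block_nr

	if err := unix.SetsockoptTpacketReq3(fd, unix.SOL_PACKET, unix.PACKET_RX_RING, &req); err != nil {
		unix.Close(fd)
		return -1, err
	}
	return fd, nil
}

func main() {
	fd, err := openRxRing(256)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)
	// mmap the ring and consume blocks here (omitted).
}
```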

pikrzysztof commented 8 months ago

I managed to find out why %softirq is so high.

I ran an experiment consisting of four iperf tests:

(chart "krpoxy-edited": iperf results for the four runs)

We can see that %irq is significantly higher when we're running kprobe and there are multiple source IPs.

This is because a single connection's traffic is handled by a single ksoftirqd. The moment you have multiple busy connections, you've got multiple busy ksoftirqds, all of which want to write to the same kprobe socket and therefore have to share a lock.

We can shard kprobe work across multiple kprobe instances. How packets get distributed to each instance probably depends on AF_PACKET fanout primitives - see Alistair's great idea here: https://kentik.slack.com/archives/C02K282D85N/p1710950807748369?thread_ts=1710949860.784639&cid=C02K282D85N We predict that PACKET_FANOUT_CPU will yield the best results, since then a predictable number of ksoftirqds (NUM_CORES / NUM_KPROBE_INSTANCES) will be fighting over each kprobe socket.

If that works, we should consider scaling this automatically. Going full-golang (i.e. having as many sockets as CPUs) may also be a good idea; a sketch of what that could look like is below.
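A rough Go sketch of the fanout idea, assuming plain AF_PACKET sockets via golang.org/x/sys/unix (this is not kprobe's actual implementation): every socket that sets PACKET_FANOUT with the same group id joins the same fanout group, whether those sockets live in one process or in X separate kprobe instances. The loop below just makes the "one socket per CPU" variant concrete.

```go
// Sketch only: open one AF_PACKET socket per CPU and join them all to a single
// PACKET_FANOUT group so the kernel spreads traffic across them.
// PACKET_FANOUT_CPU keeps each ksoftirqd writing to the socket for its own CPU,
// which is the "predictable sharing" argued for above.
package main

import (
	"runtime"

	"golang.org/x/sys/unix"
)

func openFanoutSockets(groupID int) ([]int, error) {
	// PACKET_FANOUT option value: group id in the low 16 bits, mode in the high 16.
	fanoutArg := groupID&0xffff | unix.PACKET_FANOUT_CPU<<16

	proto := int(unix.ETH_P_ALL<<8&0xff00 | unix.ETH_P_ALL>>8) // htons(ETH_P_ALL)

	var fds []int
	for i := 0; i < runtime.NumCPU(); i++ {
		fd, err := unix.Socket(unix.AF_PACKET, unix.SOCK_RAW, proto)
		if err != nil {
			return nil, err
		}
		// Binding each socket to the capture interface is omitted here.
		// All sockets that set the same group id share the fanout group.
		if err := unix.SetsockoptInt(fd, unix.SOL_PACKET, unix.PACKET_FANOUT, fanoutArg); err != nil {
			return nil, err
		}
		fds = append(fds, fd)
	}
	return fds, nil
}

func main() {
	fds, err := openFanoutSockets(1234)
	if err != nil {
		panic(err)
	}
	// Each fd would get its own rx ring and reader goroutine (omitted).
	_ = fds
}
```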

pikrzysztof commented 7 months ago

I tested kprobe's impact on rx network performance in our1 a bit more.

The more connections (flows) hit the server, the more severe kprobe's impact on rx performance becomes. With 10 flows the data was not very interesting, so I cranked it up a notch to 100 connections. This was all UDP traffic generated with iperf.

Before I could get anything reasonable done, I had to fix https://github.com/kentik/operations/issues/9878 so that we could see more than 1.2Gbps of rx traffic at all.

Let's get to the tests:

The data presented in the charts is fairly low resolution, so some effects appear to lag behind each other.

10 flows

Pretty much OK, because there are only 10 cores writing to a single socket. ~50kpps of packet loss was still visible, but we reached the 10Gbps we pay for.

100 flows

Here things get much nastier.

(chart "fanouttests": results for the different FANOUT= settings)

ghost commented 7 months ago

Very cool to see this detailed testing! The FANOUT= settings are X instances of kprobe? I'd guess that adding more instances than available cores would negatively impact performance.

pikrzysztof commented 7 months ago

> The FANOUT= settings are X instances of kprobe

Yes, that's the number of kprobe instances on the receiving network interface. I used the HASH load-balancing strategy.
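For context, a tiny sketch (assumed helper, not kprobe code) of where that strategy choice lives: the fanout mode sits in the upper 16 bits of the PACKET_FANOUT option value, so switching between HASH and CPU is just a different mode constant. PACKET_FANOUT_HASH keeps each flow on one socket/instance, while PACKET_FANOUT_CPU shards by the CPU that received the packet.

```go
// Assumed helper, not taken from kprobe.
package fanout

import "golang.org/x/sys/unix"

// joinFanout attaches an already-open AF_PACKET socket to fanout group
// groupID using the given mode, e.g. unix.PACKET_FANOUT_HASH or
// unix.PACKET_FANOUT_CPU.
func joinFanout(fd, groupID, mode int) error {
	arg := groupID&0xffff | mode<<16
	return unix.SetsockoptInt(fd, unix.SOL_PACKET, unix.PACKET_FANOUT, arg)
}
```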

For now I'm testing 5 kprobes per interface: https://github.com/kentik/operations/pull/9886/

pikrzysztof commented 7 months ago

Scalability was greatly improved by https://github.com/kentik/kprobe/pull/35, so we can close this now.