kentik / kprobe

GNU General Public License v2.0

Kprobe does not scale very well with high (>5Gbps) rx traffic #33

Closed: pikrzysztof closed this issue 7 months ago

pikrzysztof commented 8 months ago

There seems to be lock contention in the tpacket_rcv kernel function. We should investigate why that happens; maybe the socket has too little memory (unlikely?). I'd suggest checking whether allocating more blocks (see the packet_mmap document linked below) gets rid of the lock contention.

https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt
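For reference, here is a minimal Go sketch (using golang.org/x/sys/unix, not kprobe's own code) of the knob packet_mmap.txt describes: sizing a TPACKET_V3 rx ring with more blocks. The block/frame sizes and the 256-block count are illustrative assumptions, not values taken from kprobe.

```go
// Sketch only: open an AF_PACKET socket and size a TPACKET_V3 rx ring.
// Binding to the capture interface and mmap'ing the ring are omitted.
package main

import (
	"golang.org/x/sys/unix"
)

func openRxRing(blocks int) (int, error) {
	// ETH_P_ALL must be passed in network byte order (htons) for AF_PACKET.
	proto := int(unix.ETH_P_ALL<<8&0xff00 | unix.ETH_P_ALL>>8)
	fd, err := unix.Socket(unix.AF_PACKET, unix.SOCK_RAW, proto)
	if err != nil {
		return -1, err
	}

	// Use TPACKET_V3 so the kernel hands packets over in blocks.
	if err := unix.SetsockoptInt(fd, unix.SOL_PACKET, unix.PACKET_VERSION, unix.TPACKET_V3); err != nil {
		unix.Close(fd)
		return -1, err
	}

	// More (and/or larger) blocks give tpacket_rcv more room to write into,
	// which is the tuning knob packet_mmap.txt describes.
	req := unix.TpacketReq3{
		Block_size:     1 << 22,        // 4 MiB per block (assumed)
		Block_nr:       uint32(blocks), // the value under investigation
		Frame_size:     1 << 11,        // 2 KiB frames (assumed)
		Retire_blk_tov: 60,             // ms before a partially filled block is retired
	}
	req.Frame_nr = (req.Block_size / req.Frame_size) * req.Block_nr

	if err := unix.SetsockoptTpacketReq3(fd, unix.SOL_PACKET, unix.PACKET_RX_RING, &req); err != nil {
		unix.Close(fd)
		return -1, err
	}
	return fd, nil
}

func main() {
	fd, err := openRxRing(256)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)
	// mmap the ring and consume blocks here (omitted).
}
```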

pikrzysztof commented 8 months ago

I managed to find out why %softirq is so high.

I ran an experiment consisting of four iperf tests:

(chart "krpoxy-edited": iperf results for the four runs)

We can see that %irq is significantly higher when we're running kprobe and there are multiple source IPs.

This is because a single connection's traffic is handled by a single ksoftirqd. The moment you have multiple busy connections, you've got multiple busy ksoftirqds, all of which want to write to the same kprobe socket and therefore have to share a lock.

We can shard kprobe work across multiple kprobe instances. How packets get distributed to each instance probably depends on AF_PACKET fanout primitives - see Alistair's great idea here: https://kentik.slack.com/archives/C02K282D85N/p1710950807748369?thread_ts=1710949860.784639&cid=C02K282D85N We predict that PACKET_FANOUT_CPU will yield the best results, since then a predictable number of ksoftirqds (NUM_CORES / NUM_KPROBE_INSTANCES) will be fighting over each kprobe socket.

If that works, we should consider scaling this automatically. Going full-golang (i.e. having as many sockets as CPUs) may also be a good idea; a sketch of what that could look like is below.
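A rough Go sketch of the fanout idea, assuming plain AF_PACKET sockets via golang.org/x/sys/unix (this is not kprobe's actual implementation): every socket that sets PACKET_FANOUT with the same group id joins the same fanout group, whether those sockets live in one process or in X separate kprobe instances. The loop below just makes the "one socket per CPU" variant concrete.

```go
// Sketch only: open one AF_PACKET socket per CPU and join them all to a single
// PACKET_FANOUT group so the kernel spreads traffic across them.
// PACKET_FANOUT_CPU keeps each ksoftirqd writing to the socket for its own CPU,
// which is the "predictable sharing" argued for above.
package main

import (
	"runtime"

	"golang.org/x/sys/unix"
)

func openFanoutSockets(groupID int) ([]int, error) {
	// PACKET_FANOUT option value: group id in the low 16 bits, mode in the high 16.
	fanoutArg := groupID&0xffff | unix.PACKET_FANOUT_CPU<<16

	proto := int(unix.ETH_P_ALL<<8&0xff00 | unix.ETH_P_ALL>>8) // htons(ETH_P_ALL)

	var fds []int
	for i := 0; i < runtime.NumCPU(); i++ {
		fd, err := unix.Socket(unix.AF_PACKET, unix.SOCK_RAW, proto)
		if err != nil {
			return nil, err
		}
		// Binding each socket to the capture interface is omitted here.
		// All sockets that set the same group id share the fanout group.
		if err := unix.SetsockoptInt(fd, unix.SOL_PACKET, unix.PACKET_FANOUT, fanoutArg); err != nil {
			return nil, err
		}
		fds = append(fds, fd)
	}
	return fds, nil
}

func main() {
	fds, err := openFanoutSockets(1234)
	if err != nil {
		panic(err)
	}
	// Each fd would get its own rx ring and reader goroutine (omitted).
	_ = fds
}
```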

pikrzysztof commented 7 months ago

I tested kprobe's impact on rx network performance in our1 a bit more.

The more connections (flows) hit the server, the more severe kprobe's impact on rx performance becomes. With 10 flows the data was not very interesting, so I cranked it up a notch to 100 connections. This was all UDP traffic generated with iperf.

Before I could get anything reasonable done, I had to fix https://github.com/kentik/operations/issues/9878 so that we could see more than 1.2Gbps of rx traffic at all.

Let's get to the tests:

The data presented in the charts is fairly low resolution, so some effects appear to lag behind each other.

10 flows

Pretty much OK, because there are only 10 cores writing to a single socket. ~50kpps of packet loss was still visible, but we reached the 10Gbps we pay for.

100 flows

Here things get much nastier.

(chart "fanouttests": results for the different FANOUT= settings)

ghost commented 7 months ago

Very cool to see this detailed testing! The FANOUT= settings are X instances of kprobe? I'd guess that adding more instances than available cores would negatively impact performance.

pikrzysztof commented 7 months ago

> The FANOUT= settings are X instances of kprobe

Yes, that's the number of kprobe instances on the receiving network interface. I used the HASH load-balancing strategy.
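For context, a tiny sketch (assumed helper, not kprobe code) of where that strategy choice lives: the fanout mode sits in the upper 16 bits of the PACKET_FANOUT option value, so switching between HASH and CPU is just a different mode constant. PACKET_FANOUT_HASH keeps each flow on one socket/instance, while PACKET_FANOUT_CPU shards by the CPU that received the packet.

```go
// Assumed helper, not taken from kprobe.
package fanout

import "golang.org/x/sys/unix"

// joinFanout attaches an already-open AF_PACKET socket to fanout group
// groupID using the given mode, e.g. unix.PACKET_FANOUT_HASH or
// unix.PACKET_FANOUT_CPU.
func joinFanout(fd, groupID, mode int) error {
	arg := groupID&0xffff | mode<<16
	return unix.SetsockoptInt(fd, unix.SOL_PACKET, unix.PACKET_FANOUT, arg)
}
```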

For now I'm testing 5 kprobes per interface: https://github.com/kentik/operations/pull/9886/

pikrzysztof commented 7 months ago

Scalability was greatly improved by https://github.com/kentik/kprobe/pull/35, so we can close this now.