rsspp / experiments

Experiments for RSS++: NPF scripts to automatically reproduce the results, plus plotting scripts
BSD 3-Clause "New" or "Revised" License

Performance degradation with RoundRobinSwitch #6

strNewBee closed this issue 3 years ago

strNewBee commented 3 years ago

Sorry to bother you again; it may not be appropriate to post this here and ask for your help. I can't run the NPF script due to some problems, so I have to reproduce the results by hand. I was testing RoundRobinSwitch with the following configuration:

DPDKInfo(2097151); //set up dpdk arguments
define(
    $numa    false,
    $verbose 99,
    $ndesc   4096,
    $active  true,
    $num     16,
    $rxnum   1
);

fd0 :: FromDPDKDevice(PORT 0, N_QUEUES $rxnum, MAXTHREADS $rxnum, VERBOSE 99, PAUSE none, NDESC 4096, PROMISC true, MODE none);
td0 :: ToDPDKDevice(PORT 1, N_QUEUES $num, VERBOSE 99, BLOCKING 0, NDESC 4096);

fd0 -> rr :: RoundRobinSwitch(MAX $num);
rr[0] -> qu0 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[1] -> qu1 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[2] -> qu2 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[3] -> qu3 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[4] -> qu4 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[5] -> qu5 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[6] -> qu6 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[7] -> qu7 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[8] -> qu8 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[9] -> qu9 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[10] -> qu10 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[11] -> qu11 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[12] -> qu12 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[13] -> qu13 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[14] -> qu14 :: Queue(CAPACITY 4096, BLOCKING 0);
rr[15] -> qu15 :: Queue(CAPACITY 4096, BLOCKING 0);

uq0 :: Unqueue(ACTIVE $active, SIGNAL false);
uq1 :: Unqueue(ACTIVE $active, SIGNAL false);
uq2 :: Unqueue(ACTIVE $active, SIGNAL false);
uq3 :: Unqueue(ACTIVE $active, SIGNAL false);
uq4 :: Unqueue(ACTIVE $active, SIGNAL false);
uq5 :: Unqueue(ACTIVE $active, SIGNAL false);
uq6 :: Unqueue(ACTIVE $active, SIGNAL false);
uq7 :: Unqueue(ACTIVE $active, SIGNAL false);
uq8 :: Unqueue(ACTIVE $active, SIGNAL false);
uq9 :: Unqueue(ACTIVE $active, SIGNAL false);
uq10 :: Unqueue(ACTIVE $active, SIGNAL false);
uq11 :: Unqueue(ACTIVE $active, SIGNAL false);
uq12 :: Unqueue(ACTIVE $active, SIGNAL false);
uq13 :: Unqueue(ACTIVE $active, SIGNAL false);
uq14 :: Unqueue(ACTIVE $active, SIGNAL false);
uq15 :: Unqueue(ACTIVE $active, SIGNAL false);

qu0, qu1, qu2, qu3, qu4, qu5, qu6, qu7, qu8, qu9, qu10, qu11, qu12, qu13, qu14, qu15
=> uq0, uq1, uq2, uq3, uq4, uq5, uq6, uq7, uq8, uq9, uq10, uq11, uq12, uq13, uq14, uq15
-> td0;

StaticThreadSched(uq0 17);
StaticThreadSched(uq1 18);
StaticThreadSched(uq2 19);
StaticThreadSched(uq3 20);
StaticThreadSched(uq4 21);
StaticThreadSched(uq5 22);
StaticThreadSched(uq6 23);
StaticThreadSched(uq7 24);
StaticThreadSched(uq8 25);
StaticThreadSched(uq9 26);
StaticThreadSched(uq10 27);
StaticThreadSched(uq11 28);
StaticThreadSched(uq12 29);
StaticThreadSched(uq13 30);
StaticThreadSched(uq14 31);
StaticThreadSched(uq15 32);

s :: Script(
    TYPE ACTIVE,
    label loop,
    read fd0.count,
    read fd0.queue_count,
    read td0.count,
    read td0.dropped,
    read uq0.count,
    read uq1.count,
    read uq2.count,
    read uq3.count,
    read uq4.count,
    read uq5.count,
    read uq6.count,
    read uq7.count,
    read uq8.count,
    read uq9.count,
    read uq10.count,
    read uq11.count,
    read uq12.count,
    read uq13.count,
    read uq14.count,
    read uq15.count,
    wait 5,
    goto loop,
);

StaticThreadSched(s 33);

I haven't added any other workload (like a WorkPackage), but it can only send back 0.32 Mpps out of 14.88 Mpps. Am I doing anything wrong with this configuration that causes the performance degradation?

tbarbette commented 3 years ago

Queue is a "push to pull" element that has its own "task" (~= thread). This task will be scheduled on thread 0 by default, so you're basically using only one core.

One piece of advice: keep htop running on the side to verify things run as intended, and launch perf top --cpu X from time to time to verify a core is doing what it is supposed to.

But the biggest problem here is this Queue->Unqueue. Why do that? :p

FromDPDKDevice will receive packets on N threads. Then you can pipe it into thread-safe elements, which correctly handle the fact that they are traversed by multiple cores.

If you have old, non-thread-safe elements and cannot make them thread-safe, then you might use ExactCPUSwitch to have one path per core.
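For illustration, a minimal sketch of that "one path per core" idea (assuming two RX threads; EtherMirror here is only a placeholder for your possibly non-thread-safe per-core processing):

td :: ToDPDKDevice(PORT 1);
fd :: FromDPDKDevice(PORT 0, N_QUEUES 2, MAXTHREADS 2)
    -> sw :: ExactCPUSwitch();
// Each output of ExactCPUSwitch is traversed by exactly one of the RX threads,
// so the per-path elements never need to be thread-safe.
sw[0] -> path0 :: EtherMirror() -> td;
sw[1] -> path1 :: EtherMirror() -> td;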

Also, you might want to fix the NPF problem instead? :p But that's up to you. Eventually you'll need to understand what happens in FastClick/RSS++...

strNewBee commented 3 years ago

Thank you for your help. And I'm so embarrassed right now with my silly questions...

I used the Queue->Unqueue because I thought ToDPDKDevice could only pull packets from another element before sending them, so a Queue->Unqueue combination would be necessary to connect FromDPDKDevice and ToDPDKDevice by converting "push" to "pull".

And I saw that in dpdk.testie, the pipeline/rr part is constructed as disp -> Paint -> Processer. In disp, a RoundRobinSwitch distributes packets to several Queues, which are then unqueued by Processer, so it is more like a Queue -> Paint -> Unqueue procedure.

That's kind of why I used the confusing Queue -> Unqueue in the configuration: just to receive packets, distribute them to different threads, and use multiple queues in ToDPDKDevice to see what throughput and latency it can achieve without any extra processing. XD

tbarbette commented 3 years ago

No worries about the question, it's normal, and you actually caught a bug in DPDK, so I guess it's not that bad.

From -> Queue -> To is an old Click paradigm. Check the FastClick paper maybe? It will give you general insight into high-speed I/O.
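For reference, the classic vanilla Click form of that paradigm is roughly the following (device names are placeholders):

FromDevice(eth0) -> Queue(1024) -> ToDevice(eth1);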

The whole purpose of RSS++ is to avoid "dispatching" core bottlenecks, and allow full run-to-completion.

I understand your confusion from reading dpdk.testie. That file implements all the methods, including the ones that use this old-school "dispatcher -> queues" design, but only to show a comparison in the paper. The "disp" part is not used if you look at the file imported for the RSS++ pipeline.

This is really all it takes: https://github.com/tbarbette/fastclick/blob/master/conf/rsspp/dpdk.click
Replace EtherMirror with your real processing pipeline (or a WorkPackage if you want to reproduce some results).
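In spirit, such a run-to-completion configuration boils down to a sketch like this (simplified; the linked dpdk.click may differ in its parameters):

FromDPDKDevice(PORT 0, N_QUEUES 16, MAXTHREADS 16)
    -> EtherMirror()        // replace with your real processing pipeline
    -> ToDPDKDevice(PORT 1);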

strNewBee commented 3 years ago

I read the RSS++ and FastClick papers, and I thought that RSS++ is intended to fully unleash the potential of the run-to-completion mode (full-push) by adjusting the number of cores used and balancing the different threads.

The RSS++ paper compared some software methods with RSS and RSS++, and those software methods are in pipeline mode (the push-pull model adopted by vanilla Click). As I'm using a CPU with at most 256 lcores and a NIC with at most 16 hardware queues, I was kind of interested in the old pipeline mode. So I was trying to check the performance gap between software dispatching methods and RSS or RSS++, since I may need it if I want to scale to more cores (more than NIC queues) in parallel to boost the processing. (It's a stupid thought, so please ignore it...)

I know that the RR dispatcher is bad, but I didn't expect 0.3 Mpps / 14.88 Mpps kind of bad, because I tested in full-push mode with one RX queue / one TX queue / RSS, and it can reach 5 Mpps / 14.88 Mpps.

So I opened this issue to find out whether the bottleneck is the single dispatching core (RoundRobinSwitch) / single hardware queue, or whether I'm making some mistake in the basic configuration, since I'm really new to Click and DPDK...

tbarbette commented 3 years ago

Ah yes, I thought you were trying to run RSS++, but if you were trying to run a pipeline, that makes more sense.

I read the RSS++ and FastClick papers, and I thought that RSS++ is intended to fully unleash the potential of the run-to-completion mode (full-push) by adjusting the number of cores used and balancing the different threads.

Yes, correct :) Except we don't balance the threads but the packets, so you stay in a pure sharded RTC approach.

As I'm using a CPU with at most 256 lcores and a NIC with at most 16 hardware queues, I was kind of interested in the old pipeline mode.

In that case you don't have much choice indeed. But you're trying to pull a shiny carriage with an old donkey :) If you've got 256 cores, you should have at least a ConnectX-4, which has hundreds of queues. And with 256 cores I'd actually expect a shiny 100G NIC, like a CX6 DX.

But there is still some interest. Systems like Shenango make use of dispatching cores for precise latency requirements and application scheduling, which RSS++ cannot provide. But one core will never handle the upcoming 400Gbps NICs, so you'll need to dispatch packets to those 4 (8? 6?) dispatching cores. And then you may need to run RSS++ to ensure load balancing and the lowest resource utilisation in that kind of tier-one LB. Though in practice I'd look more at pushing Shenango into a SmartNIC.

I know that the RR dispatcher is bad, but I didn't expect 0.3 Mpps / 14.88 Mpps kind of bad, because I tested in full-push mode with one RX queue / one TX queue / RSS, and it can reach 5 Mpps / 14.88 Mpps.

I think there's in part a misconfiguration. First thing, could you use a Pipeliner element instead of Queue->Unqueue? And StaticThreadSched the Pipeliners themselves instead of the Unqueues, of course. Then you should profile with perf record to see where the CPU is blocked, especially the dispatcher, then a random worker core (be sure to compile FastClick with CXXFLAGS=-g).
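For example, one branch of the dispatcher could look like the following sketch (Pipeliner used with default parameters; thread 17 is just the id taken from your configuration):

rr[0] -> pipe0 :: Pipeliner() -> td0;   // Pipeliner replaces the Queue(...) -> Unqueue(...) pair
StaticThreadSched(pipe0 17);            // pin the Pipeliner's task, not an Unqueue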

But if you look at the RSS++ paper, in the figure where we compare those techniques, you'll see that adding more cores degrades the performance, because your single dispatching core is exchanging cache lines with all the other cores like crazy. This should appear in the profile. Also, one often-forgotten problem is that pipelining shares packets not only forwards but also backwards when recycling them: the workers receive packets, must then put them back in a local pool, and when that pool is full, put a batch of those packets in a global pool that the dispatcher will read from...

tbarbette commented 3 years ago

Is this fixed? :)

strNewBee commented 3 years ago

Is this fixed? :)

Unfortunately, no. :( Still, the more queues connected to the RRSwitch, the worse it works. I'll close the issue for now, until I get it going.

tbarbette commented 3 years ago

The more queues, the more contention you create. That's why people do RTC in high-speed networking :) As discussed above.