Closed: strNewBee closed this issue 3 years ago
Queue is a "push to pull" element that has its own "task" (~= thread). This task is scheduled on thread 0 by default, so you're basically using only one core.
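For instance, a hypothetical sketch of moving that task off thread 0 explicitly (port numbers and thread IDs are assumptions):

```
// The Unqueue's pull task lands on thread 0 by default;
// StaticThreadSched pins it to another core
FromDPDKDevice(0) -> q :: Queue(1024);
q -> uq :: Unqueue() -> ToDPDKDevice(0);
StaticThreadSched(uq 2);  // run the pull task on thread 2 instead of 0
```

You can then confirm with htop that the extra core is actually busy.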
One piece of advice: keep `htop` running on the side to verify things run as intended, and launch `perf top --cpu X` from time to time to verify a core is doing what's intended.
But the biggest problem here is this Queue->Unqueue. Why do that? :p
FromDPDKDevice will receive packets on N threads. Then, you can pipe it to thread-safe elements that correctly handle being traversed by multiple cores.
If you have old non-thread-safe elements and cannot make them thread-safe, then you might use ExactCPUSwitch to have one path per core.
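A sketch of that per-core-path idea (the `MyOldElement` instances are placeholders for real non-thread-safe elements; port numbers are assumptions):

```
// ExactCPUSwitch emits each packet on the output port matching the
// core it arrived on, so each non-thread-safe instance below is
// only ever touched by a single core
FromDPDKDevice(0) -> sw :: ExactCPUSwitch();
td :: ToDPDKDevice(0);   // thread-safe, can be shared
sw[0] -> e0 :: MyOldElement() -> td;
sw[1] -> e1 :: MyOldElement() -> td;
```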
Also you might want to fix the NPF problem instead ? :p But that's up to you. Eventually you'll need to understand what happens in FastClick/RSS++...
Thank you for your help. And I'm so embarrassed right now with my silly questions...
I used the Queue->Unqueue because I thought ToDPDKDevice could only pull packets from other elements before sending them, and that the Queue->Unqueue combination was necessary to connect FromDPDKDevice and ToDPDKDevice by converting "push" to "pull".
And I saw that in dpdk.testie, the pipeline/rr part was constructed as disp -> Paint -> Processer. In disp, RoundRobinSwitch distributes packets to several Queues, which are then unqueued by Processer, so it's more like a Queue -> Paint -> Unqueue procedure.
That's kind of why I used the confusing Queue -> Unqueue in the configuration: just to receive packets, distribute them to different threads, and use multiple queues in ToDPDKDevice, to see what kind of throughput and latency it can achieve without any extra processing. XD
No worries about the question, it's normal, and you actually caught a bug in DPDK, so I guess it's not that bad.
From -> Queue -> To is an old Click paradigm. Check the FastClick paper maybe? It will give you general insight about high-speed I/O.
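For illustration, the old push -> Queue -> pull paradigm in vanilla-Click style looks roughly like this (a sketch; device names are placeholders):

```
// Old-school vanilla Click: the push side fills the Queue,
// and a separate pull-side task drains it towards TX
FromDevice(eth0) -> Queue(1024) -> ToDevice(eth0);
```

The FastClick full-push model removes the Queue entirely, so the receiving thread carries each packet all the way to transmission.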
The whole purpose of RSS++ is to avoid "dispatching" core bottlenecks, and allow full run-to-completion.
I understand your confusion from reading dpdk.testie. That file implements all the methods, including the ones that use the old-school "dispatcher -> queues" approach; they are only there for the comparison in the paper. The "disp" part is not used if you look at the file imported for the RSS++ pipeline.
This is really all it takes: https://github.com/tbarbette/fastclick/blob/master/conf/rsspp/dpdk.click. Replace EtherMirror with your real processing pipeline (or a WorkPackage if you want to reproduce some results).
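In other words, the linked config is essentially of this shape (a sketch only; the real file adds DPDK and RSS++ parameters):

```
// Pure run-to-completion: every RX thread carries its own packets
// through the whole pipeline straight to TX, no queues in between
FromDPDKDevice(0) -> EtherMirror -> ToDPDKDevice(0);
```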
I read the papers of RSS++ and FastClick, and I thought that RSS++ is intended to fully unlock the potential of run-to-completion mode (full-push) by adjusting the number of cores used and balancing load across threads.
The RSS++ paper compared some software methods with RSS and RSS++, and those software methods are in pipeline mode (the push-pull adopted by vanilla Click). As I'm using a CPU with at most 256 lcores and a NIC with at most 16 hardware queues, I was kind of interested in the old pipeline mode. So I was trying to check the performance gap between software dispatching methods and RSS, or RSS++, since I may need it if I want to scale to more cores (more than NIC queues) in parallel to boost the processing. (It's a stupid thought so please ignore it...)
I know that the RR dispatcher is bad, but I didn't expect 0.3Mpps / 14.88Mpps kind of bad, because I tested in full-push mode with one RX queue / one TX queue / RSS, and it can reach 5Mpps / 14.88Mpps.
So I opened this issue to find out whether the bottleneck is the single dispatching core (RoundRobinSwitch) / single hardware queue, or whether I'm making some mistake in the basic configuration, since I'm really new to Click and DPDK...
Ah yes I thought you were trying to run RSS++, but if you were trying to run a pipeline, that makes more sense.
> I read the papers of RSS++ and FastClick, and I thought that RSS++ is intended to fully unlock the potential of run-to-completion mode (full-push) by adjusting the number of cores used and balancing load across threads.
Yes, correct :) Except we don't balance the threads but the packets, so you stay in a pure sharded RTC approach.
> As I'm using a CPU with at most 256 lcores and a NIC with at most 16 hardware queues, I was kind of interested in the old pipeline mode.

In that case you don't have much choice indeed. But you're trying to pull a shiny carriage with an old donkey :) If you've got 256 cores, you should have at least a ConnectX-4, which has hundreds of queues. But with 256 cores I'd expect a shiny 100G NIC, like a ConnectX-6 Dx.
But there is still some interest. Systems like Shenango use dispatching cores for precise latency requirements and application scheduling, which RSS++ cannot provide. But one core will never handle the upcoming 400Gbps NICs, so you'll need to dispatch packets to those 4 (8? 6?) dispatching cores. And then you may need to run RSS++ to ensure load balance and the lowest resource utilisation in this kind of tier-one LB. Though in practice I'd look more at pushing Shenango into a SmartNIC.
> I know that the RR dispatcher is bad, but I didn't expect 0.3Mpps / 14.88Mpps kind of bad, because I tested in full-push mode with one RX queue / one TX queue / RSS, and it can reach 5Mpps / 14.88Mpps.
I think there's partly a misconfiguration. First, could you use a Pipeliner element instead of Queue->Unqueue? And StaticThreadSched the Pipeliners themselves instead of the Unqueues, of course. Then you should profile with perf record to see where the CPU is blocked, especially the dispatcher and then a random worker core (be sure to compile FastClick with CXXFLAGS=-g).
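A hypothetical sketch of that suggestion (port numbers, thread IDs and the two-way fan-out are assumptions):

```
// Pipeliner replaces the Queue->Unqueue pair: it is a software
// queue whose own task drains it, pinned here via StaticThreadSched
FromDPDKDevice(0) -> rr :: RoundRobinSwitch();
td :: ToDPDKDevice(0);
rr[0] -> p0 :: Pipeliner() -> td;
rr[1] -> p1 :: Pipeliner() -> td;
StaticThreadSched(p0 1, p1 2);  // one worker core per Pipeliner
```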
But if you look at the figure in the RSS++ paper where we compare those techniques, you'll see that adding more cores degrades performance, because your single dispatching core is exchanging cache lines with all the cores like crazy. This should show up in the profile. Also, one often-forgotten problem is that pipelining shares packets not only forwards, but also backwards when recycling them: the workers receive packets, then must put them back in a local pool, and when that pool is full, put a batch of those packets in a global pool that the dispatcher will read from...
Is this fixed? :)
> Is this fixed? :)
Unfortunately, no. :( Still, the more queues connected to the RoundRobinSwitch, the worse it works. I'll close the issue even though I haven't got it going.
The more queues, the more contention you create. That's why people do RTC in high-speed networking :) As discussed above.
Sorry to bother you again; it may not be appropriate to post this here and ask for your help. I can't run the NPF script due to some problems, so I have to reproduce the result by hand. I was testing RoundRobinSwitch with the following configuration:
I haven't added any other payload (like WorkPackage), but it can only send back 0.32Mpps out of 14.88Mpps. Was I doing anything wrong with this configuration that causes the performance degradation?