100G Packet Rates: Per-CPU vs Per-Port

lukego commented 8 years ago

I am pondering how to think about packet rates in the 100G era. How should we be designing and optimizing our software?

Consider these potential performance targets:

A: 1x100G @ 64 Mpps.
B: 1x100G @ 96 Mpps.
C: 2x100G @ 128 Mpps (max 64 Mpps per port).

I have a whole mix of questions about these in my mind:

Which target is most meaningful? For which applications?
How do you optimize for each target?
Which one is harder to achieve? Is the main challenge software optimization or hardware selection?

Raw brain dump...

Performance "A" seems suitable for many applications. This is providing 100G of bandwidth for average packet size of around 192 bytes or higher.
Performance "B" may be more suitable for some specific applications. Load generators (like packetblaster) may want to maximize the packet rate on a single port. Packet capture applications (like firehose) may need to never miss a packet.
Performance "C" seems suitable for applications that need to scale out. For example IPv4-IPv6 translation (lwAFTR) may want to maximize hardware density by handling as many 100G ports per server as possible.

So how would you optimize for each? In every case you would surely use multiple cores with RSS or equivalent traffic dispatching. Beyond that...

Performance "A" would let you choose your own trade-off between software optimization or throwing hardware at the problem. If you are ambitious you may pick a low-end CPU like a Xeon E3-1650v3 (6 cores @ 3.5 GHz) for 328 cycles per packet processing budget. If you prefer to throw hardware at the problem you may pick a high-end CPU like a Xeon E5-2699v4 (22 cores @ 2.2 GHz) for 756 cycles per packet. There are plenty of points in between, too.
Performance "B" is putting more strain on both the CPU and the NIC. You may need to use special driver routines to get the NIC performance you want & this may involve trading off CPU resources to reduce load on the NIC. For example, ConnectX-4 NICs may provide better packet rates with "inline descriptor" mode but this involves a complete packet copy in the driver routine i.e. spending CPU cycles and cache footprint to assist the DMA engine on the NIC. Similarly the NICs have hardware features for e.g. tuple-matching on packets but those surely have performance limits (e.g. rules checked per second) that you are more likely to discover the harder you press them.
Performance "C" is putting twice the strain on the CPU compared with "A". This would require software optimization since it halves the cycles-per-packet budget. It should not cause NIC problems since the per-port load is the same. There seems to be a risk of uncovering performance limits in the "uncore" parts of the processor e.g. the L3 cache, the DMA engine, the RAM controller, the IOMMU (if it were used), and so on.

So which would be hardest to achieve? and why?

The one I have a bad feeling about is "B". Historically we are used to NICs that can do line-rate with 64B packets. However, those days may be behind us. If you read Intel datasheets then the minimum packet rate they are guaranteeing is 64B for 10G (82599), 128B for 40G (XL710), and 256B for 100G (FM10K). (This is lower even than performance "A".) If our performance targets for the NICs are above what they are designed for then we are probably headed for trouble. I think if we want to support really high per-port packet rates then it will take a lot of work and we will be very contained in which hardware we can choose (both vendor and enabled features).

So, stumbling back towards the development de jour, I am tempted to initially accept the 64 Mpps per port limit observed in #1007 and focus on supporting "A" and "C". In practical terms this means spending my efforts on writing simple and CPU-efficient transmit/receive routines rather than looking for complex and CPU-expensive ways to squeeze more packets down a single card. We can always revisit the problem of squeezing the maximum packet rate out of a card in the context of specific applications (e.g. packetblaster and firehose) and there we may be able to "cheat" in some useful application-specific ways.

Early days anyway... next step is to see how the ConnectX-4 performs with simultaneous transmit+receive using generic routines. Can't take the 64 Mpps figure to the bank quite yet.

Thoughts?

sleinen commented 8 years ago

Nice description of the problem, and I agree with your conclusions. For most real applications you should be able to reduce b) to c) at the cost of additional ports. Exceptions? Artificial constraints such as "dragster-race" competitions (Internet2 land-speed records) or unrealistic customer expectations ("we only ever buy kit that does line rate even with 64-byte christmas-tree-packet workloads").

Cost of additional ports may be a problem, but that needs to be weighted against development costs as well. (You can formulate that as a time-to-market argument where you have the choice of either getting a working system now and upgrade it to the desired throughput once additional ports have gotten cheaper, or waiting until a "more efficient" system is developed that can do the same work with just one port :-)

lukego commented 8 years ago

Relatedly: Nathan Owens pointed out to me via Twitter that the sexy Broadcom Tomahawk 32x100G switches only do line-rate with >= 250B packets. Seems to be confirmed on ipspace.net.

virtuallynathan commented 8 years ago

As far as other switches go, Mellanox Spectrum can do line-rate at all packet sizes. Based on their "independent" testing, it seems Broadcom's spec is not 100% accurate, see page 11: http://www.mellanox.com/related-docs/products/tolly-report-performance-evaluation-2016-march.pdf

I haven't seen a number for Cavium Xpliant. On Thu, Sep 8, 2016 at 2:51 AM Luke Gorrie notifications@github.com wrote:

Relatedly: Nathan Owens pointed out to me via Twitter that the sexy Broadcom Tomahawk 32x100G switches only do line-rate with >= 250B packets. Seems to be confirmed on ipspace.net http://blog.ipspace.net/2015/12/broadcom-tomahawk-101.html.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/snabbco/snabb/issues/1013#issuecomment-245549003, or mute the thread https://github.com/notifications/unsubscribe-auth/ADT7LazQEjat9-FbjcATRK1Oi_fHKjOZks5qn9qVgaJpZM4J2yWB .

plajjan commented 8 years ago

I don't think you should go out of your way to support what seems to be a bad NIC, i.e. if it requires you to move packets to certain places and thus decreasing performance in Snabb it's a bad move.

I want to be able to get wirespeed performance out of this by asking a vendor to produce a fast NIC and then just throw more cores out of it. If someone don't need wirespeed they can buy a bad/cheaper NIC (seemingly like this Mellanox) and can use less cores.

Most importantly the decision on pps/bps should be with the end-user :)

The first 10G NICs I used didn't do much more than 5Gbps. I think it's too early in the life of 100G NICs to draw conclusions on general trends.

lukego commented 8 years ago

Here are some public performance numbers from Mellanox: https://www.mellanox.com/blog/2016/06/performance-beyond-numbers-stephen-curry-style-server-io/

The headlines there are line-rate 64B with 25G NIC and 74.4 Mpps max on 100G. (I am told they have squeezed a bit more than this on the 100G but I haven't found a published account of that.)

Note that there are two different ASICs: "ConnectX-4" (100G) and "ConnectX-4 Lx" (10G/25G/40G/50G). If you needed more silicon horsepower per 100G, for example to do line-rate with 64B packets, maybe combining 4x25G NICs would be a viable option? (Is that likely to cause interop issues with 100G ports on switches/routers in practice?)

lukego commented 8 years ago

I tested the ConnectX-4 with every packet size 60..1500 and at both 3.5 GHz and 2.0 GHz.

rplot01

Whaddayareckon?

plajjan commented 8 years ago

Interesting graph. Is it some fixed buffer size that leads to the plateaus?

lukego commented 8 years ago

Good question. It looks like the size of each packet is effectively being rounded up to a multiple of 64. I wonder what would cause this?

Suspects to eliminate:

Software.
DMA.
Ethernet MAC/PHY.

lukego commented 8 years ago

DMA/PCIe

I would really like to extend our PMU support to also track "uncore" counters like PCIe/RAM/NUMA activity. This way we could include all of those values in the data sets.

Meanwhile I created a little table by hand. This shows the PCIe activity on both sides of the first four distinct plateaus.

Mpps  PacketSize   PCIeRdCur (M)  DRd (M)  PCIeRead (GB)  PCIeWrite (GB)
37    190          1323           355      107            0.082
37    250          1656           356      128            0.082

30    260          1592           277      120            0.066
30    316          1591           284      120            0.066

25    320          1311           232       99            0.054
25    380          1529           232      112            0.054

21    384          1313           199       97            0.046
21    440          1512           204      110            0.046

This is based on me snipping bits from the output of the Intel Performance Counter Monitor tool. I found some discussion of its output here.

Here is a very preliminary idea of how I am interpreting these columns:

Mpps: Approximate packet rate of the plateau.
PacketSize: Bytes per packet (+CRC).
PCIeRdCur (M): Millions of 64B cache lines fetched from memory via PCIe.
DRd (M): Millions of 64B cache lines fetched from L3 cache via DDIO.
PCIeRead (GB) and PCIeWrite (GB): Total data read by NIC / written by NIC over PCIe. (Docs seem to say that this is Gigabytes but the numbers only makes sense to me as Gigabits.)

How to interpret this? In principle it seems tempting to blame the "64B-wide plateau" issue on DMA if it is fetching data in 64B cache lines. Trouble is that then I would expect to see the same level of PCIe traffic for both sides of the plateau -- and probably with PCIe bandwidth maxxed out at 128Gbps (PCIe 3.0 x16 slot). However, in most cases it seems like PCIe bandwidth is not maxxed out and the right-hand side of the plateau is transferring more data.

So: no smoking gun from looking at PCIe performance counters.

Ethernet MAC/PHY

I have never really looked closely at the internals of Layer-2 and Layer-1 on Ethernet. Just a few observations from reading wikipedia though:

100GbE uses 64b/66b encoding.
100GbE uses four physical channels.

So as a wild guess it seems possible that 100GbE would show some effects at 32-byte granularity (64 bit * 4 channel) based on the physical transport. However, this would only be 1/2 of the plateau size, and I don't know whether this 64-bit/4-channel grouping is visible in the MAC layer or just an internal detail of the physical layer.

I am running a test on a 10G ConnectX-4 NIC now just out of curiosity. If this showed plateaus with 1/4 the width then it may be reasonable to wonder if the issue is PHY related (10GbE also uses 64b/66b but via only one channel).

fmadio commented 8 years ago

probably want to look at L3 miss rate and/or DDR Rd counters as it steps.

Gen3 x16 will max out ~ 112Gbps after the encoding overhead in practice.

plajjan commented 8 years ago

I don't think 64b/66b has anything to do with this. That's just avoiding certain bit patterns on the wire and happens real close to the wire nor do I think it's related the AUI interface (which I assume you are referring to).

Doesn't the NIC copy packets from RAM to some little circular transmit buffer just before it sends them out? Is that buffer carved up in 64 byte slices?

fmadio commented 8 years ago

Yeah 64/66 encoding is not connected to this at all, there`s absolutely no flow control when your at that level, its 103.125Gbps or 0 Gbps with nothing in between.

There should be some wide FIFO`s before transferring to the CMAC and down the wire, but even then it should be at least 64B wide (read 512bit ) interface, which means 512b x say 250Mhz -> 128Gbps. More importantly that will effect the PPS rate which even at 128B packets would clock in at 125Mpps (2 clocks @ 250Mhz). My money is on L3/LLC or UnCore or QPI cache/request counts.

lukego commented 8 years ago

@fmadio Yes, this sounds like the most promising line of inquiry now: Can we explain the performance here, including the plateau every 64B, in terms of the way the memory subsystem is serving DMA requests. And if the memory subsystem is the bottleneck then can we improve its performance e.g. by serving more requests from L3 cache rather than DRAM.

Time for me to read the Intel Uncore Performance Monitoring Reference Manual...

fmadio commented 8 years ago

Yup its probably QPI / L3 / DDR some where some how. Assuming the Tx payloads are unique memory locations the plateau is the PCIe requestor hitting a 64B line somewhere, the drop is additional latency to fetch the next line, probably Uncore -> QPI -> LLC/L3. Note that the PCIe EP on the Uncore does not do any prefetching such as the CPUs DCU streamer, thus its a cold hard miss... back to fun days of CPU`s with no cache!

If you really want to dig into it suggest getting a PCIe sniffer but those things are dam expensive :(

lukego commented 8 years ago

@fmadio Great thoughts, keep 'em coming :).

I added a couple of modeled lines based on your suggestions:

Max 100GbE showing the theoretical maximum packet rate, based on the notion that the NIC will always transmit at 100Gbps & the MAC will add 24 bytes of per-packet overhead (CRC + Preamble + Gap).
Max PCIe/MLX4 showing the expected PCIe bandwidth limit, based on the notion that PCIe is transferring cache lines at 112Gbps and ConnectX-4 has one cache line per packet of overhead (64B transmit descriptor).

Here is how it looks (click to zoom):

rplot02

This looks in line with the theory of a memory/uncore bottleneck:

Cache line granularity explains the width and alignments of the plateaus.
Performance curve would be smoothed where it reaches line rate, but it never does.
Cache lines are not being delivered to the NIC fast enough to keep the transmitter busy.

One more interesting perspective is to change the Y-axis from Mpps to % of line rate:

rplot03

Looks to me like:

We are delivering ~80 Gbps of packet-data cache lines to the NIC.
If the packet size is a multiple of 64B then all the transferred data can be sent onto the wire, but otherwise part of the last cache line is not used and throughput drops.
There are a couple of sweet-spots around 256B where throughput reaches 85% of line rate. Perhaps more of the cache lines were served from L3 cache vs RAM here.

So the next step is to work out how to keep the PCIe pipe full with cache lines and break the 80G bottleneck.

fmadio commented 8 years ago

Cool, one thing totally forgot is 112Gbps is PCIe Posted Write bandwidth. As our capture devices is focused on Write to DDR I have not tested what the Max DDR Read bandwidth would be, its quite possible the system runs out of PCIe Tags at which point peek Read bandwidth would suffer.

Probably the only way to prefech data into the L3 is via the CPU, but that assumes the problem is L3 / DDR miss not something else. Would be interested if you limit the Tx buffer address to be < total L3 size. e.g. is the problem L3 -> DDR miss or something else.

fmadio commented 8 years ago

Also, for the Max PCIe/MLX4 green line. Looks like your off by 1 64B line some how ?

lukego commented 8 years ago

This is an absolutely fascinating problem. Can't put it down :).

@fmadio Great info! So on the receive path the NIC uses "posted" (fire and forget) PCIe operations to write packet data to memory but on the transmit path it uses "non-posted" (request/reply) operations to read packet data from memory. So the receive path is like UDP but the transmit path is more like TCP where performance can be constrained by protocol issues (analogous to window size, etc).

I am particularly intrigued by the idea of "running out of PCIe tags." If I understand correctly the number of PCIe tags determines the maximum number of parallel requests. I found one PCIe primer saying that the typical number of PCIe tags is 32 (but can be extended up to 2048).

Now I am thinking about bandwidth delay products. If we know how much PCIe bandwidth we have (~220M 64B cache-lines per second for 112Gbps) and we know how many requests we can make in parallel (32 cache lines) then we can calculate the point at which latency will impact PCIe throughput:

delay  =  parallel / bandwidth  =  32 / 220M per sec  =  146 nanoseconds

So the maximum (average) latency we could tolerate for PCIe-rate would be 146 nanoseconds per cache line under these assumptions.

Could this be the truth? (Perhaps with slightly tweaked constants?) Is there a way to check without a hardware PCIe sniffer?

I made a related visualization. This shows nanoseconds per packet (Y-axis) based on payload-only packet size in cache lines (X-axis). The black line is the actual measurements (same data set as before). The blue line is a linear model that seems to fit the data very well.

rplot05

The slope of the line says that each extra cache line of data costs an extra 6.6 nanoseconds. If we assumed that 32 reads are being made in parallel then the actual latency would be 211 nanoseconds. Comparing this with the calculated limit of 146 nanoseconds for PCIe line rate we would expect to achieve around 70% of PCIe line rate.

This is a fairly elaborate model but it seems worth investigating because the numbers all seem to align fairly well to me. If this were the case then it would have major implications i.e. that the reason for all this fussing about L3 cache and DDIO is due to under-dimensioned PCIe protocol resources on the NIC creating artificially tight latency bounds on the CPU.

(Relatedly: The Intel FM10K uses two PCIe x8 slots ("bifurcation") instead of one PCIe x16 slot. This seemed awkward to me initially but now I wonder if it was sound engineering to provision additional PCIe protocol/silicon resources that are needed to achieve 100G line rate in practice? This would put things into a different light.)

wingo commented 8 years ago

Do I understand correctly that these are full duplex receive and transmit tests, and that they are being limited by the transmit side because of the non-posted semantics of the way the NIC is using the PCIe bus?

lukego commented 8 years ago

No and maybe, in that order ;-). This is transmit-only (packetblaster) and this root cause is not confirmed yet, just idea de jour.

Some more details of the setup over in #1007.

lukego commented 8 years ago

The NIC probably has to use posted requests here - read request needs a reply to get the data - but maybe it needs to make twice as many requests in parallel to achieve robust performance.

lukego commented 8 years ago

@wingo A direct analogy here is if your computer couldn't take advantage of your fast internet connection because it had an old operating system that advertises a small TCP window. Then you would only reach the advertised speed if latency is low e.g. downloading from a mirror at your ISP. Over longer distances there would not be enough packets in flight to keep the pipe full.

Anyway, just a theory, fun if it were true...

fmadio commented 8 years ago

A few things.

1) Pretty much all devices support "PCIe Extended Tags" which add a few more bits so you can have alot more transactions in flight at any one time. E.g. think about GPU`s reading crap from system memory ... nvidia, intel & co have a lot of smart ppl working on this.

2) In practice you`ll run out of PCIe credits first. This is a flow control / throttling mechanism that allows the PCIe UnCore to throttle the data rate, so the UnCore never drops a request. For both Posted & Non-Posted requests, it gets split further into credits for headers and credits for data.

3) Latency is closer to 500ns RTT last time I checked, putting it half that going one way. Keep in mind for non-posted reads from system DDR its a request to PCIe UnCore, then response so full RTT is more appropriate. Of course these are fully pipelined requests so 211ns sounds close.

For 100G packet capture we dont care about latency much, just maximum throughput thus havent dug around there much. We`ll add full nano accurate 100G line rate PCAP replay in a few months at which point latency and maximum non-posted read bandwidth becomes important.

4) All of this is pretty easy test with an fpga. Problem is I dont have time mess around with this at the moment.

lukego commented 8 years ago

@fmadio Thanks for the info! I am struck that "networks are networks" and all these PCIe knobs seem to have direct analogies in TCP. "Extended tags" is window scaling, "credits" is advertised window, bandwidth*delay=parallel constraint is the same. Sensible defaults change over time too e.g. you probably don't want to use Windows XP default TCP settings for a 1 Gbps home internet connection. (Sorry, I am always overdoing analogies.)

So based on the info from @fmadio it sounds like my theory from the weekend may not actually fit the constants but let's go down the PCIe rabbit hole and find out anyway.

I have poked around the PCIe specification and found a bunch of tunables but no performance breakthrough yet.

Turns out that lspci can tell us a lot about how the device is setup:

# lspci -s 03:00.0 -vvvv
...
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
        MaxPayload 256 bytes, MaxReadReq 512 bytes
...

Observations:

ExtTag (PCIe extended tags) is supported by the device in DevCap.
ExtTag is however disabled in DevCtl.
More interesting-looking tunables exist in DevCap: MaxPayload and MaxReadReq.

I have tried a few different settings (e.g. ExtTag+ and MaxPayload=512 and MaxReadReq=4096) but I have not observed any impact on throughput.

I would like to check if we are running out of "credits" and that is limiting parallelism. I suppose this depends on how much buffer space the processor uncore is making available to the device. Guess the first place to look is the CPU datasheet.

I suppose that it would be handy to have a PCIe sniffer at this point. Then we could simply observe the number of parallel requests that are in flight. I wonder if there is open source Verilog/VHDL code for a PCIe sniffer? I could easily be tempted to buy a generic FPGA for this kind of activity but a single-purpose PCIe sniffer seems like overkill. Anyway - I reckon we will be able to extract enough information from the uncore performance counters in practice.

BTW lspci continues with more parameters that may also include something relevant:

DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
         Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-

lukego commented 8 years ago

Just a note about the server that I am using for testing here (lugano-3.snabb.co):

CPU: E3-1650v3 (6 cores @ 3.5 GHz, 15MB L3 cache)
NIC: 2 x ConnectX4 100GbE (each in a separate PCIe 3.0 x16 slot)
RAM: 4 x 8GB DDR4 (2133 MHz)

Could be that we could learn interesting things by testing both 100G ports in parallel. Just briefly I tested with 60B and 1500B packets. In both cases the traffic is split around 50/50 between ports. On 1500B I see aggregate 10.75 Mpps (well above single-port rate of ~6.3 Mpps) and on 60B I see aggregate 76.2 Mpps (only modestly above the single-port rate of 68 Mpps).

fmadio commented 8 years ago

HDL these days is almost entirely packet based, all the flow control and processing inside those fancy asic`s are mostly packet based. So all the same algos are there, different names and format but a packet is still a packet regardless of it contains a TCP header or QPI header.

Surprised the devices shows up as x16, means you`ve got PLX chip there somewhere acting as bridge. It should be 2 separate and distinct pcie devices.

You can`t just make a PCIe sniffer, a bridge would be easier. You realize a Oscilloscope that capable of sampling PCIe3 signals will cost $100-$500K ? those things are dam expensive. PCIe sniffer will "only" cost a meager $100K+ USD.

On the FPGA side monitoring the credits is pretty trivial. Forget if the Intel PCM kit has anything about PCIe Credits or monitoring Uncore PCIe FIFO sizes. Its probably there somewhere, so if you find something would be very cool to share.

lukego commented 8 years ago

@fmadio ucevent has a mouth-watering list of events and metrics. I am working out which ones are actually supported on my processor and to untangle CPU/QPI/PCIe ambiguities. Do any happen to catch your eye? This area is obscure enough that googling doesn't yield much :-).

kbara commented 8 years ago

That is incredibly shiny.

fmadio commented 8 years ago

wow im blinded by the shinyness, very cool.

R2PCIe.* looks interesting, would have to read the manual to work out what each actually mean.

lukego commented 8 years ago

Coming full circle here for a moment, the actionable thing is that I want to decide how to write the general purpose transmit/receive routines for the driver:

Optimize for CPU-efficiency.
Optimize for PCIe-efficiency (w/ extra work for the CPU).
Something else e.g. complicated heuristics, knobs, etc.

The default choice seems to be (1). However this may not really provide more than ~70G of dependable bandwidth. It would be nice to have more than this with a 100G NIC.

If the source of this limit could be clearly identified then it may be reasonable to work around it in software e.g. with extra memory access to ensure that the TX path is always served from L3 cache. However, without a clear picture this could easily do more harm that good, e.g. by taking cache resources away from the receive path that I have not benchmarked yet.

Mellanox's own code seems to be more along the lines of (3) but the motivations are not completely transparent to me. I am told that they can achieve up to 84 Mpps with 64B packets but I am not sure what this means e.g. if performance drops steeply when switching from 64B to 65B packets. (The utility of 64B packet benchmarks is when they show "worst case scenario" performance but in my tests so far this seems more like an unrepresentatively easy workload for the NIC which may be limited by I/O transfers rather than per-packet processing.)

lukego commented 8 years ago

Just another low-level detail that I noticed in passing but don't want to dig into right now:

The Xeon E5 data sheet (volume 1, volume 2) includes register definitions for extremely detailed settings like allocation of PCIe credits. Example:

screenshot - 091316 - 12 58 03

I believe that these settings are traditionally only tweaked by the BIOS but they seem to be accessible to Snabb. The Xeon Uncore presents its configuration options as virtual PCIe devices. The example above says "bus 0, device 5, function 0" which is PCI address 00:05.0:

$ lspci -s 00:05.0
00:05.0 System peripheral: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Address Map, VTd_Misc, System Management (rev 02)

and those registers can be accessed via sysfs the same way that we access the NIC. For example to read that PCIe egress credits register using a generic command-line tool:

$ sudo setpci -s 00:05.0 840.l
8c415b41

Could come in handy one day.

fmadio commented 8 years ago

cool thanks for sharing.

might also find the following discussion relevant. Uncore freq scaling really only helps for extreme microbursts and likely unrelated to anything here. Would suggest tho is monitoring your uncore frequency in realtime, its pretty trivial via MSR`s

https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/600913

lukego commented 8 years ago

I have done some experimenting with uncore counters and L3 cache behaviors. The counters seem to be working, I can measure the average access times on the memory subsystem (CBo Table of Requests) and see impact on these with driver changes, but I have not found any direct link to performance. However...

Here is a "that's funny..." issue that I neglected to pause and consider earlier:

These "plateaus" of ours have been explained as cache-lines, which has seemed natural because the width matches and this is the unit of granularity of the memory subsystem that is feeding PCIe. However, if we look more closely we can see that the alignment of the plateaus is off by exactly 4 bytes. The drops are not happening at 64, 128, 192 but rather at 60, 120, 188.

Here is a zoomed in view of a small segment to illustrate:

rplot06

We might have expected the plateaus to end at 512 and 576 but in fact they are at 508 and 572. This pattern is consistent throughout the test results.

How to explain this?

Is there something on the PCIe messaging layer that would account for the 60-byte alignment?

If not then could it point to an issue on the NIC? One of the next processing steps after DMA will be to calculate and append a 4-byte CRC to the packet so a packet that is 60B for the CPU and PCIe will at some point be 64B in the NIC.

Whaddayasreckon?

lukego commented 8 years ago

One more alignment issue that I neglected to mention earlier because it is complex :).

I have been counting the ConnectX-4 transmit descriptor as 1 cache line of overhead. However the situation is a little more complex than this:

Transmit descriptor is 64B.
Transmit descriptor includes 16B of "inline" payload (copied by the driver).
DMA pointer starts at packet.data+16 to skip the bytes already in the descriptor.

So what overhead does this translate to on the PCIe layer?

Answer 1: 48 bytes.The descriptor is 64B but we skip DMA'ing the first 16B of the payload.
Answer 2: 64 bytes. The packet.data address is aligned to a cache line, and PCIe is transferring whole aligned cache lines, so the first 16 bytes are transferred both in the descriptor and the payload (as padding).

which makes most sense? or something else?

fmadio commented 8 years ago

to be clear its something like this

struct { 40b descriptor stuff / size / blah 8b payload io addresss 16b first payload }

struct { 60B of payload data (including duplicated 16bit in descriptor) } (e.g. for 60B payload)

What happens if every single Tx packet you send, you make the payload IO address (in the descriptor) split across a 64B boundary? and compare it to nicely aligned single cache line address.

This is a nice pic from anandtech.

http://images.anandtech.com/doci/8423/HaswellEP_DieConfig.png?_ga=1.10720884.1957720941.1472434760

fmadio commented 8 years ago

for the graphs. when you say its a 508B packet, do you mean 508B of payload + 4B FCS? or do you mean 508B includes the FCS.

Its also possible the controller by default assumes the FCS is in the payload size, then the descriptor has a flag to tell it generate it. In the 99.9% of case the extra 4B fetched over DMA is 0x00000000/garbage, but when you disable hardware FCS the driver or kernel or someone calculates it

lukego commented 8 years ago

I am running some new tests to see how the edge of a plateau is affected by each DMA alignment 0..63.

Interesting idea that the NIC could be fetching 4B extra. That would explain the alignment of the plateaus. This would be very rude though :-) because the NIC would have to fetch data beyond the address boundaries that I am supplying.

PacketSize in the graph is always excluding FCS. The documented behavior is that the FCS is always added by the device after DMA, but there may be other undocumented options too. (On RX there is the option to strip or keep the FCS.)

The way I am using descriptors is like you say except that the first 16B is only inline in the descriptor and not included in the separate payload data. (In memory it is all in one place and so I bump the payload pointer in the descriptor to skip the bytes that are already inline.)

The ConnectX-4 descriptor format is actually made up of different kinds of segments:

struct control     { ... }; // 16B
struct ethernet    { ... }; // 32B incl. some inline payload
struct data_gather { ... }; // 16B length+pointer to a payload fragment
struct data_inline { ... }; // variable size payload in the descriptor ring

Then for each packet you have one control segment, then one ethernet segment, then one or more gather and/or inline segments.

Snabb style is keep-it-simple so the payload is always delivered with a single data gather segment. This makes each descriptor exactly 64B and puts the first 16B of payload in the descriptor. (Inlining the ethernet header seems to be mandatory.)

Relatedly...

I am looking at Mellanox's latest descriptor-managing code for reference. They now have a new implementation in DPDK in addition to their OFED and Linux kernel ones. Here are a few notes:

Quite a bit of code: more than 1000 LOC. This bugs me since the descriptor rings are basically just lists of pointer+length buffer locations. But that is me.

They have four separate implementations of the transmit path: mlx5_tx_burst, mlx5_tx_burst_inline, mlx5_tx_burst_mpw, mlx5_tx_burst_mpw_inline. So basically they have two optional features, inline and mpw, and they have written separate transmit routines for each option combination.

The default for 100G is mlx5_tx_burst and for lower speeds ("Lx" ASIC) is mlx5_tx_burst_mpw. Seems the mpw feature is not available on 100G.

The inline mode can be selected by user configuration. If enabled this means the driver will copy the entire packet into the descriptor area. Their recommendation is to do this if you are more worried about PCIe back-pressure than CPU capacity. (Guessing this only makes sense for really simple programs that don't need their CPU and L3 resources for other things e.g. packet forwarding benchmarks.)

The mpw mode is interesting. This hardware feature is completely undocumented as far as I can tell. From reading the code it looks like a feature for compressing transmit descriptors by packing them together in groups of 5. This could be nice since the descriptors are quite bloated. On the other hand I am concerned about line 1014. If I am interpreting this correctly then this optimization only works for consecutive packets that are exactly the same size. Is that the correct interpretation? If so then this would run the risk of performing differently in benchmarks than in real life. May be worth reporting as a bug in that case.

lukego commented 8 years ago

@fmadio btw thanks for the link to the Intel forum about people seeing strange PCIe performance with a CPU much like mine (low-core-count Haswell). It would be nice to rule out an issue like this but I don't immediately have another server with 100G to compare. For now I have taken the basic precaution of installing the latest CPU microcode release (no observable change).

I do have different servers with 10G/40G cards that I will test, but in due course...

virtuallynathan commented 8 years ago

Wonder if Mellanox has any ConnectX-5 cards available you could compare with for shits and giggles.

lukego commented 8 years ago

Wow. Mellanox really do have special-case optimizations for when consecutive packets are the same size. This kind of optimization would seem to reduce the utility of simple benchmarks based on constant-size packets for predicting real-world performance.

This specific case only applies to the Lx ASIC (non-100G) and when using the DPDK driver. However, if vendors are putting this kind of benchmark-targeted optimization into their silicon then perhaps it is not useful to test with fixed-size packets and we need to use e.g. IMIX workloads as @mwiget always does.

plajjan commented 8 years ago

I think tests with fixed and variable sized packets are useful, if nothing else to spot patterns just like the ones I imagine will happen due to the optimization you mention Mellanox has implemented.

lukego commented 8 years ago

I am running a new benchmark that slightly varies the packet sizes. Instead of 100,100,100,100,... it sends 100,99,100,101,... so the average is the same but the value is not perfectly predictable. Expectation is that this will not affect performance but worth checking.

I have the results for different alignments now. This is a 92K row CSV to separately test every packet size and every alignment :). The results are actually quite satisfying.

Here is a broad overview of all results with each alignment being plotted in a different color. The alignment 0..63 is the offset of the first byte of packet data from the beginning of a cache line.

rplot08

Here we can see that alignment does have some effect at smaller packet sizes (below ~320) but no effect at higher sizes.

Here is the happy news: it turns out that the confusing "up and down" results we have seen in the past can be explained by data alignment. Here we compare performance with 0B alignment (as in the original tests) compared with 48B alignment:

rplot09

The red line (0B alignment) is bouncing up and down confusingly while the blue line (48B alignment) is following an orderly stepwise pattern.

There is also a reasonable explanation for why 48B alignment works better. The first 16B of the packet is being inlined into the descriptor and the DMA for the rest is starting at address+16. So if we allocate our packets directly on a cache line then the DMA will be 16B aligned. But if we allocate our packets at a 48B offset then the DMA will be perfectly aligned with the start of the second cache line.

Food for thought. It seems like we may want to consider allocating our packets on a different alignment when we are working with ConnectX-4 NICs. This would also have implications for work like igalia/snabb#407 where the alignment of a packet can be moved during processing (e.g. when repositioning the start of a packet to efficiently add/remove a header.) If this seems like too much trouble then we could reconsider the possibility of using "inline descriptor" mode i.e. copying all packet data into the descriptor ring which should maximize I/O performance at the expense of CPU.

lukego commented 8 years ago

I am thinking seriously about indirect descriptors. The cost would be a full packet copy on both the transmit and receive paths. The benefit would be increased PCIe efficiency.

This would seem to be in line with the end to end principle in that it would isolate all the complex optimizations, e.g. weird DMA alignment requirements, locally inside the driver. This driver code could be optimized aggressively e.g. written in assembler if necessary.

The reason I suggest this possibility is that the performance we are seeing is really meh. Just now it looks like we can only depend on around 70% utilization of the link (see graph above). This is already low and likely to drop further when we test with full-duplex. So maybe it is time for desperate measures unless we can find an alternative boost.

lukego commented 8 years ago

Just another thought...

PCIe efficiency squeeze could be the new reality in the 25G/50G/100G era. 10G NICs usually have 16G of PCIe bandwidth (x2) while 100G NICs have 128G bandwidth (x16). So the ratio of ethernet bandwidth to PCIe bandwidth has dropped from 1.6x to 1.28x. This is unlikely to improve e.g. PCIe 4.0 will give us 256G bandwidth but we will want to use that for 200G (2x100G) NICs.

Or alternatively the world may decide that this PCIe bandwidth margin is too tight and that future NICs should have twice as many lanes for the same number of ethernet ports.

I wonder which way that wind will blow?

fmadio commented 8 years ago

Congrats on finding the reason for non pow2 packet size stepping. Makes total sense.

Would expect 100G NICs to go Gen4x8. Tho, that assumes the CPU will provide the same number of lanes as the Gen3 chips. Given that most NICs cant do line rate anything on the first few generations, its not that surprising.

plajjan commented 8 years ago

@lukego probably depends on use case but given pps of applications and the CPU consumed for that it would seem to me that at least we SPs doing VNFs would be more inclined to have a CPU per 100G. That's ~24 cores now/soon. 148Mpps / 24 = ~6Mpps per core so overall a CPU per 100G seems quite a good fit. With a dual socket machine we'd use two PCI cards and one 100G port per NIC. Thus increase in PCI bandwidth would be useful for us.

If you go talk to DC guys I am sure they will have a different story because they do more storage with large packets where pps doesn't matter but throughput will, so they are more inclined to max out PCI bus bandwidth than CPU.

lukego commented 8 years ago

... Just arguing the other way again, thinking out aloud ...

Reasons to resist the inline descriptor mode:

The efficiency is not necessarily much better in practice. In principle you only need about 2 bytes of overhead (16-bit length followed by payload, repeat) but in practice on ConnectX-4 there is about 40B of metadata and up to 63 bytes of padding (64B alignment). So the amount of data being transferred over PCIe would be much the same. The data would be laid out linearly which may or may not help.
Could be that ~70% of line rate is acceptable when we are looking at the first NIC on the market.
Could be that optimization efforts are better spent on other cards e.g. the ConnectX-5 that seems to have been released now (?) that promises better performance and may support new DMA styles.

@plajjan How about the alternative, as @sleinen alluded to earlier, of having 2x100G per processor and expecting to do ~ 70% utilization of each? Then you would have 280G of bandwidth per dual-socket server. However you would need to use a switch to spread load across your ports rather than plumbing them directly into a $megabucks 100G router port. (Do high-end routers offer compatible 100G ports yet, anyway?)

lukego commented 8 years ago

Just to put the DMA alignment issue into context, here is the difference when the Y-axis is the % of line rate achieved:

rplot10

So it is a definite improvement but the impact is only really seen with small packet sizes where performance is not practical in either case. Looks like you have to hit around 320B per packet before you can depend on ~70% of line rate.

lukego commented 8 years ago

@plajjan Continuing to think out aloud...

Reality check: Each CPU has 320G of PCIe bandwidth (40 lanes @ 8Gbps). PCIe bandwidth is the least scarce resource at the hardware level. On a fundamental level it is crazy that we are talking about burning CPU resources (always precious) to conserve PCIe bandwidth when most of our PCIe lanes are not even connected.

So this is actually a sign of a problem somewhere else e.g. limitations in the NIC silicon or that PCIe 3.0 x16 is not a suitable choice for 100GbE. In the latter case the solution could be PCIe 4.0 or e.g. a special riser card that allows the card to consume two x16 slots.

plajjan commented 8 years ago

@lukego going out and buying some switch to aggregate some 70% 100GE PCI NICs feels like a hack and I don't think the vision for Snabb's 100G support should be based on the premises of a hack. Isn't it trading (relatively) expensive switch port for cheap PCI lanes?

There are certainly compatible router ports out there. Did you mean competitive? Still more expensive by quite a fair margin.

Blargh, why can't I read all the comments before writing... yes, PCI seems least scarce, so I'd rather be a bit wasteful on that side.. but like I said; DC guys probably want to use that I/O!

snabbco / snabb

100G Packet Rates: Per-CPU vs Per-Port #1013

DMA/PCIe

Ethernet MAC/PHY