snabbco / snabb

Snabb: Simple and fast packet networking
Apache License 2.0

ConnectX: Review N*SQ 64B transmit performance mellanox (Rev 2) #1007

Open lukego opened 8 years ago

lukego commented 8 years ago

This benchmark report for single-core transmit ("packetblaster") over multiple Send Queues supersedes #1006. It is based on a new Snabb version that performs better because it allocates DMA memory differently.

[Figure: benchmark results]

These results look much clearer and simpler.

Summary:

lukego commented 8 years ago

I am reading an absolutely fascinating book called Statistical Modeling: A Fresh Approach. It's about how to make models (functions) that succinctly account for the variations in data sets. The notion is that formal models can be used to capture details that would be too complex to present visually.

I am using this ConnectX-4 data set as a first modeling exercise. The idea is to make a model that accounts for how performance (Mpps) depends on other variables, focusing on SendQueues, which has the biggest effect.

Here is what the raw data looks like. (The points appear in groups because several different queue lengths are tested.)

[Figure: raw]

The exercise here is to make a series of models that try to account for the Mpps values.

Constant

The first naive model supposes that Mpps is always the same. This can be expressed by taking a linear model (lm) for the formula Mpps ~ 1 and "fitting" that with the value that best suits this data:

> m1 <- lm(Mpps ~ 1, data = d)
> coefficients(m1)
(Intercept) 
   60.11094 

This model predicts that performance will always be 60.1 Mpps. We can visualize the model as a line overlaid on the data:

[Figure: constant]

On the one hand this is useful information but on the other hand it does not account for any of the variations in the measured values.
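To make the "fitting" step concrete, here is a minimal sketch in plain Python rather than R. It assumes a small made-up list of Mpps values (not the data set from this issue) and shows that the least-squares fit of a constant-only model is simply the mean: moving the constant in either direction increases the squared error. The helper name `sse` is hypothetical.

```python
from statistics import fmean

# Hypothetical Mpps measurements, NOT the data set from this issue.
mpps = [15.9, 31.5, 47.4, 63.1, 65.3, 65.0, 64.8]

def sse(c, ys):
    """Sum of squared errors when every value is predicted as c."""
    return sum((y - c) ** 2 for y in ys)

# Least-squares estimate for the constant-only model (Mpps ~ 1 in R).
intercept = fmean(mpps)

# Nudging the constant in either direction only makes the fit worse.
assert sse(intercept, mpps) < sse(intercept + 0.5, mpps)
assert sse(intercept, mpps) < sse(intercept - 0.5, mpps)

print(f"constant model: Mpps = {intercept:.2f}")
```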

Linear

The second model supposes that Mpps increases linearly, i.e. that it follows a straight line. The line is defined by the "intercept" (the notional Mpps with 0 send queues) and a slope (how much Mpps changes for each added send queue).

> m2 <- lm(Mpps ~ SendQueues, data = d)
> coefficients(m2)
(Intercept)  SendQueues 
 49.6810481   0.8343916 

[Figure: linear]

This model predicts that baseline performance is 49.681 Mpps and then increases by 0.834 Mpps for each send queue.

This may or may not be a step in the right direction but it definitely does not fit the data very closely. The goodness of the fit can be quantified with a statistic called "R squared" that tells us what fraction of the variation in Mpps values is accounted for by the model. The answer here is 25%.

> summary(m2)$r.squared
[1] 0.2506815
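As an illustration of what the intercept, slope, and R squared mean, here is a minimal sketch in plain Python. The data points are made up for the example (they are not the measurements from this issue), and the helper names `fit_line` and `r_squared` are hypothetical. It fits a straight line by ordinary least squares and then computes R squared as one minus the ratio of residual variation to total variation.

```python
from statistics import fmean

# Hypothetical (SendQueues, Mpps) pairs, NOT the measurements from this issue.
queues = [1, 2, 3, 4, 5, 6, 7, 8]
mpps = [15.9, 31.5, 47.4, 63.1, 65.3, 65.0, 64.8, 64.9]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(xs, ys, intercept, slope):
    """Fraction of the variation in ys that the fitted line accounts for."""
    my = fmean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

intercept, slope = fit_line(queues, mpps)
r2 = r_squared(queues, mpps, intercept, slope)
print(f"Mpps = {intercept:.2f} + {slope:.2f} * SendQueues, R^2 = {r2:.3f}")
```

A straight line fits data with a plateau poorly, which is why R squared stays well below 1 for this kind of shape.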

Segmented

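The last model supposes that Mpps follows a segmented line: the effect of adding send queues changes at a certain "break" point. The "-1" in the formula drops the intercept, so the first segment passes through the origin (0 send queues, 0 Mpps).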

> m3 <- segmented(lm(Mpps~SendQueues-1, data=d), seg.Z=~SendQueues)

[Figure: segmented]

The "fit" tells us that we have two lines with different slopes. Initially each send queue accounts for a 15.8 Mpps increase in performance. Later each send queue accounts for a slight decrease of 0.14 Mpps. The "break point" is after ~4 send queues.

> slope(m3)
$SendQueues
          Est.
slope1 15.8100
slope2 -0.1417
> summary(m3)
...
Estimated Break-Point(s):
   Est. St.Err 
 4.153  0.021 
...
Multiple R-Squared: 0.9998,  Adjusted R-squared: 0.9998 

This model looks much more satisfying to me. This is quantified in the R-squared statistic, which says we are now accounting for 99.98% of the variation in Mpps. We could add more details to the model to make it fit the data even better, but for the purpose of this exercise I feel that we have reached the point of diminishing returns.
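For readers who want to see what fitting a break point involves, here is a minimal sketch in plain Python. It does not reproduce the estimation procedure of the R segmented package; it swaps in a brute-force grid search over candidate break points, solving a two-coefficient least-squares problem (no intercept) for each one. The data is synthetic, generated from a made-up model (slope 16, then slope -0.1, break at 4), and the function name `fit_segmented` is hypothetical.

```python
def fit_segmented(xs, ys, candidates):
    """Grid search: for each candidate break, least-squares fit of
    y = b1 * x + d * max(x - brk, 0) (no intercept); keep the best."""
    best = None
    for brk in candidates:
        hinge = [max(x - brk, 0.0) for x in xs]
        # Normal equations for the two coefficients b1 and d.
        a11 = sum(x * x for x in xs)
        a12 = sum(x * h for x, h in zip(xs, hinge))
        a22 = sum(h * h for h in hinge)
        c1 = sum(x * y for x, y in zip(xs, ys))
        c2 = sum(h * y for h, y in zip(hinge, ys))
        det = a11 * a22 - a12 * a12
        if abs(det) < 1e-12:
            continue  # hinge term carries no information for this break
        b1 = (c1 * a22 - c2 * a12) / det
        d = (a11 * c2 - a12 * c1) / det
        sse = sum((y - b1 * x - d * h) ** 2
                  for x, h, y in zip(xs, hinge, ys))
        if best is None or sse < best[0]:
            # Slope after the break is b1 + d.
            best = (sse, brk, b1, b1 + d)
    _, brk, slope1, slope2 = best
    return brk, slope1, slope2

# Synthetic data from a made-up model, NOT the measurements from this issue.
queues = list(range(1, 11))
mpps = [16.0 * q if q <= 4 else 64.0 - 0.1 * (q - 4) for q in queues]

candidates = [i / 10 for i in range(20, 81)]  # 2.0, 2.1, ..., 8.0
brk, slope1, slope2 = fit_segmented(queues, mpps, candidates)
print(f"break = {brk:.1f}, slope1 = {slope1:.3f}, slope2 = {slope2:.3f}")
```

A grid search is easy to reason about but only as precise as the grid; a dedicated package also reports standard errors for the break point, as in the summary output above.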

Summary

So there we have it: our best model of ConnectX-4 performance says that the first four send queues give you 15.8 Mpps each and adding more beyond that has a slightly negative effect.
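The fitted model can be turned into a small prediction function. This is a sketch in plain Python using the coefficients printed by slope(m3) and summary(m3) above; it assumes the line passes through the origin (because the formula has no intercept) and is continuous at the break point. The function name `predict_mpps` is hypothetical.

```python
# Coefficients copied from the slope(m3) and summary(m3) output in this thread.
SLOPE1 = 15.8100      # Mpps per send queue before the break
SLOPE2 = -0.1417      # Mpps per send queue after the break
BREAK_POINT = 4.153   # estimated break point, in send queues

def predict_mpps(send_queues):
    """Predicted Mpps for a segmented line through the origin.

    Assumes no intercept and continuity at the break point.
    """
    if send_queues <= BREAK_POINT:
        return SLOPE1 * send_queues
    peak = SLOPE1 * BREAK_POINT
    return peak + SLOPE2 * (send_queues - BREAK_POINT)

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} send queues -> {predict_mpps(n):.1f} Mpps")
```

Under these assumptions the model peaks at roughly 65.7 Mpps at the break point and declines slowly from there.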

I believe that Mellanox claims the maximum packet rate for this card is around 90 Mpps. This leads me to think that there are some other factors that we could include in our tests (descriptor formats, etc.) that would produce values that don't fit this model. Future steps are to identify these factors, measure them, and update the model to account for them.

Reflections

Just now it seems to me like statistical modeling is a promising approach to characterizing system performance. I would be very happy if network equipment like NICs would be supplied with models like this to tell me what performance to expect in different configurations. I would be even happier if the constants in these models were derived directly from reproducible benchmarks on a CI system.

Maybe in the future Snabb applications could come with such models that tell you what performance to expect based on factors like software version, configuration, clock speed, number of cores, choice of NIC, etc.

plajjan commented 8 years ago

I just want some clarification. This is on a single core, right? And that core is not saturated? So we can't expect a performance increase simply by using more cores, correct?

plajjan commented 8 years ago

Also very good job. Both on the driver implementation and on the analysis. It's always very interesting to read material from writers enthusiastic about a subject. You clearly are and it shows. A very interesting read. Interesting results and interesting ideas for the future.

lukego commented 8 years ago

@plajjan Thanks!

This benchmark is an attempt to establish the performance characteristics of the NIC itself. The tests are always with a single CPU core driving all of the send queues. I have used an especially efficient special-case transmit routine ("packetblaster") to prevent a CPU bottleneck. I have also sanity-checked this by running the test at both 3.5 GHz and 2.0 GHz and not seeing a significant difference in the results. (It would be interesting to add multiple clock speeds into the experiment and then use the model to quantify any effect this may have.)

lukego commented 8 years ago

I think that I should repeat this testing and modeling exercise adding a couple more factors into the tests:

More?

Informally it seems like clock speed does not have much effect (which would confirm that the test is not CPU-bound) and that packet size has a surprisingly large effect (e.g. a 200B packet gets only ~37 Mpps). It could be that the "packets/sec vs bytes/packet" curve is actually problematic in the sense of #1013. I will have to model that to find out :-).
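To put a packet rate in terms of bandwidth, here is a rough arithmetic sketch in plain Python. It assumes that the quoted packet size is the full Ethernet frame including CRC, that each frame costs an extra 20 bytes on the wire (8 bytes of preamble and start-of-frame delimiter plus a 12-byte inter-frame gap), and that the port runs at 100 Gbps. The port speed is an assumption, not something stated in this thread, and the function names are hypothetical.

```python
LINE_RATE_GBPS = 100.0    # ASSUMPTION: 100GbE port, not stated in this thread
WIRE_OVERHEAD_BYTES = 20  # preamble + SFD (8 bytes) and inter-frame gap (12 bytes)

def wire_gbps(mpps, frame_bytes):
    """Bandwidth on the wire, in Gbps, for a given packet rate and frame size."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return mpps * 1e6 * bits_per_frame / 1e9

def line_rate_mpps(frame_bytes, gbps=LINE_RATE_GBPS):
    """Packet rate, in Mpps, at which frames of this size fill the link."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return gbps * 1e9 / bits_per_frame / 1e6

for mpps, size in ((37.0, 200), (65.7, 64)):
    print(f"{size}B at {mpps} Mpps: {wire_gbps(mpps, size):.1f} Gbps on the wire, "
          f"line rate would be {line_rate_mpps(size):.1f} Mpps")
```

Under these assumptions ~37 Mpps at 200B is about 65 Gbps, below the assumed line rate, so the drop would not be explained by the wire alone.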