Closed by lukego 11 years ago
Here are additional things you need to do in order to run snabbswitch and reproduce the problem:
If the NIC you want to test with does not have PCI address 0000:00:04.0 (which is likely), substitute the actual address in the command above and in src/selftest.lua. Use lspci -v to find the right card.
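For anyone hunting for the right address, here is a minimal sketch of filtering lspci output down to Ethernet controllers. The helper name ethernet_addrs is made up for illustration and is not part of snabbswitch; it assumes the usual lspci line format, where the address is the first column and the class name reads "Ethernet controller".

```shell
# Hypothetical helper (not part of snabbswitch): reads `lspci -D` style
# output on stdin and prints the PCI address of each Ethernet controller.
# Assumes the address is the first whitespace-separated column.
ethernet_addrs() {
    awk '/Ethernet controller/ { print $1 }'
}

# Usage on a real machine (needs pciutils installed):
#   lspci -D | ethernet_addrs
# The -D flag makes lspci always print the domain part (e.g. 0000:).
```

If more than one address is printed, lspci -v on each one should help tell the spare port from the one carrying your ssh session.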
Here is an update on how to experiment, written in response to the first interested user!
The Snabb Lab: I have two EX6 servers colocated at hetzner.de that each have one spare Intel ethernet port and have a cross-cable connecting them together. The hostnames are arbon.snabb.co and bern.snabb.co. I'm currently developing inside a VM on arbon and occasionally running tcpdump/dstat/ifconfig/etc on bern to check what's being output.
To connect to the development VM (running Ubuntu v12 "cloud image") on arbon use:
ssh -p 54322 $user@arbon.snabb.co
and to connect to bern use:
ssh $user@snabb.co
where $user is an account that I have created for you (just leave a comment here with your ssh public key if you want one for trying out the switch).
Given today's selftest workload I estimate the NIC is only in use about 1% of the time, so this lab setup should be able to scale up to a few people. Here's an important tip to avoid multiple people running snabbswitch at the same time (which would be very confusing): instead of "./snabbswitch" always run "flock -x /tmp/snabb.lock ./snabbswitch". This way only one process will run at a time and any extra ones will automatically block until it is their turn.
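The locking tip above could be wrapped up roughly like this. It is only a sketch: the function name run_locked and the LOCKFILE override are made up for illustration, and the default lock path is the one suggested above.

```shell
# Hypothetical wrapper (not part of snabbswitch): run a command while
# holding an exclusive lock so that only one run is active at a time.
# LOCKFILE is an illustrative override; the default is the path from the tip.
run_locked() {
    # flock -x takes an exclusive lock on the file and blocks until it is
    # free, then runs the given command and releases the lock on exit.
    flock -x "${LOCKFILE:-/tmp/snabb.lock}" "$@"
}

# Usage (not executed here):
#   run_locked ./snabbswitch
```

Blocking is the polite default on a shared machine. If you would rather fail immediately than queue, flock also has a -n (non-blocking) option.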
Please leave the servers as you found them! No wild global configuration changes etc. Do whatever you want in your home directory, and feel free to sudo and install software you need. Mail luke@snabb.co if something crashes or needs rebooting, no stress :)
rahul@serverstack.info commented by mail that the Intel e1000 driver in Linux is very well-debugged, high-quality code. Definitely a good resource to compare against to understand where the snabb switch bug is.
Great!
I see that changes I made to the selftest procedure (now calling selftest2() instead of selftest() in intel.lua) mean this Issue doesn't reproduce out of the box. Sorry about that. I will try to make a fix now so that running the switch reproduces the problem again. Update to follow.
Does the code compile and run for you btw?
Hi Luke,
On 01/07/2013 07:56 PM, Luke Gorrie wrote:
> I see that changes I made to the selftest procedure (now calling selftest2() instead of selftest() in intel.lua) means this Issue doesn't reproduce out of the box. Sorry about that. I will try to make a fix now so that running the switch reproduces the problem again. Update to follow.
> Does the code compile and run for you btw?
I can confirm that the selftest2() procedure seems to be working, but the selftest() procedure is not printing any statistics.
Regards, Rahul
Wow cool that it runs! :-) You are the second person after me to run the switch!
Looks like I have broken selftest() quite a bit with recent hacking. I will now extend selftest2() to also support receive and then we can try to reproduce the problem with that.
Does the code make any sense btw? I am still learning Lua and I think especially the way I'm doing object-oriented programming - lots of "M." prefixes - is a bit clunky and can be better.
On 01/07/2013 08:20 PM, Luke Gorrie wrote:
> Wow cool that it runs! :-) You are the second person after me to run the switch!
He He 8-)
> Looks like I have broken selftest() quite a bit with recent hacking. I will now extend selftest2() to also support receive and then we can try to reproduce the problem with that.
OK.
> Does the code make any sense btw? I am still learning Lua and I think especially the way I'm doing object-oriented programming - lots of "M." prefixes - is a bit clunky and can be better.
I haven't played with Lua much (I mostly program in Python and D). Since Lua wasn't really designed for heavy-duty OOP, I guess it'll always look a bit awkward (but hey, it works!). The code does make sense btw :-)
Looking forward to the updated selftest.
Regards, Rahul
OK! The updated selftest is checked in now with commit b3867caff51e261769822bb6d55de6a37947884d.
Now selftest2() is extended to also handle RX and is renamed to selftest() replacing the old one.
The problem that shows up now is that the transmit+receive+loopback test drops most of the packets. Do you see this too? I don't know why that is, but it's a bug that would be good to fix. You are welcome to have a look :). Probably best to create a new Issue.
The original problem from this issue doesn't seem to be reproducible now? Could be that it was fixed by changes to the logic that says when descriptor rings are full/empty (I think I fixed stuff there last week), or could be that it still exists and I'm just not seeing it.
btw: another interesting but larger thing to hack on in this source file is the add_txbuf_tso() function, which is currently just a stub. The goal is to use the TCP segmentation offload hardware features so that we have a test case that transmits really big packets (~64K) and then (by loopback) receives the same data back as a larger number of smaller packets. This would be a major step towards implementing STT in the future (possibly being the first open source implementation...)
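To give a feel for the numbers in such a test, here is a back-of-the-envelope sketch of how many smaller packets one big TSO buffer would turn into. The 1460-byte segment size is purely an assumed example value (a typical TCP payload on a 1500-byte MTU), and the function name segments is made up; nothing here comes from intel.lua.

```shell
# Hypothetical arithmetic helper: number of segments needed to carry
# `payload` bytes when each segment holds at most `mss` bytes.
# Uses integer ceiling division: (payload + mss - 1) / mss.
segments() {
    payload=$1
    mss=$2
    echo $(( (payload + mss - 1) / mss ))
}

segments 65536 1460   # prints 45
```

So under that assumption a loopback test that transmits one ~64K packet should expect on the order of 45 packets on the receive side.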
Dinner time over here! :-)
I'm having a look at the updated selftest. Will report any interesting findings.
Regards, Rahul
OK, found something interesting:
File: intel.lua, function init_receive(), line 252:
regs[RXDCTL] = bits({GRAN=24, WTHRESH0=16})
This line, which sets the Receive Descriptor Control (RXDCTL) register, was commented out. Uncommenting it drastically cut down the Missed Packets Count, while increasing the Receive No Buffers Count.
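For readers who have not looked at the bits() helper, here is a rough sketch of the value that line would write. It assumes that bits({NAME=n, ...}) simply ORs together 1<<n for every listed bit position (the names being labels only); that is a guess at the helper's behaviour and should be checked against the source. The shell function below is an illustration, not the Lua code.

```shell
# Sketch of the assumed behaviour of bits(): OR together (1 << pos) for
# each bit position given as an argument and print the result in hex.
bits() {
    value=0
    for pos in "$@"; do
        value=$(( value | (1 << pos) ))
    done
    printf '0x%08x\n' "$value"
}

bits 24 16   # GRAN=24, WTHRESH0=16; prints 0x01010000
```

Under that assumption the register ends up with exactly two bits set, bit 24 (GRAN) and bit 16 (the lowest WTHRESH bit).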
BEFORE:
Statistics for PCI device 0000:00:04.0:
1,109,458 MPC Missed Packets Count
80,667 PRC64 Packets Received [64 Bytes] Count
80,667 GPRC Good Packets Received Count
1,190,213 GPTC Good Packets Transmitted Count
5,162,688 GORCL Good Octets Received Count
76,174,336 GOTCL Good Octets Transmitted Count
2 RNBC Receive No Buffers Count
76,176,896 TORL Total Octets Received (Low)
76,177,408 TOTL Total Octets Transmitted (Low)
1,190,279 TPR Total Packets Received
1,190,283 TPT Total Packets Transmitted
1,190,286 PTC64 Packets Transmitted [64 Bytes] Count
AFTER:
Statistics for PCI device 0000:00:04.0:
232,479 MPC Missed Packets Count
818,720 PRC64 Packets Received [64 Bytes] Count
818,734 GPRC Good Packets Received Count
1,051,232 GPTC Good Packets Transmitted Count
52,399,680 GORCL Good Octets Received Count
67,279,488 GOTCL Good Octets Transmitted Count
24 RNBC Receive No Buffers Count
67,281,472 TORL Total Octets Received (Low)
67,281,856 TOTL Total Octets Transmitted (Low)
1,051,283 TPR Total Packets Received
1,051,286 TPT Total Packets Transmitted
1,051,289 PTC64 Packets Transmitted [64 Bytes] Count
Maybe the thresholds in the RXDCTL register need adjustment?
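To put the two runs side by side, here is a small sketch that turns the GPRC and GPTC counters above into a "percentage of transmitted packets received" figure. The function name recv_pct is made up for illustration; the input numbers are copied from the BEFORE and AFTER statistics.

```shell
# Hypothetical helper: percentage of good transmitted packets (GPTC)
# that were counted as good received packets (GPRC), to one decimal place.
recv_pct() {
    awk -v rx="$1" -v tx="$2" 'BEGIN { printf "%.1f\n", 100 * rx / tx }'
}

recv_pct 80667 1190213    # BEFORE: prints 6.8
recv_pct 818734 1051232   # AFTER:  prints 77.9
```

So the loopback receive rate goes from roughly 7% to roughly 78% of transmitted packets. In both runs MPC plus GPRC comes out very close to GPTC (1,190,125 vs 1,190,213 before, 1,051,213 vs 1,051,232 after), so nearly all of the remaining loss is accounted for by the Missed Packets Count.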
Great! Thanks!
So I'm a GitHub newbie and I'm curious to see how it works. Do you think you could send that fix over as a "Pull request" so that we can test the workflow?
OK, I've sent a pull request: https://github.com/SnabbCo/snabbswitch/pull/31
Great, it worked fine! :-D
Congratulations you are the first contributor of a patch :-)
On 01/09/2013 07:01 PM, Luke Gorrie wrote:
> Congratulations you are the first contributor of a patch:-)
Yay! :-)
Luke, do you think this issue should be closed as it is no longer reproducible as originally described?
Yes.
The Intel 82574L device driver's self-test function is showing unexpected counter values. The test attempts to transmit 100,000 packets, and optionally to receive them again with loopback mode. Displaying the hardware counters shows some values that are expected and others that are surprisingly low.
Here are example results when attempting to transmit 100,000 packets of 1000 bytes each:
Why do some counters show 100,000 and others only 54,306?
Here are similar results with MAC loopback mode engaged:
Curious that TPR shows 100,000 while TPT, GPRC, GPTC all show less.
Here's how to reproduce the problem:
Check out and build snabb switch:
$ git clone --recursive git@github.com:SnabbCo/snabbswitch.git
$ cd snabbswitch
$ make
...
Firmware: 536K snabbswitch
Run the self-test:
$ src/snabbswitch
Here are ideas for how to investigate: