snabbco / snabb

X540-AT2 driver selftest failing #779

Open harishiitd opened 8 years ago

harishiitd commented 8 years ago

Hello all, I am trying to run the selftest for the X540-AT2 driver with the PF initialization part commented out, and I am hitting the issue below.

"-------
Send a bunch of packets from Am0
half of them go to nicAm1 and half go nowhere
link report:
    53,475 sent on nicAm0.tx -> sink_ms.in1 (loss rate: 0%)
    7,622 sent on nicAm1.tx -> sink_ms.in2 (loss rate: 0%)
    52,444 sent on repeater_ms.output -> nicAm0.rx (loss rate: 0%)
    2 sent on source_ms.out -> repeater_ms.input (loss rate: 0%)
mq_sw: wrong proportion of packets passed/discarded"

Any input on this behaviour would be appreciated.

Regards Harish

lukego commented 8 years ago

I believe the issue is a bug in the selftest function (not in the driver). See also this thread. Basically the selftest function assumes that the NIC will come to "link up" very quickly but the copper 10G NICs take a second or two.

The selftest function in the bug description is sending packets to one port and expecting half of them to arrive on the other port (but none do because the link is not up yet).

The fix would probably be to make selftest wait for linkup before sending packets (e.g. by polling the driver to ask if the link is up, or with a model-specific delay). Currently we don't really have a framework for scheduling events e.g. "wait for event from app and then do " but maybe that would be handy. I believe that Scratch has a simple model that we could consider borrowing.
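A minimal sketch of the "poll until link up, with a timeout" idea, in plain Lua. Everything here is hypothetical: `wait_for_linkup`, the injected `now` and `sleep` functions, and the fake NIC are stand-ins for illustration, not part of the Snabb driver API.

```lua
-- Hypothetical helper: call is_link_up() repeatedly until it returns true
-- or `timeout` seconds have passed. `now` and `sleep` are injected so the
-- sketch runs without any Snabb or OS dependencies.
local function wait_for_linkup (is_link_up, timeout, interval, now, sleep)
   local deadline = now() + timeout
   while true do
      if is_link_up() then return true end
      if now() >= deadline then return false end
      sleep(interval)
   end
end

-- Stand-in clock and a fake NIC whose link comes up after 1.5 seconds.
local clock = 0
local function now () return clock end
local function sleep (seconds) clock = clock + seconds end
local function fake_link_up () return clock >= 1.5 end

-- Link comes up within the 3 second budget.
print(wait_for_linkup(fake_link_up, 3, 0.25, now, sleep))
-- A link that never comes up makes the helper give up and return false.
print(wait_for_linkup(function () return false end, 3, 0.25, now, sleep))
```

A selftest could call something like this before starting traffic and report a clear "link did not come up" failure instead of a confusing packet-count mismatch.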

Harish, what is your goal here? If you want to run an application then I think you are okay to ignore this selftest failure. If you want to fix the selftest bug we can help you work out how to do that :).

harishiitd commented 8 years ago

Hi Luke, Thanks for the input. It helped me understand the Snabb 10G driver better. My goal is to fix the selftest bug and run it successfully with the X540-AT2 NIC.

When I tried to debug, I could see that there is a wait loop checking the linkup state, and we start traffic only after reading the linkup bit (bit 30) of the "LINKS" register in the init functions of both PF and SF. Code sample:

    local mask = bits{Linkup=30}
    if band(self.r.LINKS(), mask) == mask then
       return self

So my assumption is that link status is not the issue. Please correct me if I am wrong.
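For readers unfamiliar with the mask check in the code sample above, here is a small plain-Lua sketch of what it computes. The `bits` and `bit_is_set` functions below are simplified stand-ins written for this example (they use arithmetic rather than a bit library), and the register values are made up.

```lua
-- Stand-in for a bits{} helper: build a mask from a name -> bit index
-- table. Assumes each bit index appears only once, so adding is
-- equivalent to OR-ing the bits together.
local function bits (bitset)
   local sum = 0
   for _, n in pairs(bitset) do sum = sum + 2^n end
   return sum
end

-- Return true if bit n is set in value (arithmetic form of a band test).
local function bit_is_set (value, n)
   return math.floor(value / 2^n) % 2 == 1
end

local mask = bits{Linkup=30}
print(string.format("0x%08X", mask))     -- 0x40000000

-- Made-up LINKS register readings for illustration.
print(bit_is_set(0x40000080, 30))        -- true: bit 30 set, link up
print(bit_is_set(0x00000080, 30))        -- false: bit 30 clear, link down
```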

Harish

harishiitd commented 8 years ago

Hi Luke, It looks like the test cases involving VFs in selftest are failing (mq_sw and mq_sq). The test case for sending traffic between two SFs (sq_sq) is passing.

Is there any difference in VF initialization between the 82599 and the X540? Or is there anything that needs special care for X540 VFs?

Regards Harish

lukego commented 8 years ago

@harishiitd Good question. I need to make time to check into this.

harishiitd commented 8 years ago

Hi Luke, It looks like there is a difference in VF initialization for the X540. According to section "4.6.11.3.3 DCB-Off, VT-On" of the datasheet, the value to write to the RXPBSIZE[0] register is "0x60000" (0x180 << 10). For the 82599 this value is "0x80000" (0x200 << 10).

After making this change, all test cases in the driver selftest pass.

Here are my workspace changes:

diff --git a/src/apps/intel/intel10g.lua b/src/apps/intel/intel10g.lua
index 05852ed..d52504d 100644
--- a/src/apps/intel/intel10g.lua
+++ b/src/apps/intel/intel10g.lua
@@ -725,7 +725,7 @@ function M_pf:set_vmdq_mode ()
    self.r.MTQC(bits{VT_Ena=1, Num_TC_OR_Q=2}) -- 128 Tx Queues, 64 VMs (4.6.11.3.3)
    self.r.PFVTCTL(bits{VT_Ena=0, Rpl_En=30, DisDefPool=29}) -- enable virtualization, replication enabled
    self.r.PFDTXGSWC:set(bits{LBE=0}) -- enable Tx to Rx loopback

The final change could select the appropriate RXPBSIZE[0] value depending on the device model.
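A rough sketch of what such a model-dependent selection might look like, in plain Lua. The table name, the model keys, and the `rxpbsize0_value` function are hypothetical; only the two size values (0x180 for X540, 0x200 for 82599) and the shift by 10 come from the comment above. How the driver would actually identify the model is left open here.

```lua
-- Size field for RXPBSIZE[0] in the "DCB-Off, VT-On" configuration,
-- as quoted above. The model keys are hypothetical labels.
local rxpbsize0_field = {
   ["82599"] = 0x200,
   ["X540"]  = 0x180,
}

-- Return the full register value for a given model: field << 10.
-- Multiplying by 2^10 is the arithmetic equivalent of the shift.
local function rxpbsize0_value (model)
   local field = rxpbsize0_field[model]
   if not field then
      error("unknown NIC model: " .. tostring(model))
   end
   return field * 2^10
end

print(string.format("0x%X", rxpbsize0_value("X540")))   -- 0x60000
print(string.format("0x%X", rxpbsize0_value("82599")))  -- 0x80000
```

Failing loudly on an unknown model is a deliberate choice in this sketch: a silently wrong buffer size is what made the original selftest failure hard to diagnose.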

Regards Harish

lukego commented 8 years ago

That sounds like a great catch!!

Can you send this change in a Pull Request? You can put [wip] at the start of the title if the fix is not complete and you want some more time before it is merged.