linux4wilc / driver

DEPRECATED: Updated Linux drivers for the ATWILC1000/ATWILC3000 products are located at https://github.com/linux4microchip/linux/tree/master/drivers/net/wireless/microchip/wilc1000. To simplify development, the legacy Linux4WILC was merged into the Linux4Microchip repo where driver development continues (Please refer latest ATWILC1000/ATWILC 3000 Wi-Fi Link Controller Linux User Guide) Driver code for Microchip ATWILC Wireless Devices (ATWILC1000 & ATWILC3000)
https://www.microchip.com/wwwproducts/en/ATWILC1000
32 stars 20 forks source link

RX packets queuing up in wilc1000 firmware #25

Open HDC67 opened 5 years ago

HDC67 commented 5 years ago

This is going to be a hard one to explain so bear with me. It looks more like the firmware but with no documentation on the firmware or the exact interface it's difficult to debug any further. It may be a setting made by the driver. There are a lot of magic numbers in the driver.

SAMA5D42 with WILC1000 module. Kernel 4.9.52 with latest -dev kernel module and latest -dev firmware. Device is connected via 4 bit SDIO.

Basically when under some amount of load, received packets seem to get queued up in the firmware with no indication there is any received data available when looking at the interrupt pending register.

tcpdump.wilc1000 is captured on the unit with the WILC1000 module. Note this is filtered to only capture the iperf packets i.e. port 5001. tcpdump.wilc1000.pcapng.zip tcpdump.pc is captured on a Windows PC (running the iperf server) tcpdump.pc.pcapng.zip

PC is 10.1.2.8 WILC1000 is 10.1.2.109

The module is doing an iperf -c to the PC i.e. the WILC1000 is sending data to the PC via TCP the PC will mainly be sending back TCP ACKs. Unfortunately times are not synced however you can calculate the time difference at based on the SYN and SYN ACK packets at the beginning of the capture. The PC is therefore ~26.101s ahead so add to the wilc1000 or subtract from the PC packet capture.

In this case it starts well (sometimes it's the second or third ACK when it starts going bad) until packet #224 at the PC which is an ACK with Seq=213185, sent at 14:59:21.548 which is approximately 14:48:55.447 in module time. However this packet does not arrive until 14:58:58.107 i.e. more than 2.5 seconds late. By this stage TCP is timing out on ACKs and thoughtput falls apart. There is a large gap around 14:58:57 on the module where nothing is happening (no tx) as TCP is probably backing off a bit waiting for the ACKs.

Eventually something kicks it along and the module/firmware/driver recognises the packet is there and a whole bunch of ACKs come through (packets 241-248) in the module capture in one hit. From my debugging it seems it's the firmware holding them as I have the driver/SDIO system polling and SDIO_CCCR_INTx shows as nothing pending (0x00). Eventually it reads 0x02 and the driver finds all the ACKs queued up.

The ACK is not retransmitted by the PC so it must be sitting there in the module RAM and the firmware is not indicating it exists.

So called config packets appear to come though OK and one can be seen at 14:48:57 (line 2182 of syslog) but packets transmitted some 1.5-2 seconds before this are not available. The ACKs all appear at from line 2246 lining up with all the ACKs in the tcpdump at 14:58:58.

I have also a syslog with a large amount of debugging showing the ACK packets all coming through in a big batch at 14:58:58 module time. It includes things like the SDIO_CCCR_INTx register being polled. When there's only a single entry that means a real interrupt was generated. The optimisation at sdio_irq.c/process_sdio_peding_irqs() is removed so it always reads the pending value once even if there's only one interrupt source. Where there's a double read that means it polled. So the interrupt process seems to be working on the host CPU and simply the firmware is not sending the IRQ.

syslog.txt

HDC67 commented 5 years ago

Attached is a wifi monitor packet capture showing some transmits from the module which may be the root cause of the problem.

At packet 26612 you can see it's retrying. There is no SN=2928 packet received

Packet 26613 is a duplicate retry which was logged only 2us later (though this time difference may be capture related). Then there's another retry some 1.9ms later (packet 26616). Approximately 13.7ms later packet 26617 has dropped to 1Mb/s which takes 12.6ms to transmit. ~12.7ms later it tries again at 1Mb/s (packet 26618). ~13.4ms later it tries again (packet 26619). ~11.4ms later packet 26628 ~37.2ms later packet 26658 ~10.5ms later packet 26659 ~14.5ms later packet 26663

The module seems to be retrying the Wifi transmits very quickly. This doesn't seem right?

borkcapture01.zip

Mateusz-Gwara commented 5 years ago

The frames have a correct crc so they should NOT be retransmitted. The only thing that I miss is the response from the destination MAC.

Just to make clear because when sniffing WiFi you basically record all the traffic on your current channel: The device your WILC1000 (MAC f8:f0:05:d0:9e:be) tries to transmit data (sequence number 2928) to, is a HP device with this MAC: 1c:c1:de:34:01:97

Is it possible, that this is a printer that is in stand by?

I believe that's why the frame gets retransmitted, because i can't find a frame with an ACK set in the FCF (1101) between those retransmits.

Nevertheless, this does not necessarily mean, that this is ok the way it is. I would need to check back on that.

HDC67 commented 5 years ago

Thanks a lot for taking a look @ZonKnoZ

To clarify a few things: 1c:c1:de :34:01:97 is a HP workstation and that's a wired Ethernet MAC address f8:f0:05:d0:9e:be is doing an iperf transmitting to the HP workstation (HP is the iperf server). In between the two is a TP-Link access point with MAC address ac:84:c6:75:1f:e2 with a wired Ethernet connection to the HP workstation through a single switch.

We are capturing in the air using an RTL8812AU device so seems we are missing the WiFi ACKs. However my knowledge of WiFi acknowledgement process is limited.

So no, not a printer in standby and not on wireless. The Wifi acknowledgements seem to be missing as you say and as such the module keeps repeating.

However the module seems to be retrying very quickly given the time taken to transmit the packet. There's very little room for AP to acknowledge. I'm not sure if this is correct behaviour.

Mateusz-Gwara commented 5 years ago

Thank you for making things clear. I'm trying to reproduce this scenario because what I encounter is a complete freeze of my system when running with the wilc_spi module active.

In this state I can see the number of interrupts fom the wilc interrupt pin going up (the same ammount as the ones from the corresponding gpio controller) but counting traffic or even a simple "ifconfig" simply freezes my console (i.e. "ifconfig &" never finishes).

root@casenio:~# cat /proc/interrupts CPU0 16: 25871159 INTC 68 Level gp_timer 18: 0 INTC 3 Level arm-pmu 20: 583020 INTC 12 Level 49000000.edma_ccint 22: 3328 INTC 14 Level 49000000.edma_ccerrint 26: 0 INTC 96 Level 44e07000.gpio 27: 0 INTC 98 Level 4804c000.gpio 28: 115812 INTC 32 Level 481ac000.gpio 29: 0 INTC 62 Level 481ae000.gpio 30: 13380 INTC 72 Level OMAP UART0 31: 2043721 INTC 73 Level OMAP UART1 33: 669 INTC 46 Level 34: 1633172 INTC 70 Level 44e0b000.i2c 35: 13 INTC 64 Level mmc0 36: 2035132 INTC 28 Level mmc1 44: 0 INTC 75 Level rtc0 45: 47 INTC 76 Level rtc0 47: 0 INTC 16 Level 44e0d000.tscadc:adc 50: 0 INTC 111 Level 48310000.rng 51: 0 44e07000.gpio 21 Edge NOTRUFTASTER 63: 115812 481ac000.gpio 24 Level WILC_IRQ Err: 0 root@casenio:~# rmmod wilc_spi & [4] 23991 root@casenio:~# Error: Module wilc_spi is in use ^C [4]+ Exit 1 rmmod wilc_spi root@casenio:~# rmmod -f wilc_spi & [4] 23992 root@casenio:~# libkmod: ERROR ../libkmod/libkmod-module.c:753 kmod_module_remove_module: could not remove 'wilc_spi': Device or resource busy Error: could not remove module wilc_spi: Device or resource busy

I also noticed that probe requests are replied to several times with the same sequence number (doesn't seem to be the case in your capture).

wilc3k-probe-response.zip

I'm aware, that a probe response might be the most unreliable example as clients scanning the whole rf range won't necessarily recieve the response in time before changing channels to send an ack.

MTaulin commented 5 years ago

I got an similar issue to this one. I do an iperf3 from my computer(client) to an custom imx6 board running the WILC1000 device(server) and after some time (not that short like at @ShonkyCH 's case) the connection getting incredibly slow. Im using an tp link TL-WR940N as accesspoint. After some time i can see that the bitrate is constantly falling from about 30 Mbits/sec to about 2Mbits/sec or less (see first attached screenshot). In this case it was just for few seconds but i also got this state for about several minutes or until stopping iperf3. So sometimes it getting back to normal by itself and sometimes not. On the computer it looks like shown in the second attached screenshot. I also checked with wireshark on the pc and i saw that acknowledges from the tcp packages are not sent from the client in time. After some seconds the acknowledges of more packages receiving all at once. This leads to that behavior of the pc sending packages to the client with some breaks in between. Also on screenshot one you can see that the device continuously receiving packages while the pc isn't sending at this time. But the amount is also quite less than sent. I checked the output of the packages via wireshark so they got send out correctly. Also there is not much traffic on the router which could lead to this. There are no lost packages i got all acknowledges just later than expected. So i guess there is something in the driver/firmware which is holding back the packages (buffering / processing) which leads to the late acknowledges and so to the slow bitrate. In the opposite way i don have this Problem (sending from the board to the pc).

client_board pc_to_board