Closed tsifb closed 5 years ago
Microchip Salesforce case number 00374724.
Thank you very much for your effort and insight!
We have basically the same issue rendering any WiFi Network in range unable to send data or even be seen, as soon as ~10 WILC3000 access points are switched on (even without traffic). And I'm talking about any Network, not just the WILC access points.
Our Microchip Salesforce case number is 00328657 (Date/Time Opened | 9/6/2018 10:39 AM) where we also supplied packet capture data. Unfortunately Microchip wasn't able to reproduce this until now. So, let's hope your input allows to shed some light on this issue!
If not, I'm considering reporting this "network jamming" feature to the German RF-network authorities so they can effectively ban the hardware from being sold. It's just 1.5K bucks. Maybe that will motivate enough to try to find anything ;)
@ZonKnoZ We would be interested in seeing your issue report details posted here too. Please share the details, including packet capture files, so that when Microchip eventually post a "fix" we can all test it knowing the likely failure modes.
I have suspected that this issue will be provoked by any AP in range, even with the WILC1000 stations not connected to it, however we have already spent vast amounts of time attempting to diagnose this problem. I do not have time to allocate to further digging.
It sounds like your experience has been similar to ours. This is a very simple setup to recreate and while the details of the problem are difficult to determine, the effect is very very obvious. Hopefully our combined efforts will motivate an appropriate response.
Thanks @tsifb for the detailed report. The salesforce team will analyze it and take appropriate action
@AdhamAbozaeid I see they have replied in Salesforce saying they will investigate. I hope that will be given high priority.
Are there any plans to open source the firmware for the WILC modules? While we do not really want to be inside the module firmware, it might be useful for our own wifi experts to be able to inspect the code. If not open source, is there possibility of getting access under NDA so we can assist directly?
Currently the issues with this wifi module are stalling our product release, and it is now critical path for us.
@tsifb , there are currently no plans to open source the firmware. Please mention the criticality and its impact on your production schedule in the ticket so that the salesforce team can prioritize it with respect to their input queue.
I can confirm that something happens when more than one WILC1000 device is in one wifi network. It's not that critical like mentioned from the threads above but i also get some suspicious packages when im using more than one wilc1000 device. I a use custom imx6 board with wilc1000 and Tp-link accesspoint. In my case im not sure about the packages because the "source" and "destination" are both from the tp-link but those always just happening when two or more devices with the wilc1000 are connected. The packages are CF-End control-frames and also coming in bursts in an interval of a few seconds but not that much to completely drop the data rate. Also i get some of those packages with 00:00:00_00:00:00 as source and destination address. I also connected two more devices into the wifi network and can see that the amount of those packages also raises so i expect that as the reason for those packages
@MTaulin your observations match ours.
Also, If you closely inspect the on-air traffic with a single WILC1000 active, there will still be erroneous packets and you will observe some incorrectly re-transmitted legitimate packets. This does slow the throughput from what the device is capable of. Using firmware and driver from more than 12 months ago, we saw significantly higher throughput - however there were other serious bugs that made that combination unusable.
We also observed packets with source and destination addresses that matched the AP (TP-link) We proved without doubt that these packets were NOT transmitted from the AP, but in fact were sent by the WILC1000 modules. You can prove this using a packet sniffer (we use rtl8812AU device). If it is located very close to the wilc1000, and distant from the AP, you will be able to observe different per packet RSSI per transmitter. Strong RSSI will indicate WILC1000 is transmitting. (radio tap header includes RSSI value)
@AdhamAbozaeid I have had no solid feedback via salesforce (7 days)
Can you raise this internally at Microchip to progress the investigating with high priority? @MTaulin's report shows the same symptoms. @ZonKnoZ's issue is likely closely related or the same.
@tsifb the support team is currently working on reproducing the issue and will get back to you soon with feedback. Thank you for your patience.
@tsifb
@rocky134 mentioned our problem in issue #36 as well
meanwhile we are making plans to drop support of the WILC3K altogether and switching to an usb dongle. The same setup we have simply works as you would expect WiFi to work with any other WiFi chip (atheros, Ralink, you name it).
The sad part is that we are going to ditch 1K+ WILC3000 chips due to this issue...
@tsifb you can contact me on my github email adress to discuss the issue After 1 Year of problems with this chip I simply don't expect things to magically change...
@ZonKnoZ , Work on WILC is prioritized according to the open tickets on salesforce. There have been a lot of improvements and bug fixes implemented since release 15.0, and more stability implemented on 15.2 (current dev branch) that's being released on February 11th. For the jamming issue, the salesforce team was able to reproduce the problem as I mentioned last week, and as soon as we find a proper fix it will be submitted to the dev branch.
Thanks, Adham
Thanks for the update. You as a developer have more current info on our tickets issue than we get via Salesforce.
This is very important info because we need to handle this situation and base our business decisions on something...
@ZonKnoZ Please share your findings on alternate parts. We are in the same position as you, and are considering alternatives as well. We so have a spare USB port that connects out through small daughter board which we could probably remake with an USB connected wifi chipset. For our application, we would need significant volume of parts for mass production, therefore a usb dongle arrangement is unlikely to be suitable. We also require both AP and Station mode operation, preferably concurrently. I have seen some alternatives using the Atheros AR9271, which I have used in a previous company / product. The driver did get some attention for our specific use, but overall it was very reliable and worked well. We would also consider an SDIO connected module, with some type of adapter board, or making our own module to be footprint compatible with the wilc1000 module - this is within our capabilities, however a late stage change such as this is very expensive to our business.
@AdhamAbozaeid, are you directly involved in or with the Engineering team working on this firmware issue? Unfortunately we do not see much movement in Saleforce. They only response in the last 15 days being that the team has reproduced this issue and is working on a fix.
We would appreciate some more frequent updates with some details of your progress towards a fix.
Everyone following this ticket understands that Salesforce tickets need to be prioritized. Can you tell us all what priority level has been placed in this issue?
Also, we are watching GitHub closely (both Driver and Firmware repos), currently running the latest from the dev branches, including the latest firmware 15.2 release candidate. We are able to update to any new commits and re-test very quickly, if this is of any assistance.
@tsifb Here is a dump I submitted to our Issue on salesforce on 11/5/2018 4:36 PM capture-of-l2-wilc3k-connection-filtered.zip
There are several WILC3000 APs active and I see quite a few retransmissions(starting with Frame 100).
Nevertheless it would be nice to have a software solution, we could distribute OTA instead of replacing hardware. We also are running the latest driver/FW commits from the dev branch.
The USB replacement ist just a measure to "fix" the problem of already produced units. We are validating alternatives (SPI/SDIO) for our next production run as well to be able to switch if this issue is still there until then.
@tsifb My team is responsible for the driver/FW development for WILC1000/WILC3000. We started working on this issue few days ago. I'll post updates once I have reasonable findings. Thanks for offering assistance. I'll let you know if I needed more logs or maybe to validate a fix
@ZonKnoZ the content of your capture is very very similar to my single unit capture. Notably, the burst of retransmitted packets starting from packet 100. Note the time interval between packet 100 and 101 is very short.
@AdhamAbozaeid note this characteristic. In this case, the WILC module does not send erroneous CF-END frame, but re-transmits a (valid?) packet several times (with the R flag set indicating a retry, and the same sequence number as the original).
While the on-air characteristics are different to the multi unit CF-END bursts, my experience and best guesswork tells me that this is likely the same root cause.
@tsifb , this is a normal behavior at packet 100. WILC will retransmit unicast packets if the receiver didn't send back an ACK frame
@AdhamAbozaeid what is the timeout in microseconds before a Probe Response retry is sent? Observation is that the WILC module sends retries after a much shorter period than other devices (as seen in this @ZonKnoZ 's capture)
Also, in my Single unit iperf capture, there are significant numbers of QOS-Data frame retries. Notes: 1, these are also very small deltas. 2, they are variable in number 3, mostly followed by a WILC module sending a CF-END frame Given these observation, I suspect this is very closely related to the flooding issue with multiple devices.
@ZonKnoZ can you tell us your setup to capture these packets, including the hardware device you used? I took some care to make sure that my channel was (relatively) quiet, and that the sniffer was only using a single channel. I suspect from looking at your log, that some packets are missed due to the sniffer scanning other channels. Notably there are not many ACK packets recorded. This was not easy to do. Something like: kill off Kali's Network Manager, manually setup wpa supplicant to a single channel scan, connect to the AP, then kill wpa_supplicant, set monitor mode, run wireshark.... I don't recall exactly the steps, but it took some work to get it right.
@tsifb I can see the CF sequence happening even after probe responses directed from the AP to other station, not only after re transmissions from WILC Also please note that at the level microseconds, the air sniffer will not be very accurate. You would need a high speed scope to get accurate interframe spacing times. On my sniffer, I see the retries varies from hundred microseconds to few milliseconds both for the AP and for WILC
@AdhamAbozaeid I understand the timing may not be accurate. I don't have immediate access to the hardware needed to get more accurate timings. After actually doing the calculations , I got around 340us for an Ack response time at 1Mbps.
What is the actual timeout value / calculation used in the WILC firmware?
I did notice the WILC management frames in @ZonKnoZ 's capture were at 6Mbps. I expected these to be 1Mbps. I'm not sure if this is another useful observation. Unsure of the exact configuration in his case. It might be worth reviewing the tx data rates used by the WILC module for the management frames.
Regarding the CF-End Frames, note the signal strength in the radio tap header. These are ALL very strong (-26 to -28dBm), so must have been transmitted by the WILC module. (Ignore the MAC addresses, which I believe are most likely incorrect) My understanding from @ZonKnoZ 's previous comment is that the WILC tx'ed CF-End frames will happen even if they are not connected to an AP. This probably indicates that frames directed to other stations could likely be triggering the WILC to tx these frames. I haven't done any testing on this specifically, but it is worthwhile looking at this.
Maybe configure a WILC to not actively probe, and configure it to connect to a non-existent SSID, single channel (which should mean the WILC is quiet). Monitor it to see if it can be triggered into sending erroneous frames by other activity on the same channel from a network using a different SSID?
I would like to help out directly with this, but I am unable to do so right now.
@tsifb
The chip I use for scanning is:
ID 0cf3:9271 Atheros Communications, Inc. AR9271 802.11n
The hostapd AP config is done as follows:
interface=wlan0 driver=nl80211 country_code=DE channel=".(int rand(8)+3)." auth_algs=1 hw_mode=g max_num_sta=16 wpa_group_rekey=7200 wpa_ptk_rekey=7200 wpa_gmk_rekey=86400 macaddr_acl=0 wpa=2 wpa_key_mgmt=WPA-PSK wpa_pairwise=CCMP rsn_pairwise=CCMP ssid=".$ssid." wpa_passphrase=".$pass."
I was scanning at work with 10 WILC APs active. I was fixed to one channel as described in these instructions: WLAN Sniffer.pdf ^^these were the steps Microchip support wanted me to perform in order to have a valid scan result
It is hard to find a spot with less than 50 acess points active in the neighbourhood and I have no faraday cage available but I might try to find a spot less noisy. Sorry for being so unresponsive right now, but we are quite busy atm. Still, I'm willing to help where I can in order to get this issue resolved.
@tsifb, the ACK timeout ranges between 33 to 207 usec. There shouldn't be a problem with using 6Mbps to transmit management frames. I even think 1Mbps would be more prune to collision, and hence might required CFP.
I have also changed the location of WILC a few times and I can see the RSSI captured by the sniffer changes proportionally, so I agree that the packets might be coming out of WILC
Given the relative RSSI difference in my captures, it is proven without doubt that the packet comes from the WILC module.
I see @ZonKnoZ 's configuration has hw_mode=g which explains the mgmt frames sent at the lowest g rate.
@ZonKnoZ I have used the AR9271 in another product. We considered testing it in this product - but decided on another device. We have samples enroute to us now. We intend to integrate its driver and initially use this a test comparison against the WILC, with some packet captures that we can all use as reference.
@AdhamAbozaeid today is day 26 since reporting this issue. Can I get an update on your progress so far, and an estimated date on a new release that includes a fix?
We have product waiting to to ship, and new production runs are on hold.
It's 190 days for me, since I communicated this issue towards microchip the first time.
I know that being asked about progress migt be stressful while coping with complex dev/debug tasks, but the same goes for waiting without any feedback.
@AdhamAbozaeid I'm really sorry that you have to cope with customers as well as with development at the same time.
It seems that you are a lone wolf @microchip handlig this, which I hope is just a false&provocative assumption causing an official statement on this matter that might help handling this from a business perspective...
@AdhamAbozaeid I see you just released the firmware 15.2 - does this include a fix for this issue? I don't see any additional commits.
@ZonKnoZ, @tsifb release 15.2 doesn't include a fix for this issue. This is the same as the dev branch that was under QA to be released.
I assure you that this issue is currently getting highest priority, so thanks for understanding. Usually higher layers of support at microchip takes care of handling customer issues, but this one looks too deep and has to involve multiple development teams.
So far, I was able to reproduce the problem, and I could confirm that these packets aren't generated nor received from the FW, same as all medium control packets that requires accurate timing that can't be handled by SW. For now, I'm working with the HW team to understand how these packets are generated, and if the FW is mis-configuring any options related to CF-end packets.
As for the estimate for a fix, it will be hard to give an accurate estimate given the current information I have for now, but I'd say it would take roughly 2\~3 weeks to find a fix, plus 2\~3 more weeks for QA sanity checks for an engineering release with a fix to be available. Once we have more information, I'll send you a fix for validation, and will get you more accurate information about the engineering release availability.
Why release firmware when this issue's root cause is still unknown? I find this unacceptable. Your QA Process is obviously insufficient if you are proceeding to release with such a serious issue outstanding. Especially one that has been reported by multiple customers and you have reproduced.
Let me be perfectly clear: The WILCx000 modules running this combination of driver and firmware are UNUSABLE in any real-world product.
Moving on... Have you attempted to run release 14.x firmware (or other pre-15.2 releases) and tried reproduce this problem? While we have not done this ourselves, we did see much higher throughput. This may be due to the erroneous packets not being sent, and would prove that recent firmware is at fault. This test would be valuable to understand the problem, however is not a good workaround for us due to other assorted issues that existed in earlier firmware versions.
I would appreciate a pre-QA binary so that I can test in parallel with your own QA. I can commit to testing an engineering release and providing feedback quickly.
Please keep us updated here regularly - at least weekly - with your progress... I have customers and internal stakeholders to keep updated.
Why release firmware when this issue's root cause is still unknown?
Because the new release has fixes for other issues that were already queued to be released, and which other customers have been waiting for. Also since the new release is technically better than the previous one, I don't see anything unacceptable in proceeding with the release, since it doesn't affect our investigation on this issue! Also, stressing over and over that the driver and firmware are UNUSABLE isn't changing how this issue is being handled. I confirmed several times that we acknowledge this issue, and are working on finding a fix with high priority. I also confirm that we have customers already in production that weren't affected much by this issue, since it needs a few WILCs in the same location for the packets to affect the connectivity. Let's focus more on understanding the root cause and finding a proper fix, since we are already aligned on the priority of the issue. Checking previous release is actually a good idea. So far we have observed this problem in releases 15.2 and 15.1. The only challenge here is that release 14.x is too old, and the changes happened since then are results of almost 3 years of development, but is still worthy to check. On my side, I haven't seen much drop on in our throughput tests which we run with 1 WILC device, so UPD downlink for example is in range of 45~47Mbps. Which release are you getting higher throughput with? how does the throughput numbers compare to 15.2? and what's your setup's details (number of WILCs, level of activity on each device, throughput on each device.. etc?
Once a fix is available, I'll send it to you to validate, and make sure it fixes the issue you are seeing as well, then proceed with the QA sanity checks and inform you about the engineering release schedule, and the official release afterwards.
Feb 13 02:51:05 9999000003 kernel: wilc_sdio mmc1:0001:1 wlan0: INFO [wilc_wlan_initialize]WILC Firmware Ver = WILC_WIFI_FW_REL_14_04 Build: 8556
We found a v14.04 firmware version and tested it. 1, It sends CF-End frames similar to v15.2. Confirmed with same packet sniffer + RSSI test as before. 2, throughput performance with just a single unit is similar to v15.2
I expected this version to not exhibit this same issue. We are going to re-confirm this setup. We will have some 15.2 firmware units operating within range. I will attempt to power all of those down an retest.
Our recollection of better throughput was probably due to only a small number of prototypes operating. Just one or two units were used during early development.
@AdhamAbozaeid Can I get an update on your progress, and an estimate on a fix.
@tsifb , we now have an understanding on how these re-transmissions happen, and are investigating more on the root cause, and a proper fix.
Good news! Thanks for the update!
Thanks for the update @AdhamAbozaeid !
Just compiled the latest dev driver for a 4.9.59 kernel Running with the latest dev firmware it at least seems to load. This comes up as the FW version:
[wilc_wlan_initialize]WILC Firmware Ver = WILC_WIFI_FW_REL_15_2_RC3 Build: 11045M
I started testing the setup and will try to replicate our previous problems. Uptime is still high after loading the wilc module though:
15:44:06 up 32 min, 1 user, load average: 2.93, 2.77, 2.19
with a single device traffic seems to flow ok:
wlan0 Link encap:Ethernet HWaddr f8:f0:05:c0:2a:1c inet addr:192.168.0.1 Bcast:192.168.0.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:372 Metric:1 RX packets:287085 errors:0 dropped:0 overruns:0 frame:0 TX packets:549804 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:20031590 (19.1 MiB) TX bytes:210797809 (201.0 MiB)
Unfortunately the whole system froze shortly after transferring those ~200MB with no output I could capture leading to this.
Would the wilc debug port output help if i reproduce and record this?
The stability and congestion tests with multiple devices will take a bit more time.
@AdhamAbozaeid which kernels are officially supported (the Linux user guide describes only 4.9 for SAMA5D4)?
@ZonKnoZ , the changes submitted to the dev branch doesn't have fixes for the CF end problem, nor the uptime yet. Can you describe the steps you are doing to reproduce this freeze?
We use SAMA5d4 with kernel 4.9 for testing, but note that the dev branch is fully tested at official releases. All intermediate commits are sanity checked only.
All, the attached firmware has a tentative fix for the CF-End flood packets. You can use it to verify the test as we previously discussed. Please note that this FW is for testing purposes only, and isn't ready for release yet. wilc1000_wifi_firmware.zip wilc3000_wifi_firmware.zip
@AdhamAbozaeid , here we go:
interface=wlan0 hw_mode=g channel=3 ieee80211d=1 country_code=DE ieee80211n=1 wmm_enabled=1 auth_algs=1 wpa=2 wpa_key_mgmt=WPA-PSK rsn_pairwise=CCMP ssid=test123 wpa_passphrase=test123!
Then I just connect a tablet to the AP and start to stream a youtube video. Freeze comes after a while.
Btw. I'm aware that this is a bleeding edge test case here + I use a WILC3000 , so sorry but can't test it.
@ZonKnoZ , Do you see the same issue with the master, or on the dev only? Do you have the driver/FW logs? Also, please open a new ticket for this one, since it's not related to the CF End packets problem
We have built in the build 11050 firmware and are testing. So far, initial observation is that the flood of CF-END does not occur. Now testing with more units powered on, and using our typical application use-case.
@AdhamAbozaeid Are there additional issues / fixes that you are aware of that will be included in the next release?
Most of the changes will be in the driver to align with WILC's driver on the kernel staging tree, in addition to:
@AdhamAbozaeid My question was about the next firmware release, although if there are driver changes that need to also be included at the same time - that would be useful to know.
I am trying to assess which combinations we should be focusing on testing, and how much effort to burn on them.
I see there are a bunch of new driver dev branch commits. We are also working in integrating these into our latest internal dev branch, which is on a slightly newer linux kernel. We do not want to change too much - so we can retest as close as possible to the original setup to provide quality feedback to you, however our eventual target will include a newer kernel and important wilc driver updates.
Your input would be helpful on this.
As part of the testing, we will be carefully observing the debug output from the driver and WILC module serial port - if we notice anything suspicious, we will open new issues here to get your feedback.
Testing Update: Yesterday with 5 units connected to the AP has shown a large improvement in usability. I do have some concerns about performance where the AP to Station signal strength is lower (~-75dBm) - the throughput drops more than I expected, even at the reduced data rate. More investigation is needed here.
Also the ability to maintain a somewhat usable ssh connection to one unit, while a file transfer / iperf is operating with another unit is not what I expect. It seems like the file transfer (TP-link AP-> WILC Station) does significantly delay packets to other units (on the order of several seconds to 10s of seconds.) This may impact our factory use-case, where commands need to be remotely executed in units, while others are downloading large firmware images. This may be a separate issue, but I thought it was worth mentioning early.
Further testing in the next few day: 1, more units + throughput to one unit 2, more units + multiple concurrent high throughput operations ( file transfers ) 3, more units + remote shell + file transfers 4, concurrent WILC AP and WILC STA operation on one unit. 5, testing latest driver dev branch + linux kernel 4.14 During this testing we will be trying to get an understanding of reliability as well. (Power cycling in our end use application is a very rare event, and is difficult to perform)
WILC FW and driver releases are always aligned. So far, the changes done on top of 15.2 for the driver and firmware are independent, so you can use the fix for this issue with the driver from 15.2, without having to integrate driver changes from the dev branch
@AdhamAbozaeid did you build a new WILC3000 binary set too? @Mateusz-Gwara may be able to test as well?
@Mateusz-Gwara we would appreciate your eyes on this new firmware too, if that is possible ;)
@AdhamAbozaeid ok. thanks for that information.
@tsifb , I just added the WILC3000 FW as well to my comment above
Summary
The WILC1000 module is occasionally transmitting large volumes of erroneous packets in bursts even when a device should be almost completely idle. The problem is particularly more prevalent when more than one WILC1000 devices are associated with the same AP as if they are interacting with each other, and trigger this condition.
The setup is a TP-link or Netgear AP, with WILC1000 modules on custom PCBs connected to SAMA5D42 running linux kernel 4.9.52 and the linux4wilc driver.
The WILC1000 modules are configured only as Stations. (no AP is configured in the WILC1000)
The problem has been initially observed as intermittent poor throughput when loading ~100MB root filesystem images into the custom PCB over Wi-Fi. The problem appears to be much worse when multiple units are operating, even when only loading the file into one unit at a time.
We have made a very detailed analysis of the Wi-Fi behaviour, including using an independent Wi-Fi packet capture device (sniffer), a spectrum analyser, and monitoring the WILC1000 supply current to see when it transmits.
Analysis
Packet Capture file notes:
Setup:
Filename: multiplewilc1000-no-traffic.zip
Multiple WILC1000 stations connected to AP, with no other additional higher level traffic (no pings, no iperf, no file transfers)
Frame 54: This is correct addressing for CF-END frame, however most probably send from Other WILC1000 due to -82dBm signal strength, and is out of context (no other PCF packets seen)
Frame 55: First Erroneous CF-End Frame.
Frame 56: Second Erroneous CF-End frame.
Frame 57: Third Erroneous CF-End frame, however is sent my another WILC1000.
Frames 58 - 1893: alternating erroneous CF-End frames transmitted by WILC1000 unit, as seen by inspecting rssi values.
Throughout the rest of the capture log:
Filename: singlewilc1000-iperf-udp30M.zip
Packet capture of a single WILC1000 sending UDP frames from iperf.
When comparing the previous packet capture with a single WILC1000 connected to the SAME Access Point, the blocks of rapidly repeating CF-End frames do not occur, and observed file transfer throughput is much higher, although still not as high as should be possible.
There are lots of re-transmitted frames. for example, Frames 91 and 92 are re-transmissions of Frame 90.
Frame 93: Decodes as a CF-End frame, however is transmitted by the WILC1000 station. CF-End should only be sent from the AP. Note: RSSI is -26dBm, so must have been transmitted by WILC1000
CF-End frames are transmitted by WILC1000, however there are no other PCF packets sent. This indicates that PCF function is not being used, however WILC1000 seems to be transmitting CF-End frames.
Conclusion
When Mulitple WILC1000 Stations are close, they trigger each other's firmware into transmitting junk packets very rapidly. A single WILC1000 will not flood the network, however does send incorrect junk frames and will re-transmit many otherwise valid frames, sometimes so quickly that an acknowledgement can not have been received.
This issue renders the WILC1000 unusable, or at best - when a single Station is used - poor performance.
We have spent considerable engineering resources on investigation of this and other previous WILC1000 issues. We require urgent and dedicated attention from Microchip engineers to fix this serious issue.
A Salesforce issue will be opened referencing this report.