qca / open-ath9k-htc-firmware

The firmware for QCA AR7010/AR9271 802.11n USB NICs
Other
434 stars 183 forks source link

watchdog problems #25

Open olerem opened 11 years ago

olerem commented 11 years ago

I started new issue related to watchdog. Other issue may relate to this one or may not. Until we get some confirmation, they are not ;) Now after UART seems to work OK and we have DBG interface. I did some tests with datchdog. Use "wdt rst" command to force watchdog initiated reboot. After this reboot i got this log:

>wdt rst
 reseting...
 - Boot from SFLASH, CDROM Bootcode version - 1.1.11

At this step it stops. It looks like Bootcode from SFLASH can't handle it. If i start this adapter from power supply, without usb datapins connected, i get this:

 - Boot from SFLASH, CDROM Bootcode version - 1.1.11
6. usb_hclk rdy
 Magpie_init 
8. wait for read/read_comp.... 
olerem commented 11 years ago

Here some more experiments with it. I add printks to urb handling functions: [31498.324829] ath9k_hif_usb_rx_cb --- OK [31498.417871] hif_usb_regout_cb --- OK [31498.478724] hif_usb_tx_cb --- OK --- here we got watchdog reset [31508.227795] ath: phy12: Unable to remove station entry for: 74:31:70:xx:xx:xx [31510.328868] hif_usb_tx_cb --- -ENOENT ... after this moment it can take long time untill some thing will happen. Only way to bring host system back is to unplug usb adapter.

I assume we should do some thing different on hif_usb_tx_cb if -ENOENT happens. Any usb experts here?

olerem commented 11 years ago

For some reason exception reset works better then reset initiated by watchdog. Adapter was able to start from EEPROM, but it can't load firmware.

erikarn commented 11 years ago

Can you be more specific here about what happens with each? I don't know how to read "able to start from EEPROM" - is that the watchdog? The exception reset? etc.

olerem commented 11 years ago

So, this cpus have, like other, two stages boot sequence:

Only after bootloader configured cpu, we will be able to upload firmware.

This CPU has watchdogtimer. Our firmware make use of it. If whatchdog is triggered, it will restart cpu, so it will go to 1. stage - bootlaoder (which is located on eeprom). Like some other CPU, for example MIPS, this one can trigger exception on some other events. For example on memory access. I assume it is related to kernel mode and user mode access. If i try from dbg_api read some memory register, which is for kernel mode, i do it probably from user mode. This will generate exception and cpu will reboot. In second case, it will start not from default address on eeprom, but at some offset, which is designed to print cpu register and then continue normal boot.

Now to what i wrote in my previous posts. If CPU will do restart initiated by watchdog, it will hang some where on first stage. In this case we wont be able to upload firmware. I think it is a bootloader problem... not a CPU. If CPU will do restart initiated by memory access exception, 1. stage will be able to start and get ready for firmware upload.

In both cases ath9k_htc is not ready to handle it.

Probably best solution is to move watchdog to the driver. If we will get timeout, driver will should poweroff adapter and reinit it.

erikarn commented 11 years ago

Right, this makes sense. It's a shame that the ROM code hangs from a watchdog timeout though. Sigh.

Can you document how you reproduced it with a bit of code? That way others can repeat/retest with your patches.

If this is truely a bug that we can't fix, then we should add it to an errata list somewhere so others don't bump into it.

And yes you're right. The timeout should cause the port to be powered off/on if possible.

Adrian

olerem commented 11 years ago

It can be easily reproduced without extra code from dbg interface. But since most people do not like to play with hardware we should probably provide WMI command for dbg interface.

KrasnikovEugene commented 11 years ago

I am not sure if the problem i see has something to do with WD but let me share what i can reproduce. Issue is similar to https://dev.openwrt.org/ticket/9654#no1

after data transmitting for couple of hours i see the following kernel logs: [27858.993043] cfg80211: All devices are disconnected, going to restore regulatory settings [27858.993050] cfg80211: Restoring regulatory settings [27858.993057] cfg80211: Calling CRDA to update world regulatory domain [27858.995057] cfg80211: Ignoring regulatory request Set by core since the driver uses its own custom regulatory domain [27858.995062] cfg80211: World regulatory domain updated: [27858.995064] cfg80211: (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp) [27858.995068] cfg80211: (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm) [27858.995071] cfg80211: (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm) [27858.995074] cfg80211: (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm) [27858.995077] cfg80211: (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm) [27858.995080] cfg80211: (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm) [27859.303065] ath: phy0: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x02000020 DMADBG_7=0x00008040 [27859.559286] ath: phy0: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x02000020 DMADBG_7=0x00008040 [27860.920581] wlan4: authenticate with 54:04:a6:8f:e9:ce [27861.038104] wlan4: send auth to 54:04:a6:8f:e9:ce (try 1/3) [27861.240066] wlan4: send auth to 54:04:a6:8f:e9:ce (try 2/3) [27861.449927] wlan4: send auth to 54:04:a6:8f:e9:ce (try 3/3) [27861.659837] wlan4: authentication with 54:04:a6:8f:e9:ce timed out [27862.019161] ath: phy0: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x02000020 DMADBG_7=0x00008040

Issue is 100% reproducible after transferring data using iperf or any data generator for couple of hours. Will keep investigating this problem.

erikarn commented 11 years ago

Does it recover from this?

This is a DMA error, similar to what non-htc nics see.

adrian

On 8 May 2013 08:19, KrasnikovEugene notifications@github.com wrote:

I am not sure if the problem i see has something to do with WD but let me share what i can reproduce. Issue is similar to https://dev.openwrt.org/ticket/9654#no1

after data transmitting for couple of hours i see the following kernel logs: [27858.993043] cfg80211: All devices are disconnected, going to restore regulatory settings [27858.993050] cfg80211: Restoring regulatory settings [27858.993057] cfg80211: Calling CRDA to update world regulatory domain [27858.995057] cfg80211: Ignoring regulatory request Set by core since the driver uses its own custom regulatory domain [27858.995062] cfg80211: World regulatory domain updated: [27858.995064] cfg80211: (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp) [27858.995068] cfg80211: (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm) [27858.995071] cfg80211: (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm) [27858.995074] cfg80211: (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm) [27858.995077] cfg80211: (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm) [27858.995080] cfg80211: (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm) [27859.303065] ath: phy0: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x02000020 DMADBG_7=0x00008040 [27859.559286] ath: phy0: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x02000020 DMADBG_7=0x00008040 [27860.920581] wlan4: authenticate with 54:04:a6:8f:e9:ce [27861.038104] wlan4: send auth to 54:04:a6:8f:e9:ce (try 1/3) [27861.240066] wlan4: send auth to 54:04:a6:8f:e9:ce (try 2/3) [27861.449927] wlan4: send auth to 54:04:a6:8f:e9:ce (try 3/3) [27861.659837] wlan4: authentication with 54:04:a6:8f:e9:ce timed out [27862.019161] ath: phy0: DMA failed to stop in 10 ms AR_CR=0x00000024 AR_DIAG_SW=0x02000020 DMADBG_7=0x00008040

Issue is 100% reproducible after transferring data using iperf or any data generator for couple of hours. Will keep investigating this problem.

— Reply to this email directly or view it on GitHubhttps://github.com/qca/open-ath9k-htc-firmware/issues/25#issuecomment-17612599 .

olerem commented 11 years ago

I think we should keep this two issue separate for now. Currently i can't reproduce this crash. I tested ar7010+ar9280, it was streaming data two days. And now i'm testing ar9172. Running time is: [ 3] 46885.0-46890.0 sec 32.9 MBytes 55.3 Mbits/sec and it is still working. I think the problem should be in error handling or rate control. I have almost no concurrent stations nearby, so probably this is why it works for me.

erikarn commented 11 years ago

Agreed.

adrian

On 8 May 2013 22:13, Oleksij Rempel notifications@github.com wrote:

I think we should keep this two issue separate for now. Currently i can't reproduce this crash. I tested ar7010+ar9280, it was streaming data two days. And now i'm testing ar9172. Running time is: [ 3] 46885.0-46890.0 sec 32.9 MBytes 55.3 Mbits/sec and it is still working. I think the problem should be in error handling or rate control. I have almost no concurrent stations nearby, so probably this is why it works for me.

— Reply to this email directly or view it on GitHubhttps://github.com/qca/open-ath9k-htc-firmware/issues/25#issuecomment-17649889 .

KrasnikovEugene commented 11 years ago

Nic does not recover from DMA issue. May be the problem is with my particular nic. I'll keep investigating the problem. Agree to trace those issues separately.

erikarn commented 11 years ago

Ok. I've discovered some erm, "interesting" DMA failure modes on the PCI/PCIe NICs when random crap is scribbled into the TXDP/link pointers.

SO it may actually be a bug in the firmware and host code.

Adrian

On 9 May 2013 23:02, KrasnikovEugene notifications@github.com wrote:

Nic does not recover from DMA issue. May be the problem is with my particular nic. I'll keep investigating the problem. Agree to trace those issues separately.

— Reply to this email directly or view it on GitHubhttps://github.com/qca/open-ath9k-htc-firmware/issues/25#issuecomment-17705660 .

KrasnikovEugene commented 11 years ago

I looked into registers and this is their description:

AR_DIAG_SW=0x02000020 Bit 5 and 25 is set. 5 bit means HALT_RX. 25 bit means FORCE_RX_ABORT.

AR_CR=0x00000024 Bit 5 and 2 is set. Bit 5 means Rx disable AR_CR_RXD. Bit 2 means RX enable AR_CR_RXE. DMADBG_7=0x00008040 There is no information about register DMADBG_7. Felix added some changes with this commit: https://patchwork.kernel.org/patch/694461/

Will contact Felix and try to get description of the register DMADBG_7 and idea behind the patch.

erikarn commented 11 years ago

HI,

the DIAG_SW contents is about right for trying to do a DMA abort.

AR_CR show that the RX engine didn't stop.

DMADBG_* are debug registers whose values change depending upon how you set up some debug register. They're a "window" into the DMA engine inside the MAC.

Adrian

On 15 May 2013 00:23, KrasnikovEugene notifications@github.com wrote:

I looked into registers and this is their description:

AR_DIAG_SW=0x02000020 Bit 5 and 25 is set. 5 bit means HALT_RX. 25 bit means FORCE_RX_ABORT.

AR_CR=0x00000024 Bit 5 and 2 is set. Bit 5 means Rx disable AR_CR_RXD. Bit 2 means RX enable AR_CR_RXE. DMADBG_7=0x00008040 There is no information about register DMADBG_7. Felix added some changes with this commit: https://patchwork.kernel.org/patch/694461/

Will contact Felix and try to get description of the register DMADBG_7 and idea behind the patch.

— Reply to this email directly or view it on GitHubhttps://github.com/qca/open-ath9k-htc-firmware/issues/25#issuecomment-17923696 .