Closed seamustuohy closed 10 years ago
Can you take a look at dmesg during your test for any possible kernel panics from the wireless driver?
There was this crazy stack trace that only showed up in the broken one... Is that what you are looking for? I have a bunch of dmesg logs from various tests if it is a loose end.
[ 514.240000] br-lan: port 2(wlan0) entered forwarding state
[ 514.780000] ADDRCONF(NETDEV_UP): wlan0-1: link is not ready
[ 516.020000] ------------[ cut here ]------------
[ 516.030000] WARNING: at /mnt/build_tree_1/commotion-router/openwrt/build_dir/linux-ar71xx_generic/compat-wireless-2013-06-27/net/mac80211/chan.c:218 0x8712638c()
[ 516.040000] Modules linked in: ath79_wdt ledtrig_usbdev ledtrig_netdev ip6t_REJECT ip6t_LOG ip6t_rt ip6t_hbh ip6t_mh ip6t_ipv6header ip6t_frag ip6t_eui64 ip6t_ah ip6table_raw ip6_queue ip6table_mangle ip6table_filter ip6_tables nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp xt_HL xt_hl xt_ecn ipt_ECN xt_CLASSIFY xt_time xt_tcpmss xt_statistic xt_mark xt_length xt_DSCP xt_dscp xt_string xt_layer7 ipt_MASQUERADE iptable_nat nf_nat xt_recent xt_helper xt_connmark xt_connbytes pppoe xt_conntrack xt_CT xt_NOTRACK iptable_raw xt_state nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack pppox ipt_REJECT xt_TCPMSS ipt_LOG xt_comment xt_multiport xt_mac xt_limit iptable_mangle iptable_filter ip_tables xt_tcpudp x_tables ifb ipip tunnel4 ppp_async ppp_generic slhc ath9k(O) ath9k_common(O) ath9k_hw(O) ath(O) mac80211(O) ts_fsm ts_bm ts_kmp crc_ccitt ipv6 cfg80211(O) compat(O) arc4 aes_generic ohci_hcd ehci_hcd usbcore usb_common nls_base crypto_algapi ledtrig_timer ledtrig_default_on leds_gpio gpio_button_hotplug(O)
[ 516.140000] Call Trace:[<80270b8c>] 0x80270b8c
[ 516.140000] [<80270b8c>] 0x80270b8c
[ 516.140000] [<80071a8c>] 0x80071a8c
[ 516.150000] [<8712638c>] 0x8712638c
[ 516.150000] [<80071ad0>] 0x80071ad0
[ 516.150000] [<8712638c>] 0x8712638c
[ 516.160000] [<87126884>] 0x87126884
[ 516.160000] [<87121e48>] 0x87121e48
[ 516.170000] [<870e37d4>] 0x870e37d4
[ 516.170000] [<87126dcc>] 0x87126dcc
[ 516.170000] [<87115b44>] 0x87115b44
[ 516.180000] [<8705f958>] 0x8705f958
[ 516.180000] [<8020ad7c>] 0x8020ad7c
[ 516.180000] [<8009c688>] 0x8009c688
[ 516.190000] [<8020abb0>] 0x8020abb0
[ 516.190000] [<8020a18c>] 0x8020a18c
[ 516.190000] [<8020aba0>] 0x8020aba0
[ 516.200000] [<80209ae4>] 0x80209ae4
[ 516.200000] [<801e04b8>] 0x801e04b8
[ 516.210000] [<80209ed0>] 0x80209ed0
[ 516.210000] [<802532fc>] 0x802532fc
[ 516.210000] [<80207fa8>] 0x80207fa8
[ 516.220000] [<8006316c>] 0x8006316c
[ 516.220000] [<801d7c54>] 0x801d7c54
[ 516.220000] [<801e4730>] 0x801e4730
[ 516.230000] [<80252ee0>] 0x80252ee0
[ 516.230000] [<800c2b1c>] 0x800c2b1c
[ 516.230000] [<801e37fc>] 0x801e37fc
[ 516.240000] [<800e5e94>] 0x800e5e94
[ 516.240000] [<801d8c90>] 0x801d8c90
[ 516.250000] [<800c557c>] 0x800c557c
[ 516.250000] [<800c59d4>] 0x800c59d4
[ 516.250000] [<8006c0c4>] 0x8006c0c4
[ 516.260000] [<800e91e0>] 0x800e91e0
[ 516.260000] [<800e96f4>] 0x800e96f4
[ 516.260000] [<801da56c>] 0x801da56c
[ 516.270000] [<80096ae0>] 0x80096ae0
[ 516.270000] [<800e9f78>] 0x800e9f78
[ 516.270000] [<800d70a4>] 0x800d70a4
[ 516.280000] [<8006a284>] 0x8006a284
[ 516.280000]
[ 516.280000] ---[ end trace 2415a5b7ab665c14 ]---
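With a bunch of dmesg logs from various tests on hand, a small script can check each one for this warning instead of reading them by eye. The following is only a minimal sketch: it assumes the logs use the same "cut here" / "WARNING: at" / "end trace" layout as the paste above, and the function names `find_warnings` and `is_chan_warning` are made up for illustration.

```python
import re

# Patterns modelled on the dmesg excerpt above (assumed to be representative).
CUT_HERE = re.compile(r"-+\[ cut here \]-+")
WARNING_AT = re.compile(r"WARNING: at (\S+):(\d+)")
END_TRACE = re.compile(r"---\[ end trace ([0-9a-f]+) \]---")


def find_warnings(log_text):
    """Return one dict per 'cut here' ... 'end trace' block found in a dmesg log."""
    blocks = []
    current = None
    for line in log_text.splitlines():
        if CUT_HERE.search(line):
            # Start of a new kernel warning block.
            current = {"source": None, "line": None, "trace_id": None}
            continue
        if current is None:
            continue  # Not inside a warning block; ignore the line.
        match = WARNING_AT.search(line)
        if match:
            current["source"] = match.group(1)
            current["line"] = int(match.group(2))
            continue
        match = END_TRACE.search(line)
        if match:
            current["trace_id"] = match.group(1)
            blocks.append(current)
            current = None
    return blocks


def is_chan_warning(block):
    """True if the warning was raised from mac80211's chan.c, as in the trace above."""
    return bool(block["source"]) and block["source"].endswith("net/mac80211/chan.c")
```

Running `find_warnings(open(path).read())` over each saved log and filtering with `is_chan_warning` would show which test runs hit the same mac80211 warning.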
Given the stack trace from dmesg, I believe this is the same issue with the driver freaking out when the radio's channel or other parameters are modified after the adhoc interface has been created (or possibly when an adhoc interface is present in combination with other VAPs). This has also possibly been a problem with the Linux client. I'm going to attempt to track down this bug once and for all, and either fix it or find out its exact triggers so that we can work around it effectively (I suspect that deleting and recreating the adhoc interface on any channel change may be an effective countermeasure).
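The suspected countermeasure (delete and recreate the adhoc interface on any channel change) could look roughly like the sketch below. It has not been tested against this bug. The `phy0`/`adhoc0` style names, the helper functions, and the exact command sequence are assumptions for illustration, and on a Commotion/OpenWrt node netifd normally manages these interfaces, so a real fix would have to be hooked in there rather than run by hand. It defaults to a dry run that only prints the `iw`/`ip` commands.

```python
import subprocess


def channel_to_freq_mhz(channel):
    """Map a 2.4 GHz channel number to its centre frequency in MHz."""
    if 1 <= channel <= 13:
        return 2407 + 5 * channel
    if channel == 14:
        return 2484
    raise ValueError("only 2.4 GHz channels 1-14 are handled in this sketch")


def recreate_adhoc_commands(phy, ifname, ssid, channel):
    """Build the commands that tear down and recreate an IBSS interface."""
    freq = channel_to_freq_mhz(channel)
    return [
        ["iw", "dev", ifname, "del"],
        ["iw", "phy", phy, "interface", "add", ifname, "type", "ibss"],
        ["ip", "link", "set", ifname, "up"],
        ["iw", "dev", ifname, "ibss", "join", ssid, str(freq)],
    ]


def recreate_adhoc(phy, ifname, ssid, channel, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    commands = recreate_adhoc_commands(phy, ifname, ssid, channel)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

For example, `recreate_adhoc("phy0", "adhoc0", "commotion-mesh", 6)` would print the four commands for a move to channel 6 without touching the radio; all three names in that call are hypothetical.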
Any progress yet? I have the same problem with code from trunk and am stuck; I can't move forward. Waiting for a fix.
@westbywest, have you seen anything like this before?
Resolved in v1.1rc1.
This intermittent bug does not cause any processes to fail, but it stops the node from sending any mesh packets over the network. I believe it is associated with many other bugs we have been seeing. It is most easily reproduced by renaming the mesh SSID in the basic user interface, but it does not seem to depend on the name given or its length. That said, the law of large numbers being what it is, I first saw it when using a string longer than nine characters with a number as the last character and a dash in the middle. That pattern is not required, nor does it always trigger the bug, but it triggers it more consistently than others.
cc: @jheretic, as this is most likely an issue with hotplug/netifd and the kernel, as shown in the logs further down.
A broken node is most easily identified by looking at httpinfo and checking the Destination Gateway: it will have all 0's in the netmask.
Here are some examples from various logs I have taken.
As best I can tell, the kernel seems to be failing and not triggering a new scan to find an IBSS to join when this occurs. See the "kern.info" messages in the "good" logfiles below. The bad logfiles show the same area of the log. The first "BAD" logfile shows the last time any kernel messages appear in the bad logfiles; I assume this is where the kernel error actually occurs. The final "GOOD" log section shows what I assume to be the set of commands that the kernel is missing in bad restarts.