Mesh interfaces occationally break on boot

seamustuohy commented 10 years ago

This intermittent bug does not cause any processes to fail but will stop the node from sending any mesh packets over the network. I believe that this is associated with many other bugs we have been seeing. This is most easily recreated by re-naming the mesh ssid in the basic user interface but does not seem to be associated with the name given or length. Though, with that said, and the rule of large numbers being what it is I have seen the first instance using a string that is greater than nine characters with a number as the last character and a dash in the middle. It is not required, nor does it always work... but, it works more consistently than others.

cc: @jheretic as this is most likely a issue with hotplug/netifd and the kernel as shown in logs further down.

This is most easily identified as broken by looking at httpinfo and checking the Destination Gateway. It will have all 0's on the netmask.

$ netcat localhost 2006
> hit return again to have it return the data

Here are some examples from various logs I have taken.

fakeUser@fakeTerm:~/temp$ grep -E -A 2 "Table\:\sHNA" *

8charwork001:Table: HNA
8charwork001-Destination    Gateway
8charwork001-10.138.149.0/24    100.120.138.149
--
8xbroke001:Table: HNA
8xbroke001-Destination  Gateway
8xbroke001-10.138.149.0/24  0.0.0.0
--
9charbroke001:Table: HNA
9charbroke001-Destination   Gateway
9charbroke001-10.138.149.0/24   0.0.0.0
--
9xbroke001:Table: HNA
9xbroke001-Destination  Gateway
9xbroke001-10.138.149.0/24  0.0.0.0
--
broken_sw001:Table: HNA
broken_sw001-Destination    Gateway
broken_sw001-10.138.149.0/24    0.0.0.0
--
debug_dnsmasq:Table: HNA
debug_dnsmasq-Destination   Gateway
debug_dnsmasq-10.138.149.0/24   0.0.0.0
--
debug_NC_B002:Table: HNA
debug_NC_B002-Destination   Gateway
debug_NC_B002-10.138.149.0/24   0.0.0.0
--
debug_NC_W001:Table: HNA
debug_NC_W001-Destination   Gateway
debug_NC_W001-10.138.149.0/24   100.120.138.149
--
debugW001:Table: HNA
debugW001-Destination   Gateway
debugW001-10.138.149.0/24   100.120.138.149

Best I can tell the kernel seems to be failing and not triggering a new scan to find a IBSS to join on instances where this occurs. See the "kern.info" messages in the "good" logfiles below. The bad logfiles are just the same area in the logfile. The first "BAD" logfile shows the last time that there are kernel messages in the bad logfiles. I assume this is where the kernel error actually exists. The final "GOOD" log section shows what I assume to be the set of commands that the kernel is missing in bas restarts.

BAD
===============
Jan 16 14:55:48 commotion daemon.info avahi-daemon[1566]: Withdrawing address record for fe80::a2f3:c1ff:fef8:8a95 on wlan0.
Jan 16 14:55:48 commotion kern.info kernel: [ 1371.490000] ADDRCONF(NETDEV_UP): wlan0-1: link is not ready
Jan 16 14:55:49 commotion kern.info kernel: [ 1372.710000] br-lan: port 2(wlan0) entered disabled state
Jan 16 14:55:50 commotion daemon.notice netifd: meshtest (10924): commotion.proto: Running protocol handler.
================

GOOD
================
Jan 16 14:53:22 commotion daemon.info avahi-daemon[1566]: Withdrawing address record for fe80::a2f3:c1ff:fef8:8a95 on wlan0.
Jan 16 14:53:22 commotion kern.info kernel: [ 1225.920000] ADDRCONF(NETDEV_UP): wlan0-1: link is not ready
Jan 16 14:53:23 commotion kern.info kernel: [ 1227.160000] br-lan: port 2(wlan0) entered disabled state
Jan 16 14:53:24 commotion kern.info kernel: [ 1227.270000] wlan0-1: Trigger new scan to find an IBSS to join
Jan 16 14:53:24 commotion daemon.notice netifd: meshtest (10006): commotion.proto: Running protocol handler.
================

BAD
=================
Jan 16 14:55:50 commotion daemon.notice netifd: meshtest (10924): commotion.proto: proto_add_dns_search: 
Jan 16 14:55:50 commotion user.notice commotion.proto: proto_add_dns_search: 
Jan 16 14:55:51 commotion daemon.notice netifd: meshtest (10924): meshtest(): Interface type not supported
Jan 16 14:55:51 commotion user.notice dnsmasq: DNS rebinding protection is active, will discard upstream RFC1918 responses!
Jan 16 14:55:51 commotion user.notice dnsmasq: Allowing 127.0.0.0/8 responses
Jan 16 14:55:51 commotion daemon.notice netifd: meshtest (10924): meshtest(): Interface type not supported
Jan 16 14:55:51 commotion daemon.notice netifd: meshtest (10924): commotion.proto: Sending update for meshtest
=================

GOOD
===============
Jan 16 14:53:25 commotion user.notice commotion.proto: proto_add_dns_search: 
Jan 16 14:53:26 commotion kern.info kernel: [ 1229.320000] wlan0-1: Trigger new scan to find an IBSS to join
Jan 16 14:53:26 commotion user.notice dnsmasq: DNS rebinding protection is active, will discard upstream RFC1918 responses!
Jan 16 14:53:26 commotion user.notice dnsmasq: Allowing 127.0.0.0/8 responses
Jan 16 14:53:26 commotion daemon.notice netifd: meshtest (10006): meshtest(): Interface type not supported
Jan 16 14:53:26 commotion daemon.notice netifd: meshtest (10006): meshtest(): Interface type not supported
Jan 16 14:53:26 commotion daemon.notice netifd: meshtest (10006): commotion.proto: Sending update for meshtest
==============

bad
=============
Jan 16 14:55:53 commotion user.notice commotion.hotplug.olsrd: meshed: 1
Jan 16 14:55:53 commotion user.notice commotion.hotplug.olsrd: announced: 0
Jan 16 14:55:55 commotion daemon.info dnsmasq[11222]: started, version 2.66 cachesize 150
=============

good
===============
Jan 16 14:53:27 commotion user.notice commotion.hotplug.olsrd: meshed: 1
Jan 16 14:53:27 commotion user.notice commotion.hotplug.olsrd: announced: 0
Jan 16 14:53:28 commotion kern.info kernel: [ 1231.370000] wlan0-1: Trigger new scan to find an IBSS to join
Jan 16 14:53:29 commotion daemon.info dnsmasq[10304]: started, version 2.66 cachesize 150
=================

BAD
Jan 16 14:55:55 commotion daemon.info dnsmasq-dhcp[11222]: read /etc/ethers - 0 addresses
Jan 16 14:55:56 commotion daemon.info olsrd[11240]: Writing '1' (was 1) to /proc/sys/net/ipv4/ip_forward

GOOD
Jan 16 14:53:29 commotion daemon.info dnsmasq-dhcp[10304]: read /etc/ethers - 0 addresses
Jan 16 14:53:30 commotion kern.info kernel: [ 1233.420000] wlan0-1: Trigger new scan to find an IBSS to join
Jan 16 14:53:30 commotion daemon.info olsrd[10318]: Writing '1' (was 1) to /proc/sys/net/ipv4/ip_forward

FINAL BAD
============
Jan 16 14:55:56 commotion daemon.info olsrd[11240]: Writing '0' (was 0) to /proc/sys/net/ipv4/conf/all/rp_filter
Jan 16 14:56:01 commotion daemon.info olsrd[11240]: olsr.org -  0.6.5.4-git_4c19cba-hash_3667acb4ad7e32204039db1f6b9bc660  - successfully started
Jan 16 14:56:01 commotion cron.info crond[729]: crond: USER root pid 11315 cmd /usr/bin/commotion-bigboard-send
Jan 16 14:56:01 commotion cron.info crond[729]: crond: USER root pid 11317 cmd /usr/sbin/ff_olsr_test_gw.sh
===========

FINAL GOOD !!!!!!
======================
Jan 16 14:53:30 commotion daemon.info olsrd[10318]: Writing '0' (was 0) to /proc/sys/net/ipv4/conf/all/rp_filter
Jan 16 14:53:31 commotion kern.info kernel: [ 1234.280000] wlan0-1: Creating new IBSS network, BSSID b2:d5:ac:6c:01:01
Jan 16 14:53:31 commotion kern.info kernel: [ 1234.290000] ADDRCONF(NETDEV_CHANGE): wlan0-1: link becomes ready
Jan 16 14:53:31 commotion kern.info kernel: [ 1234.500000] br-lan: port 2(wlan0) entered forwarding state
Jan 16 14:53:31 commotion kern.info kernel: [ 1234.510000] br-lan: port 2(wlan0) entered forwarding state
Jan 16 14:53:31 commotion kern.info kernel: [ 1234.520000] ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
Jan 16 14:53:32 commotion daemon.info avahi-daemon[1566]: Registering new address record for fe80::a0f3:c1ff:fef8:8a95 on wlan0-1.*.
Jan 16 14:53:33 commotion kern.info kernel: [ 1236.510000] br-lan: port 2(wlan0) entered forwarding state
Jan 16 14:53:35 commotion daemon.info olsrd[10318]: olsr.org -  0.6.5.4-git_4c19cba-hash_3667acb4ad7e32204039db1f6b9bc660  - successfully started
Jan 16 14:53:35 commotion daemon.info olsrd[10318]: Writing '0' (was 1) to /proc/sys/net/ipv4/conf/wlan0-1/send_redirects
Jan 16 14:53:35 commotion daemon.info olsrd[10318]: Writing '0' (was 0) to /proc/sys/net/ipv4/conf/wlan0-1/rp_filter
Jan 16 14:53:35 commotion daemon.info olsrd[10318]: Adding interface wlan0-1
Jan 16 14:53:35 commotion daemon.info olsrd[10318]: New main address: 100.120.138.149
Jan 16 14:54:01 commotion cron.info crond[729]: crond: USER root pid 10398 cmd /usr/bin/commotion-bigboard-send
Jan 16 14:54:01 commotion cron.info crond[729]: crond: USER root pid 10399 cmd /usr/sbin/ff_olsr_test_gw.sh
============================

jheretic commented 10 years ago

Can you take a look at dmesg during your test for any possible kernel panics from the wireless driver?

seamustuohy commented 10 years ago

There was this crazy stack trace that only existed in the broken one.... That what you are looking for? I have a bunch of dmesg logs from various tests if it is a loose end.

[  514.240000] br-lan: port 2(wlan0) entered forwarding state
[  514.780000] ADDRCONF(NETDEV_UP): wlan0-1: link is not ready
[  516.020000] ------------[ cut here ]------------
[  516.030000] WARNING: at /mnt/build_tree_1/commotion-router/openwrt/build_dir/linux-ar71xx_generic/compat-wireless-2013-06-27/net/mac80211/chan.c:218 0x8712638c()
[  516.040000] Modules linked in: ath79_wdt ledtrig_usbdev ledtrig_netdev ip6t_REJECT ip6t_LOG ip6t_rt ip6t_hbh ip6t_mh ip6t_ipv6header ip6t_frag ip6t_eui64 ip6t_ah ip6table_raw ip6_queue ip6table_mangle ip6table_filter ip6_tables nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp xt_HL xt_hl xt_ecn ipt_ECN xt_CLASSIFY xt_time xt_tcpmss xt_statistic xt_mark xt_length xt_DSCP xt_dscp xt_string xt_layer7 ipt_MASQUERADE iptable_nat nf_nat xt_recent xt_helper xt_connmark xt_connbytes pppoe xt_conntrack xt_CT xt_NOTRACK iptable_raw xt_state nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack pppox ipt_REJECT xt_TCPMSS ipt_LOG xt_comment xt_multiport xt_mac xt_limit iptable_mangle iptable_filter ip_tables xt_tcpudp x_tables ifb ipip tunnel4 ppp_async ppp_generic slhc ath9k(O) ath9k_common(O) ath9k_hw(O) ath(O) mac80211(O) ts_fsm ts_bm ts_kmp crc_ccitt ipv6 cfg80211(O) compat(O) arc4 aes_generic ohci_hcd ehci_hcd usbcore usb_common nls_base crypto_algapi ledtrig_timer ledtrig_default_on leds_gpio gpio_button_hotplug(O)
[  516.140000] Call Trace:[<80270b8c>] 0x80270b8c
[  516.140000] [<80270b8c>] 0x80270b8c
[  516.140000] [<80071a8c>] 0x80071a8c
[  516.150000] [<8712638c>] 0x8712638c
[  516.150000] [<80071ad0>] 0x80071ad0
[  516.150000] [<8712638c>] 0x8712638c
[  516.160000] [<87126884>] 0x87126884
[  516.160000] [<87121e48>] 0x87121e48
[  516.170000] [<870e37d4>] 0x870e37d4
[  516.170000] [<87126dcc>] 0x87126dcc
[  516.170000] [<87115b44>] 0x87115b44
[  516.180000] [<8705f958>] 0x8705f958
[  516.180000] [<8020ad7c>] 0x8020ad7c
[  516.180000] [<8009c688>] 0x8009c688
[  516.190000] [<8020abb0>] 0x8020abb0
[  516.190000] [<8020a18c>] 0x8020a18c
[  516.190000] [<8020aba0>] 0x8020aba0
[  516.200000] [<80209ae4>] 0x80209ae4
[  516.200000] [<801e04b8>] 0x801e04b8
[  516.210000] [<80209ed0>] 0x80209ed0
[  516.210000] [<802532fc>] 0x802532fc
[  516.210000] [<80207fa8>] 0x80207fa8
[  516.220000] [<8006316c>] 0x8006316c
[  516.220000] [<801d7c54>] 0x801d7c54
[  516.220000] [<801e4730>] 0x801e4730
[  516.230000] [<80252ee0>] 0x80252ee0
[  516.230000] [<800c2b1c>] 0x800c2b1c
[  516.230000] [<801e37fc>] 0x801e37fc
[  516.240000] [<800e5e94>] 0x800e5e94
[  516.240000] [<801d8c90>] 0x801d8c90
[  516.250000] [<800c557c>] 0x800c557c
[  516.250000] [<800c59d4>] 0x800c59d4
[  516.250000] [<8006c0c4>] 0x8006c0c4
[  516.260000] [<800e91e0>] 0x800e91e0
[  516.260000] [<800e96f4>] 0x800e96f4
[  516.260000] [<801da56c>] 0x801da56c
[  516.270000] [<80096ae0>] 0x80096ae0
[  516.270000] [<800e9f78>] 0x800e9f78
[  516.270000] [<800d70a4>] 0x800d70a4
[  516.280000] [<8006a284>] 0x8006a284
[  516.280000] 
[  516.280000] ---[ end trace 2415a5b7ab665c14 ]---

jheretic commented 10 years ago

Given the stack trace from dmesg, I believe this is the same issue with the driver freaking out when the radio's channel or other parameters are modified after the adhoc interface has been created (or possibly when an adhoc interface is present in combination with other VAPs). This has also possibly been a problem with the Linux client. I'm going to attempt to track down this bug once and for all, and either fix it or find out its exact triggers so that we can work around it effectively (I suspect that deleting and recreating the adhoc interface on any channel change may be an effective countermeasure).

anest commented 10 years ago

any progress yet? i just have same problem with code from trunk, can not move up, stuck. waiting for fix.

On Tue, Jan 21, 2014 at 7:01 AM, Josh King notifications@github.com wrote:

Given the stack trace from dmesg, I believe this is the same issue with the driver freaking out when the radio's channel or other parameters are modified after the adhoc interface has been created (or possibly when an adhoc interface is present in combination with other VAPs). This has also possibly been a problem with the Linux client. I'm going to attempt to track down this bug once and for all, and either fix it or find out its exact triggers so that we can work around it effectively (I suspect that deleting and recreating the adhoc interface on any channel change may be an effective countermeasure).

— Reply to this email directly or view it on GitHubhttps://github.com/opentechinstitute/commotion-router/issues/94#issuecomment-32892239 .

dismantl commented 10 years ago

@westbywest, have you seen anything like this before?

jheretic commented 10 years ago

Resolved in v1.1rc1.

opentechinstitute / commotion-router

Mesh interfaces occationally break on boot #94