openwrt / packages

Community maintained packages for OpenWrt. Documentation for submitting pull requests is in CONTRIBUTING.md
GNU General Public License v2.0
3.94k stars 3.46k forks source link

aiccu can't always start or restart #330

Closed duvi closed 9 years ago

duvi commented 10 years ago

Aiccu doesn't restart for me after the ipv4 wan connection reconnects. Sometimes it doesn't connect on boot because of the same reason. "Couldn't resolve host tic.sixxs.net"

Sat Sep 20 11:32:12 2014 daemon.info pppd[1656]: LCP terminated by peer (Connect time expired)
Sat Sep 20 11:32:12 2014 daemon.info pppd[1656]: Connect time 10080.0 minutes.
Sat Sep 20 11:32:12 2014 daemon.info pppd[1656]: Sent 3413231816 bytes, received 808504468 bytes.
Sat Sep 20 11:32:12 2014 daemon.err miniupnpd[2487]: Failed to get IP for interface pppoe-wan
Sat Sep 20 11:32:12 2014 daemon.warn miniupnpd[2487]: SendNATPMPPublicAddressChangeNotification: cannot get public IP address, stopping
Sat Sep 20 11:32:12 2014 daemon.notice netifd: Network device 'pppoe-wan' link is down
Sat Sep 20 11:32:12 2014 daemon.notice netifd: Interface 'wan' has lost the connection
Sat Sep 20 11:32:12 2014 daemon.notice netifd: Interface 'wan6' has lost the connection
Sat Sep 20 11:32:12 2014 daemon.notice netifd: Network device 'aiccu-wan6' link is down
Sat Sep 20 11:32:12 2014 user.info syslog: Usage:
        modprobe module
Sat Sep 20 11:32:12 2014 user.info syslog: Usage:
        modprobe module
Sat Sep 20 11:32:12 2014 local7.err syslog: Already running instance HUP'ed, exiting
Sat Sep 20 11:32:12 2014 daemon.notice netifd: Interface 'wan6' is now down
Sat Sep 20 11:32:12 2014 daemon.notice netifd: Interface 'wan6' is setting up now
Sat Sep 20 11:32:12 2014 user.info syslog: Usage:
        modprobe module
Sat Sep 20 11:32:12 2014 user.info syslog: Usage:
        modprobe module
Sat Sep 20 11:32:12 2014 local7.err syslog: Couldn't resolve host tic.sixxs.net, service 3874
Sat Sep 20 11:32:12 2014 local7.err syslog: Couldn't connect to the TIC server tic.sixxs.net
Sat Sep 20 11:32:12 2014 local7.err syslog: Couldn't retrieve first tunnel for the above reason, aborting
Sat Sep 20 11:32:12 2014 daemon.notice netifd: Interface 'wan6' is now down
Sat Sep 20 11:32:15 2014 daemon.notice pppd[1656]: Connection terminated.
Sat Sep 20 11:32:15 2014 daemon.notice pppd[1656]: Modem hangup
Sat Sep 20 11:32:16 2014 daemon.warn dnsmasq[1988]: no servers found in /tmp/resolv.conf.auto, will retry
Sat Sep 20 11:32:45 2014 daemon.info pppd[1656]: PPP session is 1065
Sat Sep 20 11:32:45 2014 daemon.warn pppd[1656]: Connected to 00:25:90:e5:5e:a9 via interface eth1.2
Sat Sep 20 11:32:45 2014 daemon.info pppd[1656]: Using interface pppoe-wan
Sat Sep 20 11:32:45 2014 daemon.notice pppd[1656]: Connect: pppoe-wan <--> eth1.2
Sat Sep 20 11:32:48 2014 daemon.notice pppd[1656]: PAP authentication succeeded
Sat Sep 20 11:32:48 2014 daemon.notice pppd[1656]: peer from calling number 00:25:90:E5:5E:A9 authorized
Sat Sep 20 11:32:48 2014 daemon.notice pppd[1656]: local  IP address x.x.x.x
Sat Sep 20 11:32:48 2014 daemon.notice pppd[1656]: remote IP address x.x.x.x
Sat Sep 20 11:32:48 2014 daemon.notice pppd[1656]: primary   DNS address x.x.x.x
Sat Sep 20 11:32:48 2014 daemon.notice pppd[1656]: secondary DNS address x.x.x.x
Sat Sep 20 11:32:48 2014 daemon.notice netifd: Network device 'pppoe-wan' link is up
Sat Sep 20 11:32:48 2014 daemon.notice netifd: Interface 'wan6' is setting up now
Sat Sep 20 11:32:48 2014 daemon.notice netifd: Interface 'wan' is now up
Sat Sep 20 11:32:48 2014 user.info syslog: Usage:
        modprobe module
Sat Sep 20 11:32:48 2014 user.info syslog: Usage:
        modprobe module
Sat Sep 20 11:32:48 2014 local7.err syslog: Couldn't resolve host tic.sixxs.net, service 3874
Sat Sep 20 11:32:48 2014 local7.err syslog: Couldn't connect to the TIC server tic.sixxs.net
Sat Sep 20 11:32:48 2014 local7.err syslog: Couldn't retrieve first tunnel for the above reason, aborting
Sat Sep 20 11:32:48 2014 user.notice firewall: Reloading firewall due to ifup of wan (pppoe-wan)
Sat Sep 20 11:32:50 2014 daemon.info dnsmasq[1988]: reading /tmp/resolv.conf.auto
Sat Sep 20 11:32:50 2014 daemon.info dnsmasq[1988]: using local addresses only for domain lan
Sat Sep 20 11:32:50 2014 daemon.info dnsmasq[1988]: using nameserver x.x.x.x#53
Sat Sep 20 11:32:50 2014 daemon.info dnsmasq[1988]: using nameserver x.x.x.x#53
Sat Sep 20 11:32:50 2014 daemon.notice miniupnpd[2487]: shutting down MiniUPnPd
Sat Sep 20 11:32:50 2014 daemon.notice miniupnpd[18333]: HTTP listening on port 5000
Sat Sep 20 11:32:50 2014 daemon.notice miniupnpd[18333]: HTTP IPv6 address given to control points : [2a01:368:xxxx::1]
Sat Sep 20 11:32:50 2014 daemon.notice miniupnpd[18333]: Listening for NAT-PMP/PCP traffic on port 5351
poranje commented 9 years ago

This issue seems equivalent to https://dev.openwrt.org/ticket/17744

luizluca commented 9 years ago

As aiccu is in packages, shouldn't the issue come to github and not openwrt trac?


 Luiz Angelo Daros de Luca, Me.
        luizluca@gmail.com

2014-10-23 7:49 GMT-02:00 Paul Oranje notifications@github.com:

This issue seems equivalent to https://dev.openwrt.org/ticket/17744

— Reply to this email directly or view it on GitHub https://github.com/openwrt/packages/issues/330#issuecomment-60216015.

oskar456 commented 9 years ago

OK maybe we should check that DNS resolution is working and that system time is NTP synced before starting aiccu binary. Any idea how to do that?

poranje commented 9 years ago

A work-around is provided in https://dev.openwrt.org/ticket/17744.

sbyx commented 9 years ago

See https://dev.openwrt.org/changeset/43421 @oskar456 you could add a hotplug script to /etc/hotplug.d/ntp now.

oskar456 commented 9 years ago

OK, what about a simple hotplug script that would create a file like /tmp/clock-is-synced and wait for this file during aiccu proto startup? Or is there any better solution I'm not aware of?

luizluca commented 9 years ago

@oskar456, the suggestion of a control file is very interesting. I would improve it to be something like /tmp/.ntp_last_sync, touched every time NTP syncs and absent otherwise. Also, it would fit better inside /usr/sbin/ntpd-hotplug script. If dedicated to aiccu, it would be better to be something like /tmp/.aiccu_ntp_last_sync.

However, I think that requiring NTP sync is not the ideal solution. I know that time sync is critical for aiccu but NTP is not required for it. It might be necessary for a router without RTC but it is not for a VM with clock synced with the host machine or any hardware with an updated RTC. So, I guess waiting for NTP sync before aiccu is not ideal. In the worst case, it should be a config option.

My suggestion is that you give aiccu a try when proto is started. If it connects, it is fine (the case when clock was synced by any means, including NTP). The problem is when it fails.

In case of failure, it might be transient one (pop network unreachable, clock sync, unknown reason) or permanent (blocked user, wrong configuration, pop down). The easy solution would be to simply retry to connect at a reasonably periodic time (which is explicitly prohibit by SIXXS and you might know it already).

"pop network unreachable" and "clock sync" problems are the one that we want to deal with. If you use an external NTP server, probably you'll have internet connectivity and an updated clock when NTP syncs. Also NTP client do retry at a reasonably periodic time and might sync when internet is up.

Another interesting behavior is that aiccu talks to the POP before it forks. So, it something went wrong, it returns exitcode != 0. The problem is that it is always -1 in case of any failure (it would be much better if the exit code could inform the reason).

So, the workaround I suggest is to use the NTP sync to give aiccu a second one-time-only try. SIXXS will not block your account because of only two connections in sequence.

It would be something like this:

ifup aiccu
   ntp_synced? = file exists /tmp/.aiccu_ntp_last_sync <- must be checked before aiccu was started
   aiccu start
   if failed
       if ntp_synced?
          give up <- current behavior in case of aiccu failure
       else
          touch /tmp/.aiccu_retry_on_ntp_sync
          <<<here it would be interesting to set the interface to some transient/pending/temporary-failure state>>>
          <<<but I'm no netifd expert to know if it is possible. Anyway, simply down might also work.>>>
          <<<The only problem is that the user might not know that it must issue ifdown aiccu in order>>>
          <<<to avoid the retry.>>>
          <<< I'm only not sure if proto_block_restart applies here>>>
       end
   end

ifdown aiccu
    <<< does netifd calls this even if interface is down? >>>
    rm -f /tmp/.aiccu_retry_on_ntp_sync <- avoids a restart on a failed aiccu
    kill aiccu

/etc/hotplug.d/ntp/50-aiccu
    touch /tmp/.aiccu_ntp_last_sync
    if file exists /tmp/.aiccu_retry_on_ntp_sync
        rm -f /tmp/.aiccu_retry_on_ntp_sync
        ifup aiccu <- this will run only once as /tmp/.aiccu_ntp_last_sync will not let a new /tmp/.aiccu_retry_on_ntp_sync be recreated.
    end

The result is that:

1) if ntp does not run, aiccu behavior is kept as it is today 2) if ntp sync before aiccu, aiccu behavior is kept as it is today 3) if ntp sync after aiccu failed and before someone runs ifdown aiccu, it retries aiccu only once

This is just an idea draft as it must be improved in order to deal with multiple aiccu interfaces but this worked for me (for AA).

oskar456 commented 9 years ago

@sbyx Why is there the limitation for stratum event in /usr/sbin/ntpd-hotplug?

[ "$1" = stratum ] || exit 0

I think it limits the functionality a lot. One could use for instance the periodic event to sync RTC every 11 minutes or the desync event to know that NTP failed.

oskar456 commented 9 years ago

@luizluca Thanks for your suggestions. But I think the aiccu binary should not be tried more than once by netifd. It's quite common to reach the SixXS connnection limit only by restarting the CPE, like with an frequent grid outage. I think there should two tests prior launching the aiccu binary:

  1. if the server is pingable (it would also test if DNS works correctly)
  2. if the system clock is synced

I agree that there are some minor cases when these checks are not necessary, I'm going to implement a new option like extra_checks that would default to true.

poranje commented 9 years ago

A bit of speculative reasoning.

The reason why SIXXS's rules does not allow for automatic recalling of AICCU on connection errors can be understood when one realizes that the aiccu tool does two things. First it collects the tunnel configuration from the server and secondly it establishes the tunnel based on that retrieved information. It seems logical that the first function, the retrieval of the tunnel config, should be executed only sporadically. The second function, making the tunnel, seems one that could be executed any time the connection fails for transient reasons.

The need for DNS probably only affects the first function and a non-synced clock likely prohibits the establishment of the tunnel; a non-synced clock should in this respect be treated as a non-transient condition (i.e. do not try to establish the tunnel as long as the clock is not synced).

It would be nice if retrieved tunnel configuration data would be cached for some time so that automatic calls of aiccu do not infringe on the SIXXS rules. But that would require making some changes to aiccu.

The condition whether the clock is synced seems something that must be tested by the init.d script before calling aiccu, because that is something that is very platform dependent.

poranje commented 9 years ago

When is aiccu needed for SIXXS? (use cases) The aiccu tool allows for several connection scenario's. For some of the aiccu may not be the best tool. It seems to me that acicu is the best tool only for those of us whose router runs behind a NAT that need AYIYA or that have a dynamically changing WAN IP address that requires heartbeats. The static tunnel scenario's is well served with the 6in4 proto.

luizluca commented 9 years ago

@oskar456, I don't really remember why I opted to ignore stratum events ;-). I guess I just ignored the periodic event it is a cron job, not ntpd one. Anyway, the hotplug client script could ignore it if needed.

Now about your patch: Maybe ping is not effective as normally ntp server is external and uses dns. When it syncs, it might already mean that network is OK. I would skip ping test. So, I would change it to something like "wait_for_ntp_sync" (extrachecks gives no tips of what it means if the user does not read a help/doc). Also, wait_for_ntp_sync would be off by default, which might not help newbie users.

Now about the reconnection, I still think that the two-try strategy might work with more cases (including default config for newbies), something like my previous draft pseudocode. If ntp wasn't synced, it might be because network is still down. So, aiccu first try will never reach tic server anyway. The second try, after ntp synced, will be the only one that really reach TIC. If there is still fear of "too frequent reconnection", just check the mtime of /tmp/.aiccu_retry_on_ntp_sync and add a sleep until it makes 5 min since last try. There will never be a third try because of ntp without manual intervention.

Does anyone know the sixxs tic server connection limit? I did something like 5 in a row without being blocked. This info would be nice in order to define a "safe" sleep time.

Everything would be better if sixxs accepted some level of reconnection, conditionally on an defined exit code. Something like:

0: connected 1-127: failed and do not try again if not fixed 1: wrong password 2: time not synced 128-255: failed but you might want to retry in x seconds 128: cannot resolv name 129: cannot reach tic server

aiccu has all the info it needed in order to detect each case. However, flames were too high in the discuss about reconnecting...

oskar456 commented 9 years ago

@luizluca Rewriting the aiccu binary to act more sanely, and integrating it with netifd directly, would be the best solution, however, I'm not going to do that :)

I agree that the ping check is probably redundant and I'm considering disabling it or changing it to an opt-in option (I would appreciate an opinion here). I would probably go for an option like waitforping with number of attempt as value, default 0.

As I've already reached the SixXS limit twice during development, I'm really against your two-try strategy. The problem is that the limit is not calculated in real-time, I've got banned few hours after I finished my development. But I've figured yet another solution. When starting, wait for NTP sync for configurable timeout and after syncing or after timeout, whichever comes first, run aiccu start unconditionaly. That way, it will work for most cases out of the box without any setup, but it will also work for some corner cases like other than Busybox NTPd or x86 VM with host synced clock, with only disadvantage in a short delay, which can be avoided by configuring the timeout to zero.

I think this is the best compromise solution. I would call that option ntpsynctimeout with an integer value in seconds and I would set default to something like 90.

oskar456 commented 9 years ago

Please see the updated pull request #579 now. I've made it much more simpler and it works for me as well.