opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License
3.28k stars 727 forks source link

WAN DHCPv4 gives up after few minutes and comes never again up #2244

Closed goor4 closed 5 years ago

goor4 commented 6 years ago

Situation:

When Cable-Modem lost link, it keeps rebooting and WAN-Ethernet-Link goes up and down. If the network-problem of the internet provider is repaired within some minutes, everything is ok.

But if the provider problem last more than ca 10 minutes, the dhcp-Client on WAN from OPNsense gives up and internet connection stays down until manual intervention/renew in WEB-GUI.

Here the chronology:

Cable Modem lost Sync, reboots

Mar 5 08:55:21 kernel: igb1: link state changed to DOWN Mar 5 08:55:21 configd.py: [7e29785b-20d7-416a-8f12-fa4bdfb85d41] Linkup stopping igb1 Mar 5 08:55:22 opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet detached event for wan Mar 5 08:55:22 opnsense: /usr/local/etc/rc.linkup: Clearing states to old gateway 109.91.104.1.

Cable Modem link comes up again

Mar 5 08:55:41 kernel: igb1: link state changed to UP Mar 5 08:55:41 configd.py: [434f0198-6edb-441d-a6c1-fa097bd702b0] Linkup starting igb1 Mar 5 08:55:42 opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet attached event for wan Mar 5 08:55:42 opnsense: /usr/local/etc/rc.linkup: HOTPLUG: Configuring interface wan Mar 5 08:55:46 opnsense: /usr/local/etc/rc.newwanip: On (IP address: 192.168.100.10) (interface: WAN[wan]) (real interface: igb1). Mar 5 08:55:46 opnsense: /usr/local/etc/rc.newwanip: IP renewal is starting on 'igb1'

OPNSense get's private IP, because Cable-Modem has no sync

Mar 5 08:55:47 opnsense: /usr/local/etc/rc.newwanip: ROUTING: setting IPv4 default route to 192.168.100.1

Mar 5 08:55:57 opnsense: /usr/local/etc/rc.newwanip: Aborted IP detection: Resolving timed out after 5689 milliseconds Mar 5 08:57:04 opnsense: /usr/local/etc/rc.dyndns: Aborted IP detection: Resolving timed out after 5628 milliseconds Mar 5 08:57:05 opnsense: /usr/local/etc/rc.newwanip: Aborted IP detection: Resolving timed out after 5630 milliseconds Mar 5 08:57:41 configd_ctl.py: error in configd communication Traceback (most recent call last): File "/usr/local/opnsense/service/configd_ctl.py", line 65, in exec_config_cmd line = sock.recv(65536) timeout: timed out

Next Reboot of Cablemodem and get's private IP

Mar 5 08:58:11 opnsense: /usr/local/etc/rc.newwanip: IP renewal is starting on 'igb1' Mar 5 08:58:11 opnsense: /usr/local/etc/rc.newwanip: On (IP address: 192.168.100.10) (interface: WAN[wan]) (real interface: igb1).

PROBLEM: DHCP-Renew gives up:

Mar 5 08:58:51 kernel: igb1: link state changed to UP Mar 5 08:58:51 configd.py: [d1ad4663-1ccd-4997-a240-7c4d9777e122] Linkup starting i Mar 5 08:58:51 opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet attached event for wan Mar 5 08:58:51 opnsense: /usr/local/etc/rc.linkup: HOTPLUG: Configuring interface wan Mar 5 08:58:52 opnsense: /usr/local/etc/rc.linkup: The command '/sbin/dhclient -c /var/etc/dhclient_wan.conf igb1 > /tmp/igb1_output 2> /tmp/igb1_error_output' returned exit code '1', the output was '' Mar 5 08:59:23 configd.py: unable to sendback response [OK ] for [interface][linkup][['start', 'igb1']] {434f0198-6edb-441d-a6c1-fa097bd702b0}, message was Traceback (most recent call last): File "/usr/local/opnsense/service/modules/processhandler.py", line 202, in run self.connection.sendall('%s\n' % result) File "/usr/local/lib/python2.7/socket.py", line 228, in meth return getattr(self._sock,name)(*args) error: [Errno 32] Broken pipe

ERROR:

Cable-Modem and WAN-Ethernetlink was multiple times putted down and up

OPNsense will do no more dhcp-client, internet keeps down

Manual Renew with via OPNsense Web-Interface:

Interfaces -> Overview -> WAN:

DHCP was down, klick on "renew"

Mar 5 19:41:20 opnsense: /index.php: Successful login for user 'root' from: 192.168.33.162 Mar 5 19:42:02 opnsense: /usr/local/etc/rc.newwanip: IP renewal is starting on 'igb1' Mar 5 19:42:02 opnsense: /usr/local/etc/rc.newwanip: On (IP address: 109.91.xx.yy) (interface: WAN[wan]) (real interface: igb1). ...

Internet now works

fichtner commented 6 years ago

You can avoid taking leases from the modem adding 192.168.100.1 ? to the DHCP configuration.

The underlying issue, however, is that once the link is up and dhcp times out there is no activity on the link when it comes back, so you would have to rely on an active monitoring of that line to bring it back. You should be able to do this via system: settings: gateway, enabling the gateway monitoring.

goor4 commented 6 years ago

The problem is, that the ethernet-link physically goes down, e.g. because of modem power down. If the WAN-etnernet-link goes up again, OPNsense should then request via DHCP allways a new IP. If it does, my problem is solved.

It seems, that OPNsense doesn't see that ethernet-link-up event or react after 10 minutes error.

fichtner commented 6 years ago

It seems, that OPNsense doesn't see that ethernet-link-up event or react after 10 minutes error.

My guess is the modem does not toggle the link state several minutes in when the line is up and running again and/or giving out a lease time that is far too long for dhclient to renegotiate after a brief outage.

goor4 commented 6 years ago

The Modem does toggle the link state often in 10 hours. At least around 19:30 I switched manually of the power of on the entire modem for minutes. The ethernet-link was definite down for ca. a minute at that time and i switched power again on, OPNsense doesn't react:

fichtner commented 6 years ago

Does it work if you unplug the cable?

goor4 commented 6 years ago

No, Unplugging was also not detected. I even powered down the whole floor of the modem by fuse.

goor4 commented 6 years ago

Is it possible, that a bug in igb-driver (Intel i211-AT) with EEE (Energy Efficient Ethernet) is the source of that problem? Here is a good workaround for that problem (in german): https://www.thomas-krenn.com/de/wiki/OPNsense_igb_EEE_Funktion_deaktivieren

It only affects OPNsense 17.1.9 to 17.7.

Yesterday i had those modem-reboots again with OPNsense 18.1.6 and the router could detect the Link-up and down correctly and get's a IP-adress again:

Apr 10 18:54:56         kernel: igb1: link state changed to DOWN
Apr 10 18:55:09         kernel: igb1: link state changed to UP
...
Apr 10 18:55:15 opnsense: /usr/local/etc/rc.newwanip: IP renewal is starting
      on 'igb1'
Apr 10 18:55:15 opnsense: /usr/local/etc/rc.newwanip: On
      (IP address: 123.123.123.123) (interface: WAN[wan]) (real interface: igb1).
goor4 commented 6 years ago

The issue has happened again on Apr 30:

Apr 30 14:03:45     kernel: igb1: link state changed to UP
...
Apr 30 14:03:46     opnsense: /usr/local/etc/rc.linkup: The command 
                        '/sbin/dhclient -c /var/etc/dhclient_wan.conf igb1 > 
                        /tmp/igb1_output 2> /tmp/igb1_error_output' returned
                        exit code '1', the output was ''
...
Apr 30 14:03:50     opnsense: /usr/local/etc/rc.dyndns: Dynamic DNS 
                        (xxx.yyyy.org) There was an error trying to determine 
                        the public IP for interface - wan(igb1). Probably interface is
                        not a WAN interface.

After that, the link went down and up again, but that event was noch detected by OPNsense 18.1.6-amd64.

So i implemented the following workarount from Thomas-Krenn by disabling EEE (Energy Efficient Ethernet) on all interfaces:

goor4 commented 6 years ago

Problem appeared again in actual OPNsense 18.1.6-amd64.

fichtner commented 6 years ago

See https://github.com/opnsense/core/issues/2372 ... you can try:

# opnsense-patch 07746e3
goor4 commented 6 years ago

Thank you fichtner. I don't apply your patch for now, because i will test, if disabling EEE in the NIC solves the problem. If the problem occurs again, i will enable EEE again and install your patch.

So we can clearly locate the issue.

fichtner commented 6 years ago

Sounds good, thanks! :)

fichtner commented 6 years ago

Patch is included in 18.1.7, but needs a reboot to apply in your case.

goor4 commented 6 years ago

ok, then I flip it:

  1. I deleted tunables dev.igb.0.eee_disabled etc.
  2. I upgraded to 18.1.7 (including opnsense-patch 07746e3) & rebooted

If the problem occours again, I will disable EEE again and report it here.

fichtner commented 6 years ago

thanks 👍

goor4 commented 6 years ago

The connection problem of the cable modem occurred again and DHCP-Client of OPNsense now self-healed without manual intervention in OPNsense.

Summary:

  1. System -> Settings -> Tunables "dev.igb.0.eee_disabled" is not necessary
  2. In OPNsense 18.1.10 the problem seems to be fixed, see ticket #2372 for more details
goor4 commented 6 years ago

The issue is back.

Today OPNsense 18.7.2-amd64 didn't get a dhcp-address again after modem Link-Up/Down.

One possible error message was: opnsense: /usr/local/etc/rc.linkup: The command '/sbin/dhclient -c '/var/etc/dhclient_wan.conf' -p '/var/run/dhclient.igb1.pid' 'igb1'' returned exit code '15', the output was 'dhclient: PREINIT dhclient: Starting delete_old_states() dhclient: Comparing IPs: Old: <public-ip> New: dhclient: Removing states from old IP '<public-ip>' (new IP '') 0 states cleared killed 0 src nodes from 1 sources and 0 destinations DHCPDISCOVER on igb1 to <public-ip> port 67 interval 11 DHCPDISCOVER on igb1 to 255.255.255.255 port 67 interval 24 DHCPOFFER from 10.198.192.1 DHCPREQUEST on igb1 to 255.255.255.255 port 67 DHCPACK from 10.198.192.1'

goor4 commented 5 years ago

The issue is still there:

Therefore it is no issue with EEE on Intel 210 but a dhcp-client issue.

Here are the current commented logs:

--- dhcp-client get's a public ip-adress an internet-link is now up
Oct 29 22:24:18     opnsense: /status_interfaces.php: ROUTING: skipping IPv6 default route
Oct 29 22:24:18     opnsense: /status_interfaces.php: ROUTING: keeping current default gateway '109.91.x.x'
Oct 29 22:24:18     opnsense: /status_interfaces.php: ROUTING: setting IPv4 default route to 109.91.104.1
Oct 29 22:24:18     opnsense: /status_interfaces.php: ROUTING: no IPv6 default gateway set, assuming wan
Oct 29 22:24:18     opnsense: /status_interfaces.php: ROUTING: IPv4 default gateway set to wan
Oct 29 22:24:18     opnsense: /status_interfaces.php: ROUTING: entering configure using 'wan'
Oct 29 22:24:18     opnsense: /status_interfaces.php: Accept router advertisements on interface igb1
Oct 29 22:24:06     opnsense: /usr/local/etc/rc.newwanip: ROUTING: skipping IPv6 default route
Oct 29 22:24:06     opnsense: /usr/local/etc/rc.newwanip: ROUTING: keeping current default gateway '109.91.x.x'
Oct 29 22:24:06     opnsense: /usr/local/etc/rc.newwanip: ROUTING: setting IPv4 default route to 109.91.104.1
Oct 29 22:24:06     opnsense: /usr/local/etc/rc.newwanip: ROUTING: no IPv6 default gateway set, assuming wan
Oct 29 22:24:06     opnsense: /usr/local/etc/rc.newwanip: ROUTING: IPv4 default gateway set to wan
Oct 29 22:24:06     opnsense: /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'wan'
Oct 29 22:24:06     opnsense: /usr/local/etc/rc.newwanip: On (IP address: 109.91.x.x) (interface: WAN[wan]) (real interface: igb1).
Oct 29 22:24:06     opnsense: /usr/local/etc/rc.newwanip: IP renewal is starting on 'igb1'
--- Starting manual(!!) dhcp-Renew via WebGUI (Interfaces -> Overwiew; WAN, Reload)
  (no automatic DHCP-Client renew for 13 hours)
Oct 29 09:29:08     opnsense: /usr/local/etc/rc.dyndns: Aborted IPv4 detection: no address for igb1
...
Oct 29 09:28:58     opnsense: /usr/local/etc/rc.newwanip: Aborted IPv4 detection: Resolving timed out after 5002 milliseconds
Oct 29 09:28:56     opnsense: /usr/local/etc/rc.linkup: ROUTING: skipping IPv6 default route
Oct 29 09:28:56     opnsense: /usr/local/etc/rc.linkup: ROUTING: skipping IPv4 default route
Oct 29 09:28:56     opnsense: /usr/local/etc/rc.linkup: ROUTING: no IPv6 default gateway set, assuming wan
Oct 29 09:28:56     opnsense: /usr/local/etc/rc.linkup: ROUTING: IPv4 default gateway set to wan
Oct 29 09:28:56     opnsense: /usr/local/etc/rc.linkup: ROUTING: entering configure using 'wan'
Oct 29 09:28:56     opnsense: /usr/local/etc/rc.linkup: Accept router advertisements on interface igb1
Oct 29 09:28:56     opnsense: /usr/local/etc/rc.linkup: The command '/sbin/dhclient -c '/var/etc/dhclient_wan.conf' -p '/var/run/dhclient.igb1.pid' 'igb1'' returned exit code '1', the output was 'dhclient already running, pid: 60793. exiting.'
--- dhclient should request a new public IP-Adress and replace it with the
--- local address 192.168.100.1
--- but dhclient exits and the local adress stays the same until manual renew!
Oct 29 09:28:56     opnsense: /usr/local/etc/rc.linkup: HOTPLUG: Configuring interface wan
Oct 29 09:28:56     opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet attached event for wan
Oct 29 09:28:55     opnsense: /usr/local/etc/rc.linkup: Clearing states to old gateway 192.168.100.1.
Oct 29 09:28:55     opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet detached event for wan
Oct 29 09:28:55     configd_ctl.py: error in configd communication Traceback (most recent call last): File "/usr/local/opnsense/service/configd_ctl.py", line 65, in exec_config_cmd line = sock.recv(65536) timeout: timed out
Oct 29 09:28:37     kernel: arpresolve: can't allocate llinfo for 62.143.130.1 on igb1
Oct 29 09:28:36     kernel: igb1: link state changed to UP
---- Interface wents down and up, but no dhcp renew
Oct 29 09:28:23     kernel: igb1: link state changed to DOWN
Oct 29 09:28:23     opnsense: /usr/local/etc/rc.newwanip: ROUTING: skipping IPv6 default route
Oct 29 09:28:23     opnsense: /usr/local/etc/rc.newwanip: ROUTING: keeping current default gateway '192.168.100.1'
Oct 29 09:28:23     opnsense: /usr/local/etc/rc.newwanip: ROUTING: setting IPv4 default route to 192.168.100.1
Oct 29 09:28:23     opnsense: /usr/local/etc/rc.newwanip: ROUTING: no IPv6 default gateway set, assuming wan
Oct 29 09:28:23     opnsense: /usr/local/etc/rc.newwanip: ROUTING: IPv4 default gateway set to wan
Oct 29 09:28:23     opnsense: /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'wan'
Oct 29 09:28:22     opnsense: /usr/local/etc/rc.newwanip: On (IP address: 192.168.100.10) (interface: WAN[wan]) (real interface: igb1).
Oct 29 09:28:22     opnsense: /usr/local/etc/rc.newwanip: IP renewal is starting on 'igb1'
Oct 29 09:28:19     opnsense: /usr/local/etc/rc.linkup: ROUTING: skipping IPv6 default route
Oct 29 09:28:19     opnsense: /usr/local/etc/rc.linkup: The command '/sbin/route add -'inet' default '192.168.100.1'' returned exit code '1', the output was 'add net default: gateway 192.168.100.1 fib 0: route already in table'
Oct 29 09:28:19     opnsense: /usr/local/etc/rc.linkup: ROUTING: creating /tmp/igb1_defaultgw using '192.168.100.1'
Oct 29 09:28:19     opnsense: /usr/local/etc/rc.linkup: ROUTING: setting IPv4 default route to 192.168.100.1
Oct 29 09:28:19     opnsense: /usr/local/etc/rc.linkup: ROUTING: no IPv6 default gateway set, assuming wan
Oct 29 09:28:19     opnsense: /usr/local/etc/rc.linkup: ROUTING: IPv4 default gateway set to wan
Oct 29 09:28:19     opnsense: /usr/local/etc/rc.linkup: ROUTING: entering configure using 'wan'
Oct 29 09:28:18     opnsense: /usr/local/etc/rc.linkup: Accept router advertisements on interface igb1
...
Oct 29 09:27:11     opnsense: /usr/local/etc/rc.newwanip: ROUTING: IPv4 default gateway set to wan
Oct 29 09:27:11     opnsense: /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'wan'
Oct 29 09:27:10     opnsense: /usr/local/etc/rc.newwanip: On (IP address: 192.168.100.10) (interface: WAN[wan]) (real interface: igb1).
Oct 29 09:27:10     opnsense: /usr/local/etc/rc.newwanip: IP renewal is starting on 'igb1'
Oct 29 09:26:55     opnsense: /usr/local/etc/rc.linkup: HOTPLUG: Configuring interface wan
Oct 29 09:26:55     opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet attached event for wan
Oct 29 09:26:54     kernel: igb1: link state changed to UP
--- Modem reboots, ethernet-link goes up, because modem has no sync, it give a
--- private IPv4 in the Range 192.168.100.0/24, to keep the Modem Webinterface
--- at 192.168.100.1 accessible
Oct 29 09:26:45     opnsense: /usr/local/etc/rc.dyndns: Curl error occurred: Resolving timed out after 15002 milliseconds
Oct 29 09:26:44     configd_ctl.py: error in configd communication Traceback (most recent call last): File "/usr/local/opnsense/service/configd_ctl.py", line 65, in exec_config_cmd line = sock.recv(65536) timeout: timed out
Oct 29 09:26:36     opnsense: /usr/local/etc/rc.linkup: Clearing states to old gateway 62.143.130.1.
Oct 29 09:26:35     opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet detached event for wan
Oct 29 09:26:35     kernel: igb1: link state changed to DOWN
Oct 29 09:26:18     opnsense: /usr/local/etc/rc.dyndns: Curl error occurred: Resolving timed out after 15005 milliseconds
--- Cable-Modem lost sync, ethernet-link went down 
Oct 29 06:50:27     opnsense: /usr/local/etc/rc.newwanip: On (IP address: 62.143.x.x) (interface: WAN[wan]) (real interface: igb1).
--- dhcp-client gets it's last public IPv4 and has internet access
hirschferkel commented 5 years ago

Maybe I suffer the same issue? https://forum.opnsense.org/index.php?topic=10175.0

And I have no idea how to work around...

goor4 commented 5 years ago

Thank you hirschferkel.

Comment on: https://forum.opnsense.org/index.php?topic=10175.0

I think, there was more than one issue with that.

I think, that the remaining problem has to do with the following Log-Line:

Oct 29 09:28:56     opnsense: /usr/local/etc/rc.linkup: The command '/sbin/dhclient -c '/var/etc/dhclient_wan.conf' -p '/var/run/dhclient.igb1.pid' 'igb1'' returned exit code '1', the output was 'dhclient already running, pid: 60793. exiting.'

Should stop the dhclient-process on Link-Down and it was still running on my system as link goes up?

goor4 commented 5 years ago

DHCP-Error occured again on OPNsense 18.7.8 at 2018-12-12 10:30

The Log-Message was:

opnsense: /usr/local/etc/rc.configure_interface: 
The command '/sbin/dhclient -c '/var/etc/dhclient_wan.conf' -p '/var/run/dhclient.igb1.pid' 'igb1''
 returned exit code '15', 
the output was 'dhclient: PREINIT dhclient: 
Starting delete_old_states() dhclient: Comparing IPs: Old: 109.91.123.123 
New: dhclient: Removing states from old IP '109.91.123.123' (new IP '') 
0 states cleared killed 0 src nodes from 1 sources and 0 destinations 
DHCPREQUEST on igb1 to 255.255.255.255 port 67 DHCPNAK from 192.168.100.1 
DHCPDISCOVER on igb1 to 255.255.255.255 port 67 interval 13 DHCPOFFER from 192.168.100.1 
DHCPREQUEST on igb1 to 255.255.255.255 port 67 DHCPACK from 192.168.100.1'
fichtner commented 5 years ago

I have no idea what exit code '15' is and where it comes from. It really can't be ENOTBLK and according to the output it gets an ACK from 192.168.100.1 but gets stuck afterwards.

fichtner commented 5 years ago

We should probably close this in favour of a feature that will be able to handle this as I don't see how this can be fixed if dhclient just exits: https://github.com/opnsense/core/issues/2517

hirschferkel commented 5 years ago

In my case the problem was in connection with a Fritz!box in front. Despite the fact that all ports were open and dhcp was enabled, I always got stuck with OPNsens. After I just switched to bridged mode an Fritz!box everything run perfectly again...+ Best, hirschferkel

goor4 commented 5 years ago

I think, that those frequent toggling of the WAN-Interface with different dhcp-leases (local and public IP-addresses) or disappear from etnernet-interface damages dhclient ?

goor4 commented 5 years ago

The feature in #2517 looks interesting, since interface up/down then events are then irrelevant. Also if you don't get any up/down-Events because of active network in between like switch or media converter ... (this is not my case, but ...)

fichtner commented 5 years ago

Closing this in favour of implementing #2517 as time permits.